diff --git a/docs/research/ha/RATIONALE.md b/docs/research/ha/RATIONALE.md new file mode 100644 index 00000000..e4967a04 --- /dev/null +++ b/docs/research/ha/RATIONALE.md @@ -0,0 +1,446 @@ +# High Availability with s3gw + +- [High Availability with s3gw](#high-availability-with-s3gw) + - [Active-Active pipelines](#active-active-pipelines) + - [Active-Passive pipelines](#active-passive-pipelines) + - [One Active pipeline](#one-active-pipeline) + - [Investigation's Rationale](#investigations-rationale) + - [Failure cases](#failure-cases) + - [radosgw's POD failure](#radosgws-pod-failure) + - [Cluster's node failure](#clusters-node-failure) + - [radosgw's failure due to a bug not related to a certain input pattern](#radosgws-failure-due-to-a-bug-not-related-to-a-certain-input-pattern) + - [radosgw's failure due to a bug related to a certain input pattern](#radosgws-failure-due-to-a-bug-related-to-a-certain-input-pattern) + - [PV Data corruption at application level due to radosgw's bug](#pv-data-corruption-at-application-level-due-to-radosgws-bug) + - [PV Data corruption at application level due to radosgw's anomalous exit](#pv-data-corruption-at-application-level-due-to-radosgws-anomalous-exit) + - [Measuring s3gw failures on Kubernetes](#measuring-s3gw-failures-on-kubernetes) + - [The s3gw Probe](#the-s3gw-probe) + - [Tested scenarios](#tested-scenarios) + - [EXIT-1, 10 measures](#exit-1-10-measures) + - [EXIT-0, 10 measures](#exit-0-10-measures) + - [Considerations on the tests performed](#considerations-on-the-tests-performed) + +We want to investigate what *High Availability* - HA - means for a project like +the s3gw. + +If we identify the meaning of HA as the ability to have N independent pipelines +so that the failure of some of these does not affect the user's operations, +this is something that could be not easy to achieve with the s3gw. + +With pipeline we mean all the chain from the ingress to the `PV`: + +```mermaid +flowchart LR + ingress(ingress) ==> radosgw(radosgw) ==> PV((PV)) +``` + +The difficulty with a project like the s3gw is due to the fact that basically +we have one process: the `radosgw`, that has the *exclusive* access to one resource: +the Kubernetes `PV` where the S3 buckets and objects are persisted. + +Speaking in theory, 3 HA models are possible with s3gw: + +1. **Active-Active** pipelines +2. **Active-Passive** pipelines +3. **One Active** pipeline (no Passive pipelines preloaded) + +## Active-Active pipelines + +This is the model that guarantees the best HA characteristics. +The *Active-Active* model must implement *true independent pipelines* +where all the pieces are replicated. + +The immediate consequence of this statement is that every pipeline being part of +the same logical s3gw service must bind to a different `PV`; one per `radosgw` process. +All the `PV`s for the same logical s3gw, must therefore, sync their data. + +This implies that with this model, the s3gw must implement an efficient replication +model across all the pipelines being part of the same logical service. + +Under these conditions, a client accessing the S3 service would experience a failure +only for *the active operation* that were happening on the failed s3gw pipeline. + +```mermaid +flowchart LR + client(client) ==> ingress + ingress(ingress) ==> radosgw(radosgw) ==> PV((PV)) + ingress ==> radosgw2(radosgw) ==> PV2((PV)) + + linkStyle 0 stroke: red + linkStyle 1 stroke: red + linkStyle 2 stroke: red +``` + +The subsequent access to the S3 service would result in a successful operation +because the load balancer would immediately reroute to another active pipeline. + +```mermaid +flowchart LR + client(client) ==> ingress + ingress(ingress) ==> radosgw(radosgw) ==> PV((PV)) + ingress ==> radosgw2(radosgw) ==> PV2((PV)) + + linkStyle 0 stroke: green + linkStyle 1 stroke: red + linkStyle 2 stroke: red + linkStyle 3 stroke: green + linkStyle 4 stroke: green +``` + +This model is potentially hugely expensive for the s3gw project +because there isn't an implementation for the data replication with the SFS backend. + +## Active-Passive pipelines + +This model assumes N pipelines to be "allocated" on the cluster at the same time, +but only one, the *active* pipeline, owns the *exclusive ownership* +over the shared `PV`. + +Should the active pipeline suffer a failure, the next elected active pipeline, +chosen to replace the old one, should pay a *time window* necessary to "transfer" +the ownership over the shared `PV`. + +This means that, unlike the *Active-Active* solution, it is not guaranteed that +the subsequent access made by a client after a failed one is always successful. + +It can happen that a client issues an access to the S3 service while the system is +translating the passive pipeline into the active state; in this case, the issued +access is expected to fail. + +```mermaid +flowchart LR + client(client) ==> ingress + ingress(ingress) ==> radosgw(radosgw) ==> PV((PV)) + ingress ==> radosgw2(radosgw) + + linkStyle 0 stroke: red + linkStyle 1 stroke: red + linkStyle 2 stroke: red +``` + +```mermaid +flowchart LR + client(client) ==> ingress + ingress(ingress) ==> radosgw(radosgw) + ingress ==> radosgw2(radosgw) == ownership transfer ==> PV((PV)) + + linkStyle 0 stroke: red + linkStyle 1 stroke: white + linkStyle 2 stroke: red +``` + +```mermaid +flowchart LR + client(client) ==> ingress + ingress(ingress) ==> radosgw(radosgw) + ingress ==> radosgw2(radosgw) ==> PV((PV)) + + linkStyle 0 stroke: green + linkStyle 1 stroke: white + linkStyle 2 stroke: green + linkStyle 3 stroke: green +``` + +## One Active pipeline + +This model is logically equivalent to the *Active-Passive* model with the difference +that only one pipeline is allocated on the cluster at a time. + +In the *One pipeline* model, the only notable perceived difference between +the *Active-Passive* model are the worst performances obtained in the event of failure. + +In this scenario, the system should pay also the time needed to fully load a new +pipeline, that unlike that *Active-Passive* model, is not preloaded. + +So, supposing that the new pipeline has scheduled to load on a node where the +s3gw image is not cached, the system should pay the time needed to download +the image from the registry before starting it. + +## Investigation's Rationale + +The 3 models described above have different performances and different implementation +efforts. While the *Active-Active* model is expected to require a significant +development effort due to its inherent complex nature, the *Passive-Active* model +seems to offer a lower degree of implementation difficulty. + +In particular, the *One Active pipeline* model, is expected to work with nothing +but the primitives normally available on any Kubernetes cluster. + +Given that the backend s3gw architecture is composed by: + +- one *stateless* `radosgw` process associated with a *stateful* `PV` + +it is supposed that, on `radosgw` process failure, a simple reschedule +of its POD is enough to fulfill this simple HA model. +A clarification is needed here: the `PV` *state* is guaranteed by a third party +service, that is: **Longhorn**. + +Longhorn ensures that a `PV` (along with its content) is always available on +any cluster's node, so that, a POD can mount it regardless of its allocation on the +cluster. + +Relying on this assumption, while we are still in the process to decide what model +we want to actually engage, the *One Active pipeline* model seems for sure +the easiest thing to explore and to investigate. + +## Failure cases + +The `PV` state is kept coherent by Longhorn, so errors at this level are assumed +NOT possible (application level corruptions to the `PV`'s state ARE possible). + +What are the failure cases that can happen for s3gw? + +Let's examine the following scenarios: + +1. `radosgw`'s POD failure or rescheduling +2. Cluster's node failure +3. `radosgw`'s failure due to a bug not related to a certain input pattern +4. `radosgw`'s failure due to a bug related to a certain input pattern +5. `PV` Data corruption at application level due to `radosgw`'s bug +6. `PV` Data corruption at application level due to `radosgw`'s anomalous exit + +We are supposing all these bugs or conditions to be fatal for the s3gw's process +so that they trigger an anomalous exit. + +### radosgw's POD failure + +This case is when the `radosgw` process stops due to a failure of its POD. +This case applies also when Kubernetes decides to reschedule the POD +to another node, eg: when the node goes low on resources. + +This can be thought as an infrastructure issue independent to the s3gw. +In this case, the *One Active pipeline* model fully restores the service by +rescheduling a new POD somewhere in the cluster. + +### Cluster's node failure + +This case is when the `radosgw` process stops due to a cluster's node failure. +The *One Active pipeline* model fully restores the service by +rescheduling a new POD somewhere in the cluster. + +### radosgw's failure due to a bug not related to a certain input pattern + +This case is when the `radosgw` process crashes due to a bug not directly +related to any input type. + +Examples: + +- Memory leaks +- Memory corruptions (stack or heap corruptions) +- Periodic operations or routines not related to an input (GC, measures, + telemetry, etc) + +For a bug like this, the *One Active pipeline* model could guarantee +the user's operations *until the next occurrence* of the same malfunctioning. + +A definitive solution would be available only when a patch for the issue +has released. + +### radosgw's failure due to a bug related to a certain input pattern + +This case is when the `radosgw` process crashes due to a bug directly +related to a certain input type. + +Examples: + +- Putting Buckets with some name's pattern +- Putting Objects that have a certain size +- Performing an admin operation over a suspended user + +For a bug like this, the *One Active pipeline* model could guarantee +the user's operations under the condition that the crash-triggering +input is recognized by the user and thus its submission blocked. + +### PV Data corruption at application level due to radosgw's bug + +This case is when the state on the `PV` corrupts due to a `radosgw` +bug. + +In this unfortunate scenario, the *One Active pipeline* can hardly help. +A restart could evenly fix the problem or trigger an endless restarting loop. + +The fix for this could contemplate an human intervention. +A definitive solution would be available only when a patch for the issue +is available. + +### PV Data corruption at application level due to radosgw's anomalous exit + +This case is when the state on the PV corrupts due to a `radosgw` +anomalous exit, eg: after a node failure. + +This scenario could even be NOT possible because of the safety features +implemented on the radosgw SFS backend. + +The fix for this could contemplate an human intervention. + +## Measuring s3gw failures on Kubernetes + +After reviewing the cases, we can say that what can be actually solved with +an HA model with s3gw is when the failure is not dependent to applicative bugs. +We can handle temporary issues involving the infrastructure that is hosting +the `radosgw` process. + +So, we are mostly interested in observing and measuring the timing of +Kubernetes when a `radosgw`'s POD is being rescheduled. + +Hence, The idea to collect measures regarding a series of restarts +artificially triggered on the `radosgw`'s POD. +Obtaining such measures would allow to compute some arbitrary statistics +for the time required by Kubernetes to restart the `radosgw`'s POD. + +## The s3gw Probe + +The `s3gw Probe` is a program developed with the purpose of collecting restart +events coming from the `radosgw` process. +The tool is acting as a client/server service inside the Kubernetes cluster. + +- It acts as client vs the `radosgw` process requesting it to die. +- It acts as server of `radosgw` process collecting its `death` and `start` events. +- It acts as server of the user's client accepting configurations of restart + scenarios to be triggered against the `radosgw` process. +- It acts as server of the user's client returning statistics over the collected + data. + +In nutshell: + +- The `s3gw Probe` can be instructed to trigger `radosgw` restarts + with a RESTful call. +- The `s3gw Probe` can be queried for statistics over the collected data. + +The usage sequence for the s3gw Probe is the following: + +- The user instructs the tool + +```mermaid +flowchart LR + user[curl] == setup & trigger restarts ==> probe(s3gw probe) + + linkStyle 0 stroke: green +``` + +- The tool performs the `die request` cycle collecting the `death` and `start` + events from the `radosgw` process. + +```mermaid +flowchart LR + probe(s3gw probe) == die request #1 ==> radosgw(radosgw) + == death notice #1 ==> probe + + linkStyle 0 stroke: red + linkStyle 1 stroke: blue +``` + +```mermaid +flowchart LR + radosgw(radosgw) == start notice #1 ==> probe(s3gw probe) + probe(s3gw probe) == collect restart event #1 ==> probe(s3gw probe) + + linkStyle 0 stroke: green + linkStyle 1 stroke: blue +``` + +- The user queries the tool for the statistics + +```mermaid +flowchart LR + user[curl] == query statistics ==> probe(s3gw probe) + == statistics ==> user + + linkStyle 0 stroke: green + linkStyle 1 stroke: blue +``` + +## Tested scenarios + +As previously said, we want to compute some statistics regarding the Kubernetes +performances when restarting the `radosgw`'s POD. +The `radosgw` code has been patched to accept a REST call from a client +where the user can specify the way the `radosgw` will exit. +Currently, 3 modes are possible: + +- `EXIT_0` +- `EXIT_1` +- `CORE_BY_SEG_FAULT` + +### EXIT-1, 10 measures + +```yaml + - mark: exit1 + series: + - restart_id: 1 + duration: 0 + - restart_id: 2 + duration: 15 + - restart_id: 3 + duration: 24 + - restart_id: 4 + duration: 42 + - restart_id: 5 + duration: 91 + - restart_id: 6 + duration: 174 + - restart_id: 7 + duration: 307 + - restart_id: 8 + duration: 302 + - restart_id: 9 + duration: 302 + - restart_id: 10 + duration: 304 + time_unit: s +``` + +

chatterbox-logo +

+ +### EXIT-0, 10 measures + +```yaml + - mark: exit0 + series: + - restart_id: 1 + duration: 1 + - restart_id: 2 + duration: 13 + - restart_id: 3 + duration: 25 + - restart_id: 4 + duration: 49 + - restart_id: 5 + duration: 91 + - restart_id: 6 + duration: 161 + - restart_id: 7 + duration: 302 + - restart_id: 8 + duration: 304 + - restart_id: 9 + duration: 305 + - restart_id: 10 + duration: 308 +time_unit: s +``` + +

chatterbox-logo +

+ +### Considerations on the tests performed + +While the Kubernetes's behavior for `EXIT-1` was the expected one, the writer +was expecting a different behavior for the `EXIT-0` test. + +It is reasonable for Kubernetes to put a POD whose process exits with a code +different than zero into `CrashLoopBackoff` state; it is less obvious that the +same state is applied also for a POD whose process exits "normally". +A clarification: Kubernetes puts a POD exited with `0` into `Completed` state, +but if that POD is rescheduled, the state becomes `CrashLoopBackoff` and thus +the behavioral schema is the same as if the process run by the POD exited +abnormally. + +About this behavior, there is actually an opened request to make the +[CrashLoopBackoff timing tuneable](https://github.com/kubernetes/kubernetes/issues/57291), +at least for the cases when the process exits with zero. + +Anyway, this behavior limits the number of measures we can collect and thus is +preventing us to compute decent statistics on restart timings. +Currently, we can say that a `radosgw` POD restarting on the same node is taking +~1 second. diff --git a/docs/research/ha/assets/1node-10restarts-exit0.png b/docs/research/ha/assets/1node-10restarts-exit0.png new file mode 100644 index 00000000..34e73e70 Binary files /dev/null and b/docs/research/ha/assets/1node-10restarts-exit0.png differ diff --git a/docs/research/ha/assets/1node-10restarts-exit1.png b/docs/research/ha/assets/1node-10restarts-exit1.png new file mode 100644 index 00000000..b7ff7eae Binary files /dev/null and b/docs/research/ha/assets/1node-10restarts-exit1.png differ