Add decision about the HA model to be used with the s3gw with Longhorn
Related to: https://github.com/aquarist-labs/s3gw/issues/361
Signed-off-by: Giuseppe Baccini <[email protected]>
Giuseppe Baccini committed Oct 16, 2023
1 parent fb989e1 commit 78e0239
Showing 1 changed file with 39 additions and 0 deletions.
39 changes: 39 additions & 0 deletions docs/decisions/0018-s3gw-ha-model.md
@@ -0,0 +1,39 @@
# s3gw High Availability model

## Context and Problem Statement

We analyzed several High Availability (HA) concepts as they apply to the s3gw when used with Longhorn.
The aim of this research is to identify an HA model we can reasonably rely on.

The full HA research conducted so far is available in the
[High Availability research](https://github.com/aquarist-labs/s3gw/pull/685) pull request.
It contains the rationale, the motivations and the details of the tests performed.

## Considered Options

We identified 3 HA models:

- **Active/Active** (multiple s3gw instances concurrently serving the same data)
- **Active/Warm Standby** (multiple s3gw instances, one serving data, others able to take over if the active instance fails)
- **Active/Standby** (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)

## Decision Outcome

The three models above differ in performance and in implementation effort.
For our use case, the *Active/Standby* model built on top of Longhorn makes
the most sense: it offers the best HA characteristics relative to the effort of
implementing a more fully active/distributed solution.

*Desirable* characteristics of the *Active/Standby* model:

- Simplicity
- Low implementation effort compared with the other models
- Expected to work mainly with Kubernetes primitives
- Compatible with RWO persistent volume semantics
- Acceptable restart times on switch-over and fail-over (excluding non-graceful node failure)
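
As an illustration only, the characteristics above could map onto plain Kubernetes primitives roughly as follows. This is a minimal sketch, not the project's actual chart: the resource names, image reference, port, mount path and volume size are assumptions, and `longhorn` is assumed to be the name of the Longhorn storage class in the cluster.

```yaml
# Hypothetical sketch of the Active/Standby model; every name, the image,
# the port, the mount path and the size below are illustrative assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3gw-data                 # assumed name
spec:
  accessModes:
    - ReadWriteOnce               # RWO: the volume is mounted by a single node at a time
  storageClassName: longhorn      # assumed Longhorn storage class name
  resources:
    requests:
      storage: 10Gi               # assumed size
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3gw                      # assumed name
spec:
  replicas: 1                     # single instance; Kubernetes restarts or reschedules it on failure
  strategy:
    type: Recreate                # stop the old pod before starting a new one, so two pods never contend for the RWO volume
  selector:
    matchLabels:
      app: s3gw
  template:
    metadata:
      labels:
        app: s3gw
    spec:
      containers:
        - name: s3gw
          image: registry.example/s3gw:latest   # placeholder image reference
          ports:
            - containerPort: 7480               # assumed service port
          volumeMounts:
            - name: data
              mountPath: /data                  # assumed data directory
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: s3gw-data
```

With a single replica and the `Recreate` strategy, recovery relies entirely on the Deployment controller and the scheduler, which is what keeps the implementation effort low.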

Be aware that the [non-graceful node failure](https://github.com/aquarist-labs/s3gw/blob/4af657c573ce634cd16c53c20986e54817077b44/docs/research/ha/RATIONALE.md#non-graceful-node-failure) problem cannot be entirely solved with the *Active/Standby* model alone.
To track this, we have opened a [dedicated issue](https://github.com/longhorn/longhorn/issues/6803) in the Longhorn project.

For a more comprehensive explanation of this choice, please refer to the
[High Availability research](https://github.com/aquarist-labs/s3gw/pull/685) pull request.
