Add decision about the HA model to be used with the s3gw with Longhorn #733

Merged
merged 1 commit into from
Oct 25, 2023
41 changes: 41 additions & 0 deletions docs/decisions/0018-s3gw-ha-model.md
@@ -0,0 +1,41 @@
# s3gw High Availability model

## Context and Problem Statement

We analyzed several High Availability (HA) concepts as they apply to s3gw when used with Longhorn.
The aim of this research is to identify an HA model we can reasonably rely on.

The full initial HA research document is available here:
[High Availability research](../research/ha/RATIONALE.md).
It contains the rationale, the motivations and the details of the tests performed.

## Considered Options

We identified 3 HA models:

- **Active/Active** (multiple s3gw instances concurrently serving the same data)
- **Active/Warm Standby** (multiple s3gw instances, one serving data, others able to take over if active instance fails)
- **Active/Standby** (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)

## Decision Outcome

The three models above differ in performance and in implementation effort.
For our use case, the *Active/Standby* model built on top of Longhorn makes
the most sense: it offers the best HA characteristics relative to the effort of
implementing a more fully active/distributed solution.

*Desirable* characteristics of the *Active/Standby* model:

- Simplicity
- Lower implementation effort than the other models
- Expected to work mainly with Kubernetes primitives
- Compatible with RWO persistent volume semantics
- Acceptable restart times on switch-over and fail-over (excluding the non-graceful node failure case)
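
As an illustration of how these characteristics map onto Kubernetes primitives, the following is a minimal sketch of an *Active/Standby* layout: a single-replica Deployment backed by an RWO volume provisioned by Longhorn. It is not the actual s3gw chart. The resource names, the image reference, the mount path and the volume size are hypothetical placeholders, and the `longhorn` StorageClass name is assumed to be the one created by a default Longhorn installation.

```yaml
# Sketch only: names, image, mount path and size are illustrative assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3gw-data            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce          # RWO: the volume is attached to one node at a time
  storageClassName: longhorn # assumed default Longhorn StorageClass name
  resources:
    requests:
      storage: 10Gi          # arbitrary example size
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3gw                 # hypothetical name
spec:
  replicas: 1                # single active instance; Kubernetes restarts or reschedules it on failure
  strategy:
    type: Recreate           # terminate the old pod before starting a new one, so only one pod uses the RWO volume
  selector:
    matchLabels:
      app: s3gw
  template:
    metadata:
      labels:
        app: s3gw
    spec:
      containers:
        - name: s3gw
          image: registry.example/s3gw:latest  # placeholder image reference
          volumeMounts:
            - name: data
              mountPath: /data                 # hypothetical mount path
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: s3gw-data
```

With this layout, the "standby" is not a second running instance. It is the ability of Kubernetes to recreate the single pod and reattach the same volume, which is why restart time is the main availability cost.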

Be aware that the [non-graceful node failure](../research/ha/RATIONALE.md#non-graceful-node-failure)
problem cannot be entirely solved with the *Active/Standby* model alone.
Regarding this, we have opened a [dedicated issue](https://github.com/longhorn/longhorn/issues/6803)
within the Longhorn project.

For a more comprehensive explanation about this choice, please refer to the original
[High Availability research](https://github.com/aquarist-labs/s3gw/pull/685) pull request.