From 3de36c82af85480fd5accfcb0de168aa905ecc98 Mon Sep 17 00:00:00 2001
From: Giuseppe Baccini
Date: Mon, 2 Oct 2023 11:50:28 +0200
Subject: [PATCH] Add decision about the HA model to be used with the s3gw with Longhorn

Related to: https://github.com/aquarist-labs/s3gw/issues/361

Signed-off-by: Giuseppe Baccini
---
 docs/decisions/0018-s3gw-ha-model.md | 41 ++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 docs/decisions/0018-s3gw-ha-model.md

diff --git a/docs/decisions/0018-s3gw-ha-model.md b/docs/decisions/0018-s3gw-ha-model.md
new file mode 100644
index 00000000..62c12577
--- /dev/null
+++ b/docs/decisions/0018-s3gw-ha-model.md
@@ -0,0 +1,41 @@
+# s3gw High Availability model
+
+## Context and Problem Statement
+
+We analyzed several High Availability (HA) concepts applied to the s3gw when used with Longhorn.
+The final aim of the research is to identify an HA model we can reasonably rely on.
+
+The full initial HA research document can be found here:
+[High Availability research](../research/ha/RATIONALE.md).
+It contains the rationale, the motivations, and the details of the tests performed.
+
+## Considered Options
+
+We identified 3 HA models:
+
+- **Active/Active** (multiple s3gw instances concurrently serving the same data)
+- **Active/Warm Standby** (multiple s3gw instances, one serving data, others able to take over if the active instance fails)
+- **Active/Standby** (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)
+
+## Decision Outcome
+
+The 3 models above differ in performance and in implementation effort.
+For our use case, the *Active/Standby* model built on top of Longhorn makes
+the most sense and brings the "best" HA characteristics relative to implementing a
+more fully active/distributed solution.
+
+*Desirable* characteristics of the *Active/Standby* model:
+
+- Simplicity
+- Low implementation effort compared with the other models
+- Expected to work mainly with Kubernetes primitives
+- Compatible with RWO persistent volume semantics
+- Acceptable restart timings on switch-overs and fail-overs (excluding non-graceful node failure)
+
+Be aware that the [non-graceful node failure](../research/ha/RATIONALE.md#non-graceful-node-failure)
+problem cannot be entirely solved with the *Active/Standby* model alone.
+Regarding this, we have opened a [dedicated issue](https://github.com/longhorn/longhorn/issues/6803)
+within the Longhorn project.
+
+For a more comprehensive explanation of this choice, please refer to the original
+[High Availability research](https://github.com/aquarist-labs/s3gw/pull/685) pull request.