
Add decision about the HA model to be used with the s3gw with Longhorn #733

Merged
merged 1 commit on Oct 25, 2023

Conversation

giubacc

@giubacc giubacc commented Oct 2, 2023

Describe your changes

Add decision about the HA model to be used with the s3gw with Longhorn

Signed-off-by: Giuseppe Baccini [email protected]

Issue ticket number and link

Related to: https://github.com/aquarist-labs/s3gw/issues/361

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • CHANGELOG.md has been updated should there be relevant changes in this PR.

@giubacc giubacc self-assigned this Oct 2, 2023
@giubacc giubacc added the area/kubernetes and kind/research labels Oct 2, 2023
Contributor

@tserong tserong left a comment

I know the full details are in the research docs, but I wonder if we should add a very brief sentence to provide a little more detail about each mode? Maybe something like this (assuming I've managed to capture the essence):

  • Active/Active (multiple s3gw instances concurrently serving the same data)
  • Active/Warm Standby (multiple s3gw instances, one serving data, others able to take over if active instance fails)
  • Active/Standby (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)

Otherwise, LGTM. I assume we'll update the PR links later to point to the research docs once that PR is merged.

@giubacc
Author

giubacc commented Oct 2, 2023

> I know the full details are in the research docs, but I wonder if we should add a very brief sentence to provide a little more detail about each mode? Maybe something like this (assuming I've managed to capture the essence):
>
>   • Active/Active (multiple s3gw instances concurrently serving the same data)
>   • Active/Warm Standby (multiple s3gw instances, one serving data, others able to take over if active instance fails)
>   • Active/Standby (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)
>
> Otherwise, LGTM. I assume we'll update the PR links later to point to the research docs once that PR is merged.

That makes perfect sense, I will change it. Moreover, pointing to the PR is probably not ideal; it will make more sense to point to the HA document directly once the corresponding PR is merged.
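For illustration, here is a minimal sketch of the Active/Standby shape discussed above, expressed with the Kubernetes Python client. All names, the image reference, and the PVC are hypothetical placeholders rather than the actual s3gw chart: the idea is a single-replica Deployment with a Recreate strategy, so at most one s3gw pod ever attaches the RWO Longhorn volume, and Kubernetes restarts or reschedules it on failure.

```python
from kubernetes import client, config

# Placeholder names throughout: "s3gw", "s3gw-data" PVC, image reference.
def active_standby_deployment() -> client.V1Deployment:
    labels = {"app": "s3gw"}
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name="s3gw", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=1,  # a single active instance; no warm standby pods
            strategy=client.V1DeploymentStrategy(type="Recreate"),  # never two pods attached to the RWO volume
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="s3gw",
                            image="quay.io/s3gw/s3gw:latest",  # placeholder image reference
                            volume_mounts=[client.V1VolumeMount(name="data", mount_path="/data")],
                        )
                    ],
                    volumes=[
                        client.V1Volume(
                            name="data",
                            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                                claim_name="s3gw-data"  # RWO Longhorn-backed PVC, assumed to exist
                            ),
                        )
                    ],
                ),
            ),
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()
    client.AppsV1Api().create_namespaced_deployment(namespace="s3gw", body=active_standby_deployment())
```

In this shape the "standby" is Kubernetes itself: on pod or node failure it recreates the single replica, which is why the restart and crash-consistency timings discussed later in this thread matter.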

docs/decisions/0015-s3gw-ha-model.md (outdated review thread, resolved)
docs/decisions/0015-s3gw-ha-model.md (outdated review thread, resolved)

The 3 HA models have different performances and different implementation efforts.
For our use case, the *Active/Standby* model built on top of Longhorn actually makes
the most sense and brings the "best" HA characteristics relative to implementing a
Contributor

It would be nice to have a small bullet-point list of what the perceived "best" HA characteristics are.

The final aim of the research is to identify an HA model we can use with the s3gw.

The full HA research work conducted until now has been proposed with the
[High Availability research](https://github.com/aquarist-labs/s3gw/pull/685) Pull request.
Contributor

either "Pull Request" or "pull request". I'm partial towards the latter.

Author

For the time being I will go with all lowercase; anyway, I'm not sure we should point to the PR; perhaps it's better to link the HA document once it is merged?

@giubacc giubacc force-pushed the s3gw-ha-model-adr branch from 8f917d6 to 4ad0cc2 on October 2, 2023 13:21
@jecluis jecluis requested a review from l-mb October 2, 2023 14:23
@l-mb

l-mb commented Oct 5, 2023

A key part that we also need to address in our stack is the ingress layer, since we can hide many problems behind a proxy that stalls the clients, and the time it takes to connect clients to a new backend affects our timings as well.

@giubacc
Author

giubacc commented Oct 5, 2023

> A key part that we also need to address in our stack is the ingress layer, since we can hide many problems behind a proxy that stalls the clients, and the time it takes to connect clients to a new backend affects our timings as well.

Currently, with Traefik, we get a "NotFound: Not Found\n\tstatus code: 404, request id: , host id: " error when the s3gw backend service is down.
Would it be better to address Ingress-specific investigations in a separate thread?
Maybe we can state in this ADR that the Ingress part needs a dedicated investigation effort, and reserve that for a separate issue/epic?

@irq0
Contributor

irq0 commented Oct 5, 2023

> Currently, with Traefik, we get a "NotFound: Not Found\n\tstatus code: 404, request id: , host id: " error when the s3gw backend service is down.

That is not useful - clients won't retry a 404. 503 Service Unavailable would be a good option, as clients like boto3 retry it (see https://github.com/boto/boto3/blob/develop/docs/source/guide/retries.rst).
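As a small, hedged illustration of that retry behaviour (the endpoint is a placeholder; only the retry configuration matters): with boto3's standard retry mode, transient 5xx responses such as 503 are retried with backoff, while a 404 is surfaced to the caller immediately.

```python
import boto3
from botocore.config import Config

# Placeholder endpoint; credentials come from the usual AWS config/env mechanisms.
s3 = boto3.client(
    "s3",
    endpoint_url="http://s3gw.local",
    config=Config(retries={"mode": "standard", "max_attempts": 5}),
)

# If the ingress answers 503 while the backend restarts, boto3 retries the call
# (up to max_attempts) before raising; a 404 fails on the first attempt.
print(s3.list_buckets()["Buckets"])
```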

@l-mb

l-mb commented Oct 5, 2023

Basically, if we push the ingress out - just like node failure recovery, or pre-loading the images - the actual ADR boils down to "s3gw needs to be able to be crash-consistent and start up fast after such an outage, because our active/cold standby model relies on the framework around us to achieve HA."

Then, this ADR would set a worst-case recovery goal (crashed under heavy load, say) of <1-5s for the s3gw restart part of the full stack.

That's fair (and mostly true), but then we need to immediately start on the next LEP to discuss how to achieve HA goals (RTO < 30s or some such goal) end-to-end. (And how to measure/validate those numbers as part of our E2E system testing, perhaps using k6?)
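A rough sketch of how such an end-to-end recovery-time measurement could look outside k6 (the endpoint, thresholds, and probe are assumptions, not an agreed test): poll the gateway after a simulated failure and record how long it stays unavailable.

```python
import time
import requests  # plain HTTP probe; a real test might issue an S3 request instead

ENDPOINT = "http://s3gw.local"  # placeholder ingress/service address
RTO_GOAL = 30.0                 # end-to-end goal floated above, in seconds

def measure_downtime(timeout: float = 120.0, interval: float = 0.1) -> float:
    """Return the seconds until the gateway answers again after a simulated crash."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            # Anything below 500 means the request reached a live backend.
            if requests.get(ENDPOINT, timeout=1).status_code < 500:
                return time.monotonic() - start
        except requests.RequestException:
            pass  # connection refused/reset while the pod restarts
        time.sleep(interval)
    raise TimeoutError("gateway did not recover within the timeout")

if __name__ == "__main__":
    downtime = measure_downtime()
    print(f"observed downtime: {downtime:.2f}s (goal: < {RTO_GOAL}s)")
```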

@giubacc
Author

giubacc commented Oct 5, 2023

> Basically, if we push the ingress out - just like node failure recovery, or pre-loading the images - the actual ADR boils down to "s3gw needs to be able to be crash-consistent and start up fast after such an outage, because our active/cold standby model relies on the framework around us to achieve HA."
>
> Then, this ADR would set a worst-case recovery goal (crashed under heavy load, say) of <1-5s for the s3gw restart part of the full stack.
>
> That's fair (and mostly true), but then we need to immediately start on the next LEP to discuss how to achieve HA goals (RTO < 30s or some such goal) end-to-end. (And how to measure/validate those numbers as part of our E2E system testing, perhaps using k6?)

I have the impression that the activity on the ingress side (testing, researching modes that fit us, etc.) needs a proper time allocation at the management level. That is something that must be agreed upon, especially now that we are more committed to integrating with Longhorn.

@jecluis
Contributor

jecluis commented Oct 11, 2023

@l-mb is there anything else you'd like to see done here?

docs/decisions/0015-s3gw-ha-model.md (outdated review thread, resolved)
@giubacc
Author

giubacc commented Oct 25, 2023

@jecluis @l-mb ping

- Compatible with RWO persistent volume semantics
- Acceptable restart timings on switch-overs and fail-overs (excluding the non-graceful node failure)

Be aware that the [non-graceful node failure](https://github.com/aquarist-labs/s3gw/blob/4af657c573ce634cd16c53c20986e54817077b44/docs/research/ha/RATIONALE.md#non-graceful-node-failure) problem cannot be entirely solved with the *Active/Standby* model alone.
Contributor

no idea how this is passing linting. Regardless, mind fixing this line to a proper column size? Maybe move the link to a ref at the end of the document to mitigate the noise?

Author

I've instead linked the now-available HA document directly via its natural project path.
That is much shorter and definitely looks better.

@jecluis jecluis merged commit b0f122e into s3gw-tech:main Oct 25, 2023
2 checks passed