
Add decision about the HA model to be used with the s3gw with Longhorn #733

Merged
merged 1 commit on Oct 25, 2023

Conversation

giubacc

@giubacc giubacc commented Oct 2, 2023

Describe your changes

Add decision about the HA model to be used with the s3gw with Longhorn

Signed-off-by: Giuseppe Baccini [email protected]

Issue ticket number and link

Related to: https://github.com/aquarist-labs/s3gw/issues/361

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • CHANGELOG.md has been updated should there be relevant changes in this PR.

@giubacc giubacc self-assigned this Oct 2, 2023
@giubacc giubacc added the area/kubernetes and kind/research labels Oct 2, 2023
Contributor

@tserong tserong left a comment

I know the full details are in the research docs, but I wonder if we should add a very brief sentence to provide a little more detail about each mode? Maybe something like this (assuming I've managed to capture the essence):

  • Active/Active (multiple s3gw instances concurrently serving the same data)
  • Active/Warm Standby (multiple s3gw instances, one serving data, others able to take over if active instance fails)
  • Active/Standby (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)

Otherwise, LGTM. I assume we'll update the PR links later to point to the research docs once that PR is merged.

@giubacc
Author

giubacc commented Oct 2, 2023

> I know the full details are in the research docs, but I wonder if we should add a very brief sentence to provide a little more detail about each mode? Maybe something like this (assuming I've managed to capture the essence):
>
>   • Active/Active (multiple s3gw instances concurrently serving the same data)
>   • Active/Warm Standby (multiple s3gw instances, one serving data, others able to take over if active instance fails)
>   • Active/Standby (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)
>
> Otherwise, LGTM. I assume we'll update the PR links later to point to the research docs once that PR is merged.

That makes perfect sense, I will change it. Moreover, pointing to the PR is probably not ideal; it will make more sense to point to the HA document directly once the corresponding PR is merged.
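For illustration, here is a minimal sketch of the Active/Standby shape discussed above, expressed with the Kubernetes Python client. All names, the image reference, and the PVC are hypothetical placeholders rather than the actual s3gw chart: the idea is a single-replica Deployment with a Recreate strategy, so at most one s3gw pod ever attaches the RWO Longhorn volume, and Kubernetes restarts or reschedules it on failure.

```python
from kubernetes import client, config

# Placeholder names throughout: "s3gw", "s3gw-data" PVC, image reference.
def active_standby_deployment() -> client.V1Deployment:
    labels = {"app": "s3gw"}
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name="s3gw", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=1,  # a single active instance; no warm standby pods
            strategy=client.V1DeploymentStrategy(type="Recreate"),  # never two pods attached to the RWO volume
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="s3gw",
                            image="quay.io/s3gw/s3gw:latest",  # placeholder image reference
                            volume_mounts=[client.V1VolumeMount(name="data", mount_path="/data")],
                        )
                    ],
                    volumes=[
                        client.V1Volume(
                            name="data",
                            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                                claim_name="s3gw-data"  # RWO Longhorn-backed PVC, assumed to exist
                            ),
                        )
                    ],
                ),
            ),
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()
    client.AppsV1Api().create_namespaced_deployment(namespace="s3gw", body=active_standby_deployment())
```

In this shape the "standby" is Kubernetes itself: on pod or node failure it recreates the single replica, which is why the restart and crash-consistency timings discussed later in this thread matter.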

docs/decisions/0015-s3gw-ha-model.md (outdated review thread, resolved)
docs/decisions/0015-s3gw-ha-model.md (outdated review thread, resolved)

The 3 HA models have different performances and different implementation efforts.
For our use case, the *Active/Standby* model built on top of Longhorn actually makes
the most sense and brings the "best" HA characteristics relative to implementing a
Contributor

It would be nice to have a small bullet-point list of what the perceived "best" HA characteristics are.

The final aim of the research is to identify an HA model we can use with the s3gw.

The full HA research work conducted until now has been proposed with the
[High Availability research](https://github.com/aquarist-labs/s3gw/pull/685) Pull request.
Contributor

either "Pull Request" or "pull request". I'm partial towards the latter.

Author

For the time being I will go with all lowercase; anyway, I'm not sure we should point to the PR; perhaps it's better to link the HA document once it is merged?

@giubacc giubacc force-pushed the s3gw-ha-model-adr branch from 8f917d6 to 4ad0cc2 on October 2, 2023 13:21
@jecluis jecluis requested a review from l-mb October 2, 2023 14:23
@l-mb

l-mb commented Oct 5, 2023

A key part that we also need to address in our stack is the ingress layer, since we can hide many problems behind a proxy that stalls the clients, and the time it takes to connect clients to a new backend affects our timings as well.

@giubacc
Author

giubacc commented Oct 5, 2023

> A key part that we also need to address in our stack is the ingress layer, since we can hide many problems behind a proxy that stalls the clients, and the time it takes to connect clients to a new backend affects our timings as well.

Currently, with Traefik, we get a "NotFound: Not Found\n\tstatus code: 404, request id: , host id: " error when the s3gw backend service is down.
Would it be better to address Ingress-specific investigations in a separate thread?
Maybe we can state in this ADR that the Ingress part needs a dedicated investigation effort, and reserve that for a separate issue/epic?

@irq0
Contributor

irq0 commented Oct 5, 2023

> Currently, with Traefik, we get a "NotFound: Not Found\n\tstatus code: 404, request id: , host id: " error when the s3gw backend service is down.

That is not useful - clients won't retry a 404. 503 Service Unavailable would be a good option, as clients like boto3 retry it (see https://github.com/boto/boto3/blob/develop/docs/source/guide/retries.rst).
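As a small, hedged illustration of that retry behaviour (the endpoint is a placeholder; only the retry configuration matters): with boto3's standard retry mode, transient 5xx responses such as 503 are retried with backoff, while a 404 is surfaced to the caller immediately.

```python
import boto3
from botocore.config import Config

# Placeholder endpoint; credentials come from the usual AWS config/env mechanisms.
s3 = boto3.client(
    "s3",
    endpoint_url="http://s3gw.local",
    config=Config(retries={"mode": "standard", "max_attempts": 5}),
)

# If the ingress answers 503 while the backend restarts, boto3 retries the call
# (up to max_attempts) before raising; a 404 fails on the first attempt.
print(s3.list_buckets()["Buckets"])
```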

@l-mb

l-mb commented Oct 5, 2023

Basically, if we push the ingress out - just like node failure recovery, or pre-loading the images - the actual ADR boils down to "s3gw needs to be able to be crash-consistent and start up fast after such an outage, because our active/cold standby model relies on the framework around us to achieve HA."

Then, this ADR would set a worst-case recovery goal (crashed under heavy load, say) of <1-5s for the s3gw restart part of the full stack.

That's fair (and mostly true), but then we need to immediately start on the next LEP to discuss how to achieve HA goals (RTO < 30s or some such goal) end-to-end. (And how to measure/validate those numbers as part of our E2E system testing, perhaps using k6?)
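A rough sketch of how such an end-to-end recovery-time measurement could look outside k6 (the endpoint, thresholds, and probe are assumptions, not an agreed test): poll the gateway after a simulated failure and record how long it stays unavailable.

```python
import time
import requests  # plain HTTP probe; a real test might issue an S3 request instead

ENDPOINT = "http://s3gw.local"  # placeholder ingress/service address
RTO_GOAL = 30.0                 # end-to-end goal floated above, in seconds

def measure_downtime(timeout: float = 120.0, interval: float = 0.1) -> float:
    """Return the seconds until the gateway answers again after a simulated crash."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            # Anything below 500 means the request reached a live backend.
            if requests.get(ENDPOINT, timeout=1).status_code < 500:
                return time.monotonic() - start
        except requests.RequestException:
            pass  # connection refused/reset while the pod restarts
        time.sleep(interval)
    raise TimeoutError("gateway did not recover within the timeout")

if __name__ == "__main__":
    downtime = measure_downtime()
    print(f"observed downtime: {downtime:.2f}s (goal: < {RTO_GOAL}s)")
```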

@giubacc
Author

giubacc commented Oct 5, 2023

> Basically, if we push the ingress out - just like node failure recovery, or pre-loading the images - the actual ADR boils down to "s3gw needs to be able to be crash-consistent and start up fast after such an outage, because our active/cold standby model relies on the framework around us to achieve HA."
>
> Then, this ADR would set a worst-case recovery goal (crashed under heavy load, say) of <1-5s for the s3gw restart part of the full stack.
>
> That's fair (and mostly true), but then we need to immediately start on the next LEP to discuss how to achieve HA goals (RTO < 30s or some such goal) end-to-end. (And how to measure/validate those numbers as part of our E2E system testing, perhaps using k6?)

I have the impression that the activity on the ingress side (testing, researching modes that fit us, etc.) needs a proper time allocation at the management level. That is something that must be agreed upon, especially now that we are more committed to integrating with Longhorn.

@jecluis
Contributor

jecluis commented Oct 11, 2023

@l-mb is there anything else you'd like to see done here?

docs/decisions/0015-s3gw-ha-model.md (outdated review thread, resolved)
@giubacc
Author

giubacc commented Oct 25, 2023

@jecluis @l-mb ping

- Compatible with RWO persistent volume semantics
- Acceptable restart timings on switch-overs and fail-overs (excluding the non-graceful node failure)

Be aware that the [non-graceful node failure](https://github.com/aquarist-labs/s3gw/blob/4af657c573ce634cd16c53c20986e54817077b44/docs/research/ha/RATIONALE.md#non-graceful-node-failure) problem cannot be entirely solved with the *Active/Standby* model alone.
Contributor

no idea how this is passing linting. Regardless, mind fixing this line to a proper column size? Maybe move the link to a ref at the end of the document to mitigate the noise?

Author

I've instead linked the now-available HA document directly via its natural project path.
That is much shorter and definitely looks better.

@jecluis jecluis merged commit b0f122e into s3gw-tech:main Oct 25, 2023
2 checks passed