Add decision about the HA model to be used with s3gw and Longhorn #733
Conversation
I know the full details are in the research docs, but I wonder if we should add a very brief sentence to provide a little more detail about each mode? Maybe something like this (assuming I've managed to capture the essence):
- Active/Active (multiple s3gw instances concurrently serving the same data)
- Active/Warm Standby (multiple s3gw instances, one serving data, others able to take over if active instance fails)
- Active/Standby (single s3gw instance, with Kubernetes restarting/redeploying as necessary on failure)
Otherwise, LGTM. I assume we'll update the PR links later to point to the research docs once that PR is merged.
That makes perfect sense, will change; moreover, pointing to the PR is probably not ideal, it will make more sense to point to the HA document directly once the corresponding PR is merged.
docs/decisions/0015-s3gw-ha-model.md
Outdated
The 3 HA models have different performances and different implementation efforts.
For our use case, the *Active/Standby* model built on top of Longhorn actually makes
the most sense and brings the "best" HA characteristics relative to implementing a
It would be nice to have a small bullet point list of the perceived "best" HA characteristics.
docs/decisions/0015-s3gw-ha-model.md
Outdated
The final aim of the research is to identify an HA model we can use with the s3gw.

The full HA research work conducted until now has been proposed with the
[High Availability research](https://github.com/aquarist-labs/s3gw/pull/685) Pull request.
either "Pull Request" or "pull request". I'm partial towards the latter.
For the time being I will go with all lowercase; anyway, I'm not sure we should point to the PR; perhaps it would be better to link the HA document directly once merged?
Force-pushed from 8f917d6 to 4ad0cc2
A key part that we also need to address in our stack is the ingress layer, since we can hide many problems behind a proxy that stalls the clients, and the time it takes to connect them to a new backend affects our timings as well.
Currently, with Traefik, we are getting a "NotFound: Not Found\n\tstatus code: 404, request id: , host id: " when the s3gw backend service is down.
That is not useful - clients won't retry a 404. 503 Service Unavailable would be a better option, as clients like boto3 retry it: https://github.com/boto/boto3/blob/develop/docs/source/guide/retries.rst
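To make the retry behaviour concrete, here is a minimal sketch (not part of the PR; the endpoint URL and credentials are placeholders) of a boto3 client configured with the standard retry mode, which retries transient 5xx responses such as 503 but never a 404:

```python
import boto3
from botocore.config import Config

# "standard" retry mode retries transient server errors such as
# 503 Service Unavailable, while a 404 Not Found is treated as a
# terminal client error and is never retried.
retry_config = Config(
    retries={
        "max_attempts": 10,  # total attempts, including the initial request
        "mode": "standard",  # "legacy", "standard", or "adaptive"
    }
)

# Endpoint URL and credentials are placeholders for a local s3gw instance.
s3 = boto3.client(
    "s3",
    endpoint_url="http://s3gw.local",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    config=retry_config,
)

print(s3.list_buckets())
```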
Basically, if we push the ingress out - just like node failure recovery, or pre-loading the images - the actual ADR boils down to "s3gw needs to be able to be crash-consistent and start up fast after such an outage, because our active/cold standby model relies on the framework around us to achieve HA." Then, this ADR would set a worst-case recovery goal (crashed under heavy load, say) of <1-5s for the s3gw restart part of the full stack. That's fair (and mostly true), but then we need to immediately start on the next LEP to discuss how to achieve HA goals (RTO < 30s or some such goal) end-to-end. (And how to measure/validate those numbers as part of our E2E system testing, perhaps using k6?)
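As a rough illustration of how such an RTO goal could be measured end to end, the following hypothetical probe (not part of the PR; the endpoint, polling interval, and "up" criterion are all assumptions) polls the S3 endpoint behind the ingress and reports how long it stays unreachable:

```python
import time
import urllib.error
import urllib.request

ENDPOINT = "http://s3gw.local/"  # placeholder endpoint behind the ingress
TIMEOUT = 2                      # per-request timeout in seconds


def is_up() -> bool:
    """Return True only for a 2xx response.

    What counts as "up" is deployment-specific: for example, Traefik may
    answer 404 itself while the s3gw backend is down, so merely getting an
    HTTP response back is not enough to declare the service recovered.
    """
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=TIMEOUT) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


outage_start = None
while True:
    now = time.monotonic()
    if not is_up():
        if outage_start is None:
            outage_start = now  # outage begins
    elif outage_start is not None:
        print(f"service recovered after {now - outage_start:.1f}s")
        outage_start = None
    time.sleep(0.5)
```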
I have the feeling that the work on the ingress (testing, researching modes that fit us, etc.) needs a proper time allocation at the management level. That is something that must be agreed on, especially now that we are more committed to integrating with LH.
@l-mb is there anything else you'd like to see done here?
Force-pushed from 4ad0cc2 to 78e0239
docs/decisions/0018-s3gw-ha-model.md
Outdated
- Compatible with RWO persistent volume semantics
- Acceptable restart timings on switch-overs and fail-overs (excluding the non-graceful node failure)

Be aware that the [non-graceful node failure](https://github.com/aquarist-labs/s3gw/blob/4af657c573ce634cd16c53c20986e54817077b44/docs/research/ha/RATIONALE.md#non-graceful-node-failure) problem cannot be entirely solved with the *Active/Standby* model alone.
No idea how this is passing linting. Regardless, mind fixing this line to a proper column width? Maybe move the link to a reference at the end of the document to mitigate the noise?
I've instead linked the now-available HA document directly with its natural project path.
That is much shorter and looks definitely better.
Related to: https://github.com/aquarist-labs/s3gw/issues/361 Signed-off-by: Giuseppe Baccini <[email protected]>
Force-pushed from 78e0239 to 3de36c8
Describe your changes
Add decision about the HA model to be used with s3gw and Longhorn
Signed-off-by: Giuseppe Baccini [email protected]
Issue ticket number and link
Related to: https://github.com/aquarist-labs/s3gw/issues/361
Checklist before requesting a review