Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sidekiq backoff service to handle OS Places outages #272

Merged
merged 1 commit into from
Oct 18, 2023

Conversation

KludgeKML
Copy link
Contributor

@KludgeKML KludgeKML commented Sep 19, 2023

  • When OS Places has a slowdown or outage, we can avoid our own alerting problems by backing off calls (since they're not time-critical).
  • Add a service which is initialised on startup (relevant mainly to worker processes) which adjusts the scheduled interval of PostcodeProcessWorker creation (currently once per second). We record OS Places API failures, and with each failure we double the interval, until we reach a max of 180s. When we record a successful call, we reduce the interval by 1s, so it will quickly back off if many errors occur, and slowly creep back to full speed when the errors are over.

⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

https://trello.com/c/8P4LRqvl/2171-investigate-how-we-can-handle-os-places-api-being-down

@KludgeKML KludgeKML force-pushed the sidekiq-slow-down-on-error branch 2 times, most recently from 752c651 to f5d6713 Compare October 4, 2023 11:58
@KludgeKML KludgeKML force-pushed the sidekiq-slow-down-on-error branch 3 times, most recently from efe704f to 395bd75 Compare October 10, 2023 14:04
@KludgeKML KludgeKML marked this pull request as ready for review October 12, 2023 12:40
@KludgeKML KludgeKML force-pushed the sidekiq-slow-down-on-error branch from 1f9c839 to 6f9a7d0 Compare October 12, 2023 14:00
@KludgeKML KludgeKML requested a review from 1pretz1 October 12, 2023 14:01
Copy link
Contributor

@1pretz1 1pretz1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach seems reasonable, nice work! One minor comment but not a blocker


def record_success
initial_interval = current_interval
target_interval = [initial_interval - 1, @min_interval].max
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use attr_reader vs calling instance vars direct

- When OS Places has a slowdown or outage, we can avoid our own alerting problems by backing off calls (since they're not time-critical).
- Add a service which is initialised on startup (relevant mainly to worker processes) which adjusts the scheduled interval of PostcodeProcessWorker creation (currently once per second). We record OS Places API failures, and with each failure we double the interval, until we reach a max of 180s. When we record a successful call, we reduce the interval by 1s, so it will quickly back off if many errors occur, and slowly creep back to full speed when the errors are over.
@KludgeKML KludgeKML force-pushed the sidekiq-slow-down-on-error branch from 6f9a7d0 to fb6f7e6 Compare October 12, 2023 14:30
@KludgeKML KludgeKML merged commit 1706d3c into main Oct 18, 2023
4 checks passed
@KludgeKML KludgeKML deleted the sidekiq-slow-down-on-error branch October 18, 2023 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants