Gracefully handle 5xx errors #29

Open
TaxBusby opened this issue Feb 13, 2024 · 3 comments

Comments

@TaxBusby

Hey, this extension is awesome. Thanks for sharing it.

I'm trying to use it to build some automated reporting of Actions usage for our organization. This is probably just an intermittent GitHub issue, but right now some of my workflows return 502 on the /timing endpoint.

GitHub Actions Usage (2117caa)

panic: HTTP 502: Server Error (https://api.github.com/repos/xxxx/xxxx/actions/workflows/xxxx/timing)

goroutine 1 [running]:
main.getRepoUsage(0xc000150510)
	github.com/geoffreywiseman/gh-actions-usage/main.go:166 +0x2b6
main.tryDisplayAllSpecified({{0x70c3ac, 0x5}, 0x0, {0x792cc0, 0xc000028ef0}}, {0xc000010050?, 0x2e?, 0xc00005a000?})
	github.com/geoffreywiseman/gh-actions-usage/main.go:84 +0x1e6
main.main()
	github.com/geoffreywiseman/gh-actions-usage/main.go:40 +0x32a

When this happens, the whole process aborts. So if 1 repo is failing, I don't get any data in my report. When running locally, I can rerun. But during automation, I'd rather see it continue and try the other workflows/endpoints to get partial data.

Statistically speaking, if GitHub has a 0.1% chance of returning a 5xx, then if I run this on a schedule scanning ~200 repos, it's likely going to fail.
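
As a rough check on that estimate, here is a minimal sketch of the arithmetic. It assumes every API call fails independently with the same probability, and the call counts (one call per repo, then a hypothetical five workflows per repo) are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// failureProbability returns the chance that at least one of n independent
// calls fails, given a per-call failure probability p.
func failureProbability(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	// Hypothetical call counts: 200 repos with one call each,
	// then 200 repos with five workflow calls each.
	for _, n := range []int{200, 1000} {
		fmt.Printf("%d calls: %.1f%% chance of at least one 5xx\n",
			n, 100*failureProbability(0.001, n))
	}
}
```

Under these assumptions a single run with 200 calls fails roughly 18% of the time, and a run with 1000 calls roughly 63% of the time, so a scheduled job would be expected to abort regularly.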

@geoffreywiseman
Contributor

There's already some error handling in there, but certainly could use more. Of course, being able to produce errors so that I can simulate them for a test is definitely a factor -- I haven't hit status 502 very often, but even having this panic report should make it a little easier. I'll take a look, and if you're able to reproduce with some consistency, then I hope you'll also be able to tell me if the changes help, once I have some.

@geoffreywiseman
Contributor

geoffreywiseman commented Feb 24, 2024

Thinking through options.

Automatic Retries

I could retry in the case of a 502, and if most GH 502s are transient and a retry will fix them, that might solve the problem. But not all errors are 502s, not every 502 will be fixed by an immediate retry, and there are lots of different REST calls being made.

That could mean I fix the 502 on 'get repo usage' from your panic and something else immediately dies: a different call, a different 5xx, or a retry that also fails, so it's not really fixed. But ... it might work most of the time and probably delivers the best experience, even though it might mean slowly adding retries to more API calls and deciding which HTTP statuses are most likely to work after a retry.
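
A minimal sketch of what such a retry wrapper might look like, not the extension's actual code. The `statusError` type, `isTransient`, and `withRetry` are hypothetical names introduced for illustration; the assumption is that the failing call can surface the HTTP status in its error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// statusError is a hypothetical error type carrying an HTTP status code.
type statusError struct{ Code int }

func (e *statusError) Error() string { return fmt.Sprintf("HTTP %d", e.Code) }

// isTransient reports whether err looks like a server-side (5xx) failure.
func isTransient(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.Code >= 500 && se.Code <= 599
}

// withRetry calls fetch up to attempts times, doubling the delay after each
// transient failure. Non-transient errors are returned immediately.
func withRetry(attempts int, delay time.Duration, fetch func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fetch(); err == nil || !isTransient(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated endpoint: fails with 502 twice, then succeeds.
	err := withRetry(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return &statusError{Code: 502}
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Wrapping each REST call in a helper like this keeps the retry policy in one place, so deciding which statuses count as transient is a change to `isTransient` alone.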

Error Output

On the other hand, if no amount of retries is guaranteed to fix it, it might be better to simply display an error. Except that, in order to separate information-gathering from presentation, the information passes through a simple data structure before being presented, and that structure doesn't have a good place to put errors. Panicking solves that, but means, as you said, that you don't get an answer, and if errors aren't super-rare, that affects the usability of the extension as a whole.

The structure could be changed to carry error information, but I'll have to decide what to pass and how to present it for current and future formats. That's not the end of the world, but it's definitely a bit of design and implementation work that is more cross-cutting than retries.
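
One possible shape for such a structure, sketched under assumptions: `WorkflowUsage`, `collectUsage`, `fakeFetch`, and the workflow file names are all hypothetical and do not reflect the extension's real types. Each record holds either a usage value or the error that prevented it from being fetched, so presentation code can decide how to render failures.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkflowUsage is a hypothetical result record: Millis is meaningful only
// when Err is nil.
type WorkflowUsage struct {
	Workflow string
	Millis   int64
	Err      error
}

// errServer stands in for a 5xx response from the API.
var errServer = errors.New("HTTP 502: Server Error")

// fakeFetch simulates the usage lookup; one workflow always fails.
func fakeFetch(workflow string) (int64, error) {
	if workflow == "deploy.yml" {
		return 0, errServer
	}
	return 120000, nil
}

// collectUsage queries every workflow and records failures instead of aborting.
func collectUsage(workflows []string, fetch func(string) (int64, error)) []WorkflowUsage {
	results := make([]WorkflowUsage, 0, len(workflows))
	for _, wf := range workflows {
		ms, err := fetch(wf)
		results = append(results, WorkflowUsage{Workflow: wf, Millis: ms, Err: err})
	}
	return results
}

func main() {
	for _, u := range collectUsage([]string{"ci.yml", "deploy.yml"}, fakeFetch) {
		if u.Err != nil {
			fmt.Printf("%s: error: %v\n", u.Workflow, u.Err)
			continue
		}
		fmt.Printf("%s: %d ms\n", u.Workflow, u.Millis)
	}
}
```

With this shape, one failing workflow yields an error row while the others still report their usage, which is the partial-data behaviour requested above.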

And some errors should maybe still cause a panic, but that leaves the question of which errors to panic on and which to report-and-continue.
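
To make that question concrete, here is a small sketch of one way to draw the line. The `Severity` type and `classify` function are hypothetical, and the split itself (only 401 aborts, everything else is reported) is an assumption for illustration, not a recommendation from the thread.

```go
package main

import "fmt"

// Severity says what the caller should do with a failed API call.
type Severity int

const (
	Report Severity = iota // record the error and keep going
	Fatal                  // abort: later calls would likely fail the same way
)

// classify maps an HTTP status to a Severity. The mapping is an assumption:
// bad credentials (401) affect every call, while other failures may be
// limited to a single repo or workflow.
func classify(status int) Severity {
	if status == 401 {
		return Fatal
	}
	return Report
}

func main() {
	for _, status := range []int{401, 404, 502} {
		if classify(status) == Fatal {
			fmt.Printf("HTTP %d: abort\n", status)
		} else {
			fmt.Printf("HTTP %d: report and continue\n", status)
		}
	}
}
```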

Path Forward

Tempted to start with a quick attempt at retry logic on 'get workflow usage' and see if that works more often than not for you. Depending on how that goes, I'll have a better idea of whether I should expand on that to handle more errors or if it's not worth the trouble and I should report them instead.

@TaxBusby
Author

I am also hitting rate-limiting issues when running the script a lot (probably due to the large repo count), so I will have to refactor/iterate a bit to get better data for you. Interestingly, it looks like all of my runs with 502s were for the same workflow URL. It doesn't happen every time, but it may actually point to an issue with a specific workflow run.

Thanks for the replies, I will try to provide more info when I get a chance to work on it again.

2 participants