test failed in CI: integration_tests::metrics::test_metrics #6895
Comments
I hit this locally and looked into it for a bit. I think this might be a bug / race in Oximeter producer assignment. There are two ways Oximeter decides what its producers are:
Oximeter logs all of these; snipping out just the relevant bits from a failed test:
So I think what happened was:
At this point Oximeter is aware of 2 producers but pruned the third, which is the one this test in particular cares about. It would have corrected itself the next time the refresh loop ran, but that's 10 minutes away, and the test times out after 1.
Thanks @jgallagher for the analysis, I think that is all consistent with the behavior we see here. One possible solution is to keep track of the time we start a refresh and the time every producer is stored. When we complete the refresh and update the internal set of producers, we prune producers that are in our map but not the refreshed list, except those which we inserted after we started the refresh. I think that should ensure that this exact sequence is handled correctly, but we might need to think more carefully about other possible races.
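To make the timestamp idea concrete, here is a minimal Rust sketch. It is not Oximeter's actual code: `TimedProducers`, `insert`, and `apply_refresh` are hypothetical names, producers are reduced to string IDs, and timestamps are passed in explicitly so the behavior is easy to reason about. The sketch keeps entries inserted at exactly the refresh start time too, which is an assumption about how ties should be handled.

```rust
use std::collections::BTreeMap;
use std::time::Instant;

/// Hypothetical stand-in for the collector's producer map. Each producer ID
/// maps to the time at which it was stored.
#[derive(Debug, Default)]
pub struct TimedProducers {
    inserted_at: BTreeMap<String, Instant>,
}

impl TimedProducers {
    /// Store a producer that was registered directly, recording when.
    pub fn insert(&mut self, id: &str, now: Instant) {
        self.inserted_at.insert(id.to_string(), now);
    }

    /// Apply the result of a refresh that began at `refresh_started`.
    /// Producers missing from `listed` are pruned unless they were stored
    /// at or after the refresh began.
    pub fn apply_refresh(
        &mut self,
        refresh_started: Instant,
        listed: &[&str],
        now: Instant,
    ) {
        self.inserted_at.retain(|id, at| {
            *at >= refresh_started || listed.iter().any(|l| *l == id.as_str())
        });
        // Add producers the refresh reported that we did not know about.
        for id in listed {
            self.inserted_at.entry(id.to_string()).or_insert(now);
        }
    }

    pub fn contains(&self, id: &str) -> bool {
        self.inserted_at.contains_key(id)
    }
}
```

With this shape, a producer stored while the refresh is in flight survives even though the refreshed list does not mention it, while a producer stored before the refresh and absent from the list is pruned.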
@jgallagher and I talked in chat a bit about this, and it might be clearer to use generation numbers. We would really need two:
As producers are POSTed to oximeter, we assign them the current collection generation. When oximeter starts to refresh its list, it first records the current generation number of the collection task set. It pulls its entire list, and then calls into That guarantees that producers which Nexus sent us between when we recorded the generation and started our own refresh are not pruned. Those would have a later generation number.
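A rough Rust sketch of the generation-number scheme, under some loud assumptions: `ProducerSet` and its methods are hypothetical names rather than Oximeter's real types, producers are reduced to string IDs, and the sketch uses a single counter that is bumped when a refresh starts. The comment above mentions two generation numbers, so the real design may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the collector's set of producers. Each producer
/// ID maps to the generation at which it was registered.
#[derive(Debug, Default)]
pub struct ProducerSet {
    generation: u64,
    producers: BTreeMap<String, u64>,
}

impl ProducerSet {
    /// Register a producer, tagging it with the current generation.
    pub fn register(&mut self, id: &str) {
        self.producers.insert(id.to_string(), self.generation);
    }

    /// Record the generation at the start of a refresh. Bumping the counter
    /// here (an assumption of this sketch) means anything registered from
    /// now on carries a strictly later generation than the snapshot.
    pub fn start_refresh(&mut self) -> u64 {
        let snapshot = self.generation;
        self.generation += 1;
        snapshot
    }

    /// Apply the refreshed list. Producers absent from `listed` are pruned
    /// only if they were registered at or before the snapshot generation.
    pub fn finish_refresh(&mut self, snapshot: u64, listed: &[&str]) {
        self.producers.retain(|id, assigned| {
            *assigned > snapshot || listed.iter().any(|l| *l == id.as_str())
        });
        // Add producers from the refreshed list that we did not know about.
        for id in listed {
            self.producers.entry(id.to_string()).or_insert(snapshot);
        }
    }

    pub fn contains(&self, id: &str) -> bool {
        self.producers.contains_key(id)
    }
}
```

A producer registered between `start_refresh` and `finish_refresh` gets a generation greater than the snapshot, so it is kept even when the refreshed list omits it. Unlike the timestamp variant, this comparison does not depend on clock resolution.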
Epoch-based reclamation ;)
- Add generation numbers to the collection of oximeter producers, and assign the current generation to each producer as it is registered.
- Modify the refresh method to first take the generation number before starting to list current producers. Then use that to avoid pruning producers that are _new_ since we started refreshing our list.
- Fixes #6895 and possibly #6901
I saw a different test ( This sure feels related to me, so I thought I'd mention it here instead of making a fresh issue for it.
- Remove `oximeter` producer HTTP endpoint for registering individual producers.
- Dramatically reduce interval on which `oximeter` collector refreshes its list of producers. This is now the only way the collector learns of producers. The interval is also much smaller in tests to ensure pretty snappy registrations.
- Remove calls from both Nexus and the `oximeter` standalone mock Nexus for registering producers.
- Have `oximeter` collector start polling producers immediately, rather than waiting for the first polling interval to expire.
- Closes #6916, #6895, and #6901
Closed by #6926
This test failed on a CI run on #6881
https://github.com/oxidecomputer/omicron/pull/6881/checks?check_run_id=31702768426
Link here to the specific line of output from the buildomat log showing the failure:
https://buildomat.eng.oxide.computer/wg/0/details/01JAE89628KSREHPCPZHC8DVS0/TV1XNO9ztAlr3QY3XQh3zGrYKkUsoCJHt0VvRzVuqCqsj1Ab/01JAE89QQBY3BRXDXDZ6N76977#S6309
Excerpt from the log showing the failure: