Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failed in CI: integration_tests::metrics::test_metrics #6895

Closed
smklein opened this issue Oct 17, 2024 · 6 comments
Closed

test failed in CI: integration_tests::metrics::test_metrics #6895

smklein opened this issue Oct 17, 2024 · 6 comments
Assignees
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@smklein
Copy link
Collaborator

smklein commented Oct 17, 2024

This test failed on a CI run on #6881

https://github.com/oxidecomputer/omicron/pull/6881/checks?check_run_id=31702768426

Link here to the specific line of output from the buildomat log showing the failure:
https://buildomat.eng.oxide.computer/wg/0/details/01JAE89628KSREHPCPZHC8DVS0/TV1XNO9ztAlr3QY3XQh3zGrYKkUsoCJHt0VvRzVuqCqsj1Ab/01JAE89QQBY3BRXDXDZ6N76977#S6309

Excerpt from the log showing the failure:

6325	2024-10-17T22:43:00.424Z	--- STDERR:              omicron-nexus::test_all integration_tests::metrics::test_metrics ---
6326	2024-10-17T22:43:00.424Z	log file: /var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.0.log
6327	2024-10-17T22:43:00.424Z	note: configured to log to "/var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.0.log"
6328	2024-10-17T22:43:00.424Z	DB URL: postgresql://root@[::1]:41953/omicron?sslmode=disable
6329	2024-10-17T22:43:00.424Z	DB address: [::1]:41953
6330	2024-10-17T22:43:00.424Z	log file: /var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.2.log
6331	2024-10-17T22:43:00.424Z	note: configured to log to "/var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.2.log"
6332	2024-10-17T22:43:00.424Z	log file: /var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.3.log
6333	2024-10-17T22:43:00.424Z	note: configured to log to "/var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.3.log"
6334	2024-10-17T22:43:00.424Z	thread 'integration_tests::metrics::test_metrics' panicked at nexus/tests/integration_tests/metrics.rs:804:6:
6335	2024-10-17T22:43:00.424Z	Failed to find producer within time limit: TimedOut(60.187220276s)
6336	2024-10-17T22:43:00.424Z	stack backtrace:
6337	2024-10-17T22:43:00.424Z	   0: rust_begin_unwind
6338	2024-10-17T22:43:00.424Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
6339	2024-10-17T22:43:00.424Z	   1: core::panicking::panic_fmt
6340	2024-10-17T22:43:00.424Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
6341	2024-10-17T22:43:00.425Z	   2: core::result::unwrap_failed
6342	2024-10-17T22:43:00.425Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1679:5
6343	2024-10-17T22:43:00.425Z	   3: core::result::Result<T,E>::expect
6344	2024-10-17T22:43:00.425Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1059:23
6345	2024-10-17T22:43:00.425Z	   4: {async_fn#0}
6346	2024-10-17T22:43:00.425Z	             at ./tests/integration_tests/metrics.rs:787:5
6347	2024-10-17T22:43:00.425Z	   5: {async_fn#0}<newtype_uuid::TypedUuid<omicron_uuid_kinds::OmicronZoneKind>>
6348	2024-10-17T22:43:00.425Z	             at ./tests/integration_tests/metrics.rs:778:71
6349	2024-10-17T22:43:00.425Z	   6: {async_fn#0}
6350	2024-10-17T22:43:00.425Z	             at ./tests/integration_tests/metrics.rs:189:6
6351	2024-10-17T22:43:00.425Z	   7: {async_block#0}
6352	2024-10-17T22:43:00.425Z	             at ./tests/integration_tests/metrics.rs:176:1
6353	2024-10-17T22:43:00.425Z	   8: poll<&mut dyn core::future::future::Future<Output=()>>
6354	2024-10-17T22:43:00.425Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
6355	2024-10-17T22:43:00.425Z	   9: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6356	2024-10-17T22:43:00.425Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
6357	2024-10-17T22:43:00.425Z	  10: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
6358	2024-10-17T22:43:00.425Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:696:57
6359	2024-10-17T22:43:00.425Z	  11: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
6360	2024-10-17T22:43:00.425Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:107:5
6361	2024-10-17T22:43:00.425Z	  12: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
6362	2024-10-17T22:43:00.425Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:73:5
6363	2024-10-17T22:43:00.425Z	  13: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
6364	2024-10-17T22:43:00.425Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:696:25
6365	2024-10-17T22:43:00.425Z	  14: tokio::runtime::scheduler::current_thread::Context::enter
6366	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:423:19
6367	2024-10-17T22:43:00.426Z	  15: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
6368	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:695:36
6369	2024-10-17T22:43:00.426Z	  16: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
6370	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:774:68
6371	2024-10-17T22:43:00.426Z	  17: tokio::runtime::context::scoped::Scoped<T>::set
6372	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/scoped.rs:40:9
6373	2024-10-17T22:43:00.426Z	  18: tokio::runtime::context::set_scheduler::{{closure}}
6374	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context.rs:180:26
6375	2024-10-17T22:43:00.426Z	  19: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
6376	2024-10-17T22:43:00.426Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
6377	2024-10-17T22:43:00.426Z	  20: std::thread::local::LocalKey<T>::with
6378	2024-10-17T22:43:00.426Z	             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
6379	2024-10-17T22:43:00.426Z	  21: tokio::runtime::context::set_scheduler
6380	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context.rs:180:9
6381	2024-10-17T22:43:00.426Z	  22: tokio::runtime::scheduler::current_thread::CoreGuard::enter
6382	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:774:27
6383	2024-10-17T22:43:00.426Z	  23: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
6384	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:683:19
6385	2024-10-17T22:43:00.426Z	  24: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6386	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:191:28
6387	2024-10-17T22:43:00.426Z	  25: tokio::runtime::context::runtime::enter_runtime
6388	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/runtime.rs:65:16
6389	2024-10-17T22:43:00.426Z	  26: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6390	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:179:9
6391	2024-10-17T22:43:00.426Z	  27: tokio::runtime::runtime::Runtime::block_on_inner
6392	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/runtime.rs:361:47
6393	2024-10-17T22:43:00.426Z	  28: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6394	2024-10-17T22:43:00.426Z	             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/runtime.rs:335:13
6395	2024-10-17T22:43:00.426Z	  29: test_metrics
6396	2024-10-17T22:43:00.427Z	             at ./tests/integration_tests/metrics.rs:176:1
6397	2024-10-17T22:43:00.427Z	  30: test_all::integration_tests::metrics::test_metrics::{{closure}}
@smklein smklein added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Oct 17, 2024
@jgallagher
Copy link
Contributor

I hit this locally and looked into it for a bit. I think this might be a bug / race in Oximeter producer assignment. There are two ways Oximeter decides what its producers are:

  1. It runs a refresh loop where it periodically asks Nexus for its full set of producers. The default interval for this loop is 10 minutes; we don't change that in tests, so it effectively only runs once, around the time Oximeter starts.
  2. When a new producer registers with Nexus, Nexus sends a post to Oximeter telling it about the new producer.

Oximeter logs all of these; snipping out just the relevant bits from a failed test:

13:42:53.584Z INFO test_mgs_metrics (oximeter): refreshing list of producers from Nexus
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    file = oximeter/collector/src/agent.rs:761
13:42:53.622Z DEBG test_mgs_metrics (oximeter): registered new metric producer
    address = [::1]:44257
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    producer_id = a6458b7d-87c3-4483-be96-854d814c20de
13:42:53.658Z DEBG test_mgs_metrics (oximeter): registered new metric producer
    address = [::1]:40449
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    producer_id = 47991ac2-392b-4188-bf19-96e0f0440d14
13:42:53.678Z INFO test_mgs_metrics (oximeter): refreshed list of producers from Nexus
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    file = oximeter/collector/src/agent.rs:809
    n_current_tasks = 1
    n_pruned_tasks = 1
13:42:54.442Z DEBG test_mgs_metrics (oximeter): registered new metric producer
    address = 127.0.0.1:41885
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    producer_id = e559cb0e-f223-4273-a733-d5f31b1be530

So I think what happened was:

  1. Oximeter sent a request to Nexus to list its current producers; this is a series of paginated requests (in practice in this test it's two requests: one that fetches the first page, and a second that gets an empty result, signaling the end of the list).
  2. Nexus POST'd a new collector (a6458b7d-87c3-4483-be96-854d814c20de; this isn't MGS).
  3. Nexus POST'd another new collector (47991ac2-392b-4188-bf19-96e0f0440d14; this one is MGS).
  4. Oximeter finished getting its list of producers. There was only one producer in this list (n_current_tasks = 1), so it pruned the other one it new about (MGS, because we got unlucky from timing or UUID ordering or something).
  5. Nexus POST'd another new collector (e559cb0e-f223-4273-a733-d5f31b1be530; also not MGS).

At this point Oximeter is aware of 2 collectors but pruned the third, and the one this test in particular cares about. It would have corrected itself the next time the refresh loop ran, but that's 10 minutes away, and the test times out after 1.

@bnaecker
Copy link
Collaborator

Thanks @jgallagher for the analysis, I think that is all consistent with the behavior we see here.

One possible solution is to keep track of the time we start a refresh and the time every producer is stored. When we complete the refresh and update the internal set of producers, we prune producers that are in our map but not the refreshed list, except those which we inserted after we started the refresh. I think that should ensure that this exact sequence is handled correctly, but we might need to think more carefully about other possible races.

@bnaecker bnaecker self-assigned this Oct 18, 2024
@bnaecker
Copy link
Collaborator

@jgallagher and I talked in chat a bit about this, and it might be clearer to use generation numbers. We would really need two:

  • a generation number on the whole set of collection tasks
  • a per-producer generation number that records the generation of the set in which the producer was added or updated

As producers are POST'd to oximeter, we assign them the current collection generation. When oximeter starts to refresh its list, it first records the current generation number of the collection task set. It pulls its entire list, and then calls into ensure_producers() with that generation number it recorded. As it iterates through the list, it mostly operates in the same way it does today, except that it does not prune any producers with a later generation number than was passed into the call.

That guarantees that producers which Nexus sent us between when we recorded the generation and started our own refresh are not pruned. Those would have a later generation number.

@andrewjstone
Copy link
Contributor

@jgallagher and I talked in chat a bit about this, and it might be clearer to use generation numbers. We would really need two:

  • a generation number on the whole set of collection tasks
  • a per-producer generation number that records the generation of the set in which the producer was added or updated

As producers are POST'd to oximeter, we assign them the current collection generation. When oximeter starts to refresh its list, it first records the current generation number of the collection task set. It pulls its entire list, and then calls into ensure_producers() with that generation number it recorded. As it iterates through the list, it mostly operates in the same way it does today, except that it does not prune any producers with a later generation number than was passed into the call.

That guarantees that producers which Nexus sent us between when we recorded the generation and started our own refresh are not pruned. Those would have a later generation number.

Epoch based reclamation ;)

bnaecker added a commit that referenced this issue Oct 18, 2024
- Add generation numbers to the collection of oximeter producers, and
  assign the currrent generation to each producer as it is registered.
- Modify the refresh method to first take the generation number before
  starting to list current producers. Then use that to avoid pruning
  producers that are _new_ since we started refreshing our list.
- Fixes #6895 and possibly #6901
@hawkw
Copy link
Member

hawkw commented Oct 23, 2024

I saw a different test (integration_tests::oximeter::test_oximeter_reregistration) fail on #6913: https://buildomat.eng.oxide.computer/wg/0/details/01JAXAGXWPE822857D5D1DP0N4/DMz6rNUPg3AtDtKClnW7rWsxzHSSFztK1IEKxJe1w21To9FJ/01JAXAH9TM5NJ0PDYF6RSJXV5H#S5672

This sure feels related to me, so I thought I'd mention it here instead of making fresh issue for it.

bnaecker added a commit that referenced this issue Oct 24, 2024
- Remove `oximeter` producer HTTP endpoint for registering individual
  producers.
- Dramatically reduce interval on which `oximeter` collector refreshes
  its list of producers. This is now the only way the collector learns
  of producers. The interval is also much smaller in tests to ensure
  pretty snappy registrations
- Remove calls from both Nexus and the `oximeter` standalone mock Nexus
  for registering producers
- Have `oximeter` collector start polling producers immediately, rather
  than waiting for the first polling interval to expire.
- Closes #6916, #6895, and #6901
bnaecker added a commit that referenced this issue Oct 25, 2024
- Remove `oximeter` producer HTTP endpoint for registering individual
producers.
- Dramatically reduce interval on which `oximeter` collector refreshes
its list of producers. This is now the only way the collector learns of
producers. The interval is also much smaller in tests to ensure pretty
snappy registrations
- Remove calls from both Nexus and the `oximeter` standalone mock Nexus
for registering producers
- Have `oximeter` collector start polling producers immediately, rather
than waiting for the first polling interval to expire.
- Closes #6916, #6895, and #6901
@bnaecker
Copy link
Collaborator

Closed by #6926

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants