kafka_int_tests::cluster_1_rack_single_shotover::case_1_cpp fix intermittent failure #1834

Merged on Nov 25, 2024 (5 commits)

Conversation

rukai
Member

@rukai rukai commented Nov 22, 2024

Attempt to fix intermittent failure: https://github.com/shotover/shotover-proxy/actions/runs/11926547980/job/33240776520

shotover   04:20:12.648973Z  INFO shotover::runner: Starting Shotover 0.5.2
shotover   04:20:12.649001Z  INFO shotover::runner: configuration=Config { main_log_level: "info, shotover::connection_span=debug", observability_interface: Some("0.0.0.0:9001") }
shotover   04:20:12.649021Z  INFO shotover::runner: topology=Topology { sources: [Kafka(KafkaConfig { name: "kafka", listen_addr: "127.0.0.1:9192", connection_limit: None, hard_connection_limit: None, tls: None, timeout: None, chain: TransformChainConfig([KafkaSinkClusterConfig { first_contact_points: ["172.16.1.2:9092"], shotover_nodes: [ShotoverNodeConfig { address_for_clients: "127.0.0.1:9192", address_for_peers: "127.0.0.1:9192", rack: "rack0", broker_id: 0 }], local_shotover_broker_id: 0, connect_timeout_ms: 3000, read_timeout: None, check_shotover_peers_delay_ms: Some(3000), tls: None, authorize_scram_over_mtls: None }]) })] }
shotover   04:20:12.649075Z  INFO shotover::sources::kafka: Starting Kafka source on [127.0.0.1:9192]
shotover   04:20:12.649286Z  INFO shotover::config::topology: Shotover is now accepting inbound connections
shotover   04:20:12.664623Z PANIC panicked at shotover/src/transforms/kafka/sink_cluster/mod.rs:3885:35:
called `Option::unwrap()` on a `None` value
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: core::option::unwrap_failed
   4: shotover::transforms::kafka::sink_cluster::random_broker_id
   5: shotover::transforms::kafka::sink_cluster::KafkaSinkCluster::route_to_random_broker
   6: shotover::transforms::kafka::sink_cluster::KafkaSinkCluster::route_requests::{{closure}}
   7: <shotover::transforms::kafka::sink_cluster::KafkaSinkCluster as shotover::transforms::Transform>::transform::{{closure}}
   8: shotover::transforms::ChainState::call_next_transform::{{closure}}
   9: shotover::transforms::chain::TransformChain::process_request::{{closure}}
  10: shotover::server::Handler<C>::send_receive_chain::{{closure}}
  11: shotover::server::Handler<C>::run::{{closure}}
  12: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
  13: tokio::runtime::task::raw::poll
  14: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  15: tokio::runtime::task::raw::poll

The panic occurs because we attempted to route to a random node, but Shotover did not have any nodes in its metadata.
This should be impossible, because we always query for a list of nodes before we reach the routing stage.
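
To illustrate the failure mode: the body of `random_broker_id` is not shown in this PR, so the following is only a sketch, assuming it selects from a list of known brokers and unwraps the result. `pick_broker_id` and its `seed` parameter are hypothetical stand-ins (a real implementation would presumably use a random number generator), and this is not Shotover's actual code.

```rust
/// Hypothetical stand-in for a "pick any broker" helper.
/// `seed` replaces a real random number generator so the sketch needs only std.
pub fn pick_broker_id(broker_ids: &[i32], seed: usize) -> Option<i32> {
    if broker_ids.is_empty() {
        // No brokers in the metadata: there is nothing to route to.
        None
    } else {
        Some(broker_ids[seed % broker_ids.len()])
    }
}

/// Mirrors the panicking pattern from the stack trace: unwrapping the
/// selection panics with "called `Option::unwrap()` on a `None` value"
/// whenever the broker list is empty.
pub fn pick_broker_id_or_panic(broker_ids: &[i32], seed: usize) -> i32 {
    pick_broker_id(broker_ids, seed).unwrap()
}
```

Under that assumption, the only way to reach the panic is for the metadata query to have succeeded while returning zero brokers.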

From the log I can see that we hit the panic 15ms after Shotover was ready to accept connections.
This indicates that while the Kafka cluster was ready for connections, it likely still had a little initialization work left to do and therefore returned an empty list of nodes.
I have seen this kind of behavior before from Redis, and I think from Cassandra as well.

So I am fairly confident that adding a retry to our broker list query will resolve this intermittent failure.
Unfortunately, I cannot reproduce the issue locally to verify this.
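
The retry idea can be sketched roughly as follows. This is a minimal, synchronous illustration using only the standard library, not the change that was actually merged: `Broker`, `query_brokers_with_retry`, the `query` closure, and the attempt/delay parameters are all hypothetical names, and Shotover's real code is async. The key point is that an empty broker list is treated as "not ready yet" rather than as a valid answer.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for a broker entry in the cluster metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Broker {
    pub id: i32,
    pub address: String,
}

/// Calls `query` until it returns a non-empty broker list,
/// or until `max_attempts` calls have been made.
pub fn query_brokers_with_retry<F>(
    mut query: F,
    max_attempts: u32,
    delay: Duration,
) -> Result<Vec<Broker>, String>
where
    F: FnMut() -> Result<Vec<Broker>, String>,
{
    let mut last_error = String::from("broker list was empty");
    for attempt in 1..=max_attempts {
        match query() {
            // A non-empty list means the cluster has finished initializing.
            Ok(brokers) if !brokers.is_empty() => return Ok(brokers),
            // An empty list is retried instead of being stored as metadata.
            Ok(_) => last_error = String::from("broker list was empty"),
            Err(e) => last_error = e,
        }
        // Wait before the next attempt, but not after the final one.
        if attempt < max_attempts {
            sleep(delay);
        }
    }
    Err(format!(
        "giving up after {} attempts: {}",
        max_attempts, last_error
    ))
}
```

With this shape, the routing stage only ever sees a non-empty broker list, or the request fails with an error instead of panicking.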

@rukai rukai force-pushed the cluster_1_rack_single_shotover_flakey_test branch from c7fc0be to 9d56757 Compare November 22, 2024 03:43

codspeed-hq bot commented Nov 22, 2024

CodSpeed Performance Report

Merging #1834 will not alter performance

Comparing rukai:cluster_1_rack_single_shotover_flakey_test (86355d5) with main (47146b6)

Summary

✅ 38 untouched benchmarks

@rukai rukai marked this pull request as ready for review November 22, 2024 04:55
@rukai rukai enabled auto-merge (squash) November 25, 2024 22:37
@rukai rukai merged commit 27248c2 into shotover:main Nov 25, 2024
41 checks passed