KafkaSinkCluster: handle broker closing connection due to idle timeout #1752

Merged
merged 4 commits into from
Sep 18, 2024

Conversation

rukai
Member

@rukai rukai commented Sep 17, 2024

Kafka brokers have a config field connections.max.idle.ms: when a connection has been idle for longer than this timeout, the broker closes it.
Normally a client can expect its idle connections to be timed out after some time of being unused.
However, when KafkaSinkCluster is sitting between the client and the broker, there can be some unexpected behaviour.

For example:

  1. We have a Kafka cluster configured with connections.max.idle.ms=5s.
  2. A client sends a request every 1s.
  3. Shotover routes the first request to node A.
  4. Shotover routes the next 10 requests to node B.
  5. Shotover routes another request to node A.
    • This request fails because the connection to node A hit the idle timeout, even though the client was sending one request per second, well within the timeout! (See the sketch after this list.)
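
To make the arithmetic explicit, here is a minimal standalone sketch of the timing in the example above. This is not Shotover code; the numbers are the ones assumed in the example:

```rust
// Not Shotover code: just an illustration of why node A's connection is closed
// even though the client itself is far from idle.
fn main() {
    let idle_timeout_secs = 5; // broker's connections.max.idle.ms, expressed in seconds
    let seconds_between_client_requests = 1; // the client is clearly not idle
    let requests_routed_to_node_b = 10; // requests that never touch node A

    // While those 10 requests go to node B, shotover's connection to node A carries
    // no traffic at all, so from the broker's point of view it is idle for:
    let node_a_idle_secs = requests_routed_to_node_b * seconds_between_client_requests;

    assert!(seconds_between_client_requests < idle_timeout_secs); // the client looks healthy
    assert!(node_a_idle_secs > idle_timeout_secs); // but node A closes its connection anyway

    println!(
        "node A connection idle for {node_a_idle_secs}s > {idle_timeout_secs}s timeout: \
         broker closes it, next request routed to node A fails"
    );
}
```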

With more reasonable timeout values this issue is harder to hit, but we are still hitting it in our preprod environment with the default idle timeout of 10 minutes.

Integration test

This PR introduces an integration test that runs Kafka with a very low connections.max.idle.ms (30s), which reproduces the issue described above.
The new integration test fails when run against shotover's main branch.
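
Shotover's real test uses its own test helpers, so the following is only a rough sketch of the test's idea: it assumes Shotover is listening on 127.0.0.1:9192 in front of a Kafka cluster configured with connections.max.idle.ms=30s, and uses a recent version of the rdkafka crate. The address, topic name and sleep duration are illustrative assumptions, not values from the actual test.

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};

// Illustrative sketch only: not Shotover's integration test code.
#[tokio::main]
async fn main() {
    let producer: FutureProducer = ClientConfig::new()
        // Point the client at shotover, not directly at the brokers.
        .set("bootstrap.servers", "127.0.0.1:9192")
        .create()
        .expect("failed to create producer");

    // First request: shotover opens its own connections to the brokers it routes to.
    producer
        .send(
            FutureRecord::to("idle_timeout_test").key("k").payload("first"),
            Duration::from_secs(5),
        )
        .await
        .expect("first send failed");

    // Sit idle for longer than the broker's connections.max.idle.ms (30s here),
    // so the broker closes the connection that shotover is still holding on to.
    tokio::time::sleep(Duration::from_secs(40)).await;

    // On main this send fails because shotover reuses the closed connection;
    // with this PR the connection is recreated and the send succeeds.
    producer
        .send(
            FutureRecord::to("idle_timeout_test").key("k").payload("second"),
            Duration::from_secs(5),
        )
        .await
        .expect("second send failed");
}
```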

Fix implementation

This PR then fixes the failing test by adding a check to our connection retrieval logic that recreates a connection when it has hit an error.
This does not attempt to handle every possible failure caused by a closed connection, as that would be impossible due to race conditions.
Instead we make a best-effort attempt that catches the regular failures we are seeing.
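
As a rough illustration of that idea, here is a simplified synchronous sketch (not Shotover's actual implementation, which is async): before reusing a pooled connection, check whether the broker has already closed it and transparently open a new one.

```rust
use std::io;
use std::net::TcpStream;

/// Hypothetical "get or recreate" connection logic. Before reusing a pooled connection,
/// detect whether the broker has already closed it (for example because
/// connections.max.idle.ms expired) and open a fresh one instead.
fn get_connection<'a>(
    pooled: &'a mut Option<TcpStream>,
    broker_addr: &str,
) -> io::Result<&'a mut TcpStream> {
    let reusable = pooled.as_ref().map_or(false, peer_still_open);
    if !reusable {
        // Either there was no connection yet or the broker closed the old one:
        // replace it with a new connection before sending the request.
        *pooled = Some(TcpStream::connect(broker_addr)?);
    }
    Ok(pooled.as_mut().unwrap())
}

/// Best-effort check for a connection the broker has already closed.
fn peer_still_open(stream: &TcpStream) -> bool {
    let mut buf = [0u8; 1];
    // Temporarily switch to non-blocking mode so the peek does not wait for data.
    let _ = stream.set_nonblocking(true);
    // Ok(0) from peek means the peer sent EOF, i.e. the broker closed the connection.
    // Err(WouldBlock) just means no data is waiting, which is fine for an idle connection.
    let open = !matches!(stream.peek(&mut buf), Ok(0));
    let _ = stream.set_nonblocking(false);
    open
}
```

Even with such a check, the broker can still close the connection between the check and the send, which is exactly the race condition the description says cannot be fully eliminated; hence "best effort".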

@rukai rukai force-pushed the catch_closed_connections_during_send branch from c3f2530 to 06ae653 on September 17, 2024 01:33

codspeed-hq bot commented Sep 17, 2024

CodSpeed Performance Report

Merging #1752 will improve performance by 10.65%

Comparing rukai:catch_closed_connections_during_send (5722c26) with main (12ff171)

Summary

⚡ 1 improvement
✅ 38 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | rukai:catch_closed_connections_during_send | Change |
| --- | --- | --- | --- |
| encode_request_metadata | 11.6 µs | 10.5 µs | +10.65% |

@rukai rukai force-pushed the catch_closed_connections_during_send branch 3 times, most recently from 5fa3428 to f1454b4 on September 17, 2024 03:01
@rukai rukai force-pushed the catch_closed_connections_during_send branch from f1454b4 to 20faa0f on September 17, 2024 03:04
@rukai rukai force-pushed the catch_closed_connections_during_send branch from 69d3073 to ffca2ef on September 17, 2024 03:59
@rukai rukai changed the title from "catch closed connections during send" to "KafkaSinkCluster handle kafka broker closing connections due to idle timeout" on Sep 17, 2024
@rukai rukai changed the title from "KafkaSinkCluster handle kafka broker closing connections due to idle timeout" to "KafkaSinkCluster: handle broker closing connection due to idle timeout" on Sep 17, 2024
@rukai rukai force-pushed the catch_closed_connections_during_send branch from ffca2ef to 356bd0e on September 17, 2024 04:27
@rukai rukai marked this pull request as ready for review September 17, 2024 04:27
@rukai rukai force-pushed the catch_closed_connections_during_send branch from 356bd0e to 5722c26 on September 17, 2024 05:16
@justinweng-instaclustr
Collaborator

justinweng-instaclustr commented Sep 18, 2024

Hey Lucas, the connections.max.idle.ms in the description should be in milliseconds, right? But it's referred to in seconds.
EDIT: The seconds are correctly converted to milliseconds in the code. All good!

@rukai rukai merged commit 4853f84 into shotover:main Sep 18, 2024
41 checks passed