Implement routing for ListOffsets #1767

Merged
merged 1 commit into from
Oct 9, 2024



@rukai rukai commented Oct 4, 2024

From inspecting the messages passing through shotover, I could see that ListOffsets requests are hitting NOT_LEADER_OR_FOLLOWER errors (error code 6).

For example:

shotover   02:07:50.445929Z  INFO connection{id=41 source="kafka"}: shotover::transforms::debug::printer: Response: Kafka version:7 correlation_id:54 ListOffsets(ListOffsetsResponse { throttle_time_ms: 0, topics: [ListOffsetsTopicResponse { name: TopicName("partitions3_case4"), partitions: [ListOffsetsPartitionResponse { partition_index: 0, error_code: 6, old_style_offsets: [], timestamp: -1, offset: -1, leader_epoch: -1, unknown_tagged_fields: {} }], unknown_tagged_fields: {} }], unknown_tagged_fields: {} })

After some retries it will eventually hit the correct node and succeed:

shotover   02:07:51.047014Z  INFO connection{id=41 source="kafka"}: shotover::transforms::debug::printer: Response: Kafka version:7 correlation_id:60 ListOffsets(ListOffsetsResponse { throttle_time_ms: 0, topics: [ListOffsetsTopicResponse { name: TopicName("partitions3_case4"), partitions: [ListOffsetsPartitionResponse { partition_index: 0, error_code: 0, old_style_offsets: [], timestamp: -1, offset: 0, leader_epoch: 0, unknown_tagged_fields: {} }], unknown_tagged_fields: {} }], unknown_tagged_fields: {} })

It seems that the Java driver is robust enough to retry only the parts of the request that failed, so it eventually completes successfully and moves on. However, erroring like this is still quite slow and wasteful.
To fix these errors, we need to split and combine the request like we do for fetch and produce messages.

So this PR implements that split/combine logic for ListOffsets.
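
As a rough illustration of the split/combine idea (not shotover's actual implementation), the sketch below groups the partitions of one request by their leader broker and then merges the per-broker responses back into a single response. All type and function names here (`PartitionRequest`, `PartitionResponse`, `split_by_leader`, `combine`) and the `fallback` broker parameter are hypothetical and exist only for this example.

```rust
use std::collections::{BTreeMap, HashMap};

// Hypothetical simplified types; the real Kafka protocol structs carry more fields.
type BrokerId = i32;

#[derive(Debug, Clone, PartialEq)]
struct PartitionRequest {
    topic: String,
    partition: i32,
}

#[derive(Debug, Clone, PartialEq)]
struct PartitionResponse {
    topic: String,
    partition: i32,
    error_code: i16,
    offset: i64,
}

/// Split one logical request into one sub-request per leader broker.
/// Partitions with no known leader are sent to `fallback` (an assumption
/// made for this sketch).
fn split_by_leader(
    partitions: &[PartitionRequest],
    leaders: &HashMap<(String, i32), BrokerId>,
    fallback: BrokerId,
) -> BTreeMap<BrokerId, Vec<PartitionRequest>> {
    let mut routed = BTreeMap::new();
    for p in partitions {
        let broker = leaders
            .get(&(p.topic.clone(), p.partition))
            .copied()
            .unwrap_or(fallback);
        routed.entry(broker).or_insert_with(Vec::new).push(p.clone());
    }
    routed
}

/// Combine the per-broker responses into a single response,
/// ordered by topic name and then partition index.
fn combine(responses: Vec<Vec<PartitionResponse>>) -> Vec<PartitionResponse> {
    let mut combined: Vec<PartitionResponse> = responses.into_iter().flatten().collect();
    combined.sort_by(|a, b| (&a.topic, a.partition).cmp(&(&b.topic, b.partition)));
    combined
}
```

With routing like this, each partition is only ever asked of the broker that leads it, so the client should not see NOT_LEADER_OR_FOLLOWER unless the leader metadata is stale.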

I added an integration test that calls the listOffsets method from the admin API.
This test did not fail before the fix, due to Java's robust retry mechanism, but I've manually verified that the errors are gone with the new routing logic, and the new test adds more coverage of the driver.

Another nice outcome of this fix is that cluster_1_rack_single_shotover::case_2_java now completes in 100s, down from 110s.


codspeed-hq bot commented Oct 4, 2024

CodSpeed Performance Report

Merging #1767 will not alter performance

Comparing rukai:list_offsets_routing (74f4d3b) with main (9801ed4)

Summary

✅ 39 untouched benchmarks

@rukai rukai force-pushed the list_offsets_routing branch 4 times, most recently from 5c8357b to fd7153c Compare October 9, 2024 03:23
@rukai rukai marked this pull request as ready for review October 9, 2024 03:50
@rukai rukai force-pushed the list_offsets_routing branch from 925c077 to be9dfa3 Compare October 9, 2024 03:52
@rukai rukai force-pushed the list_offsets_routing branch from be9dfa3 to 74f4d3b Compare October 9, 2024 04:44
@rukai rukai enabled auto-merge (squash) October 9, 2024 05:03
@rukai rukai merged commit 902273d into shotover:main Oct 9, 2024
41 checks passed
3 participants