Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-17696 New consumer background operations unaware of metadata errors #17440

Open
wants to merge 49 commits into
base: trunk
Choose a base branch
from

Conversation

m1a2st
Copy link
Contributor

@m1a2st m1a2st commented Oct 10, 2024

Jira: https://issues.apache.org/jira/browse/KAFKA-17696

When API calls that handle background events (e.g., poll, unsubscribe, close) encounter errors, the errors are only passed to the application thread via ErrorEvent.
Other API calls that do not process background events (e.g., position) are not notified of these errors, meaning that issues like unauthorized access to topics will go unnoticed by those operations.
Background operations are not aborted or notified when a metadata error occurs, such as an Unauthorized error, which can lead to situations where a call like position keeps waiting for an update, despite the Unauthorized error already happening.

Due to the blocking issue in applicationEventHandler.addAndGet(checkAndUpdatePositionsEvent);, I consider that we should use processBackgroundEvents to get the events, that is better than addAndGet.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kirktrue kirktrue added KIP-848 The Next Generation of the Consumer Rebalance Protocol ctr Consumer Threading Refactor (KIP-848) labels Oct 14, 2024
Copy link
Collaborator

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @m1a2st!

This looks incomplete as is 🤔

I'd thought of one suggestion, but I'm not sure if it would work or if I even like it— Add a new instance variable to store authorization exceptions (e.g. UnauthorizedTopicException) and then update processBackgroundEvents()catch block to check for authorization errors and store then in that variable. Then add a maybeThrowAuthorizationException() that conditionally throws the error if it's non-null. We'd have to clear out the exception on subscribe() or assign(), but it might work.

Please take a look at @lianetm's comments on KAFKA-17696 again as I think she has some suggestions worth pursuing.

Thanks!

@m1a2st
Copy link
Contributor Author

m1a2st commented Oct 16, 2024

Sorry for late to reply you,

but I'm not sure if it would work or if I even like it— Add a new instance variable to store authorization exceptions (e.g. UnauthorizedTopicException) and then update processBackgroundEvents()’ catch block to check for authorization errors and store then in that variable. Then add a maybeThrowAuthorizationException() that conditionally throws the error if it's non-null. We'd have to clear out the exception on subscribe() or assign(), but it might work.

Thanks @kirktrue suggestions, I find a good way to resolve the close problem, the TopicAuthorizationException[1] is similar with InvalidTopicException, thus I think I can add an or check in this if else condition [2], it is more clear and simplify
[1]

() -> releaseAssignmentAndLeaveGroup(closeTimer), firstException);

[2]
processBackgroundEvents(unsubscribeEvent.future(), timer, e -> e instanceof InvalidTopicException);

Please take a look at @lianetm's comments on KAFKA-17696 again as I think she has some suggestions worth pursuing.

I will take more deep think in these comments

rough to solve this problem, but in close there are another problem
@github-actions github-actions bot added core Kafka Broker and removed small Small PRs labels Oct 17, 2024
# Conflicts:
#	core/src/test/scala/integration/kafka/api/AuthorizerIntegrationTest.scala
@m1a2st m1a2st marked this pull request as ready for review October 17, 2024 15:34
Copy link
Collaborator

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @m1a2st!

  1. Is it sufficient to perform a single check of the background events before submitting the application event, or do we really need to perform multiple checks of the background events while we wait for the application event to complete?
  2. Do we need to perform this same check in more places than just the handful in this PR?

Comment on lines 1138 to 1142
applicationEventHandler.add(listOffsetsEvent);
offsetAndTimestampMap = processBackgroundEvents(
listOffsetsEvent.future(),
timer, __ -> false
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, we need to check for the errors from the background thread. But do we need to check repeatedly during the execution of the ListOffsetsEvent, or can we just check once beforehand?

Suggested change
applicationEventHandler.add(listOffsetsEvent);
offsetAndTimestampMap = processBackgroundEvents(
listOffsetsEvent.future(),
timer, __ -> false
);
processBackgroundEvents();
offsetAndTimestampMap = applicationEventHandler.addAndGet(listOffsetsEvent);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it sufficient to perform a single check of the background events before submitting the application event, or do we really need to perform multiple checks of the background events while we wait for the application event to complete?

It can't process processBackgroundEvents() only once before applicationEventHandler.addAndGet. Test will be fail if I change to below, I think a loop for processBackgroundEvents is necessary

 processBackgroundEvents();
offsetAndTimestampMap = applicationEventHandler.addAndGet(listOffsetsEvent);

Comment on lines 1291 to 1292
processBackgroundEvents(unsubscribeEvent.future(), timer,
e -> e instanceof InvalidTopicException || e instanceof TopicAuthorizationException || e instanceof GroupAuthorizationException);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability, could you introduce a Predicate variable, a la:

Suggested change
processBackgroundEvents(unsubscribeEvent.future(), timer,
e -> e instanceof InvalidTopicException || e instanceof TopicAuthorizationException || e instanceof GroupAuthorizationException);
final Predicate<Exception> ignoreExceptions = e ->
e instanceof InvalidTopicException ||
e instanceof TopicAuthorizationException ||
e instanceof GroupAuthorizationException;
processBackgroundEvents(unsubscribeEvent.future(), timer, ignoreExceptions);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a similar fix for AsyncKafkaConsumer#unsubscribe as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm are we sure that swallowing TopicAuth and GroupAuth on close is the right thing to do? I could surely be missing something, but I believe it's not what the classic consumer does, see my comment on it on the other PR that is also attempting this #17516 (comment)
Thoughts?

Comment on lines 1596 to 1600
applicationEventHandler.add(checkAndUpdatePositionsEvent);
cachedSubscriptionHasAllFetchPositions = processBackgroundEvents(
checkAndUpdatePositionsEvent.future(),
timer, __ -> false
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
applicationEventHandler.add(checkAndUpdatePositionsEvent);
cachedSubscriptionHasAllFetchPositions = processBackgroundEvents(
checkAndUpdatePositionsEvent.future(),
timer, __ -> false
);
processBackgroundEvents();
cachedSubscriptionHasAllFetchPositions = applicationEventHandler.addAndGet(checkAndUpdatePositionsEvent);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't process processBackgroundEvents() only once before applicationEventHandler.addAndGet. Test will be fail if I change to below, I think a loop for processBackgroundEvents is necessary

This section also failed with we only process once.

@kirktrue
Copy link
Collaborator

Also, this PR overlaps a lot with PR #17516, right?

@kirktrue
Copy link
Collaborator

I believe we need to have some filtering in the background event processing logic, because we don't want the checks to inadvertently execute the ConsumerRebalanceListenerCallbackNeededEvent if a rebalance was initiated in the background thread.

Copy link
Member

@FrankYang0529 FrankYang0529 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @m1a2st, thank for the PR. It looks like the PR is similar as #17516. I will close mine. Thanks.

Comment on lines 1291 to 1292
processBackgroundEvents(unsubscribeEvent.future(), timer,
e -> e instanceof InvalidTopicException || e instanceof TopicAuthorizationException || e instanceof GroupAuthorizationException);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a similar fix for AsyncKafkaConsumer#unsubscribe as well.

@m1a2st
Copy link
Contributor Author

m1a2st commented Oct 18, 2024

Hello @FrankYang0529 feel free to reopen PR, this PR can focus processBackgroundEvents and applicationEventHandler.addAndGet. Due to I can't pass the test, so I fix close method in this PR.

@m1a2st
Copy link
Contributor Author

m1a2st commented Oct 18, 2024

Do we need to perform this same check in more places than just the handful in this PR?

If we want process all background event from the backgroundEventQueue, I think when we call applicationEventHandler.addAndGet we always need to call processBackgroundEvents first, check the backgroundEventQueue doesn't have any error event.

@kirktrue
Copy link
Collaborator

@m1a2st—I tested this fix by merging the changes in this PR with the changes from my PR that sets the default group.protocol value to CONSUMER. Unfortunately, the integration tests that exercise authorization policies fail even with this fix 😢

If you want to test that your change works across our topic and consumer group authorization integration tests, do the following:

  1. Change the default for group.protocol in ConsumerConfig
  2. Run the integration tests which subclass EndToEndAuthorizationTest

@m1a2st
Copy link
Contributor Author

m1a2st commented Oct 31, 2024

@kirktrue , Thanks for your reminder, I will take a look at these fail tests

# Conflicts:
#	core/src/test/scala/integration/kafka/api/AuthorizerIntegrationTest.scala
@m1a2st
Copy link
Contributor Author

m1a2st commented Nov 20, 2024

Hello @lianetm, @kirktrue I will focus on NetworkClientDelegate#maybePropagateMetadataError in this PR, and revert CoordinatorRequestManager fatal error this PR, and mark fail tests which fail reason is CoordinatorRequestManager for getTestQuorumAndGroupProtocolParametersClassicGroupProtocolOnly_KAFKA_18034
Is it make sense?

@lianetm
Copy link
Contributor

lianetm commented Nov 20, 2024

I will focus on NetworkClientDelegate#maybePropagateMetadataError in this PR, and revert CoordinatorRequestManager fatal error this PR, and mark fail tests which fail reason is CoordinatorRequestManager for getTestQuorumAndGroupProtocolParametersClassicGroupProtocolOnly_KAFKA_18034

Sounds great to me! So I expect we'll end up with this PR addressing how we propagate metadata errors within the background thread to fail requests that should be aware of the error (should unblock all auth tests expecting TopicAuth error in api calls). Another PR addressing how we propagate coordinator errors within the background to fail requests similarly (unblock tests expecting GroupAuthErrors in api calls)

@m1a2st
Copy link
Contributor Author

m1a2st commented Nov 20, 2024

don't yet understand the need for passing the metadata error around in a Future. And I'm also still wondering if this could be handled at a lower layer so that we don't have to have bespoke code in the request managers to deal with it.

The rationale behind this design is that when ConsumerNetworkThread#processApplicationEvents executes checkAndUpdatePositionsEvent, the TopicAuthorizationException doesn’t surface immediately. Instead, it may require several iterations of runOnce for the error to become apparent. Without using the future-based approach, it would be impossible to propagate this error from the background thread to the OffsetsRequestManager.

@m1a2st
Copy link
Contributor Author

m1a2st commented Nov 20, 2024

I’m thinking that some test fail for methods like consumer.poll, which involve processBackgroundEvent, if a TopicAuthorizationException occurs, these two error handling mechanisms might conflict, leading to behavior that deviates from expectations.

@lianetm
Copy link
Contributor

lianetm commented Nov 20, 2024

Hey @m1a2st, sharing a thought in case it helps. First, the problem we have is that api calls like position/endOffsets trigger events that should fail with topic metadata errors but they don't, and are left hanging until they time out. So, with that in mind, it occurred to me that we do have all the events that are awaiting responses in hand when then ConsumerNetworkThread.runOnce happens, because we have them within the reaper, that keeps all the completableEvents so they can be expired eventually. Couldn't we take those events and let them know about the error when it happens? Then each event decides if it should fail on topic metadata error or not. I'm picturing something along these lines:

On ConsumerNetworkThread.runOnce:

        // 1. get metadata error that happens here
        networkClientDelegate.poll(pollWaitTimeMs, currentTimeMs);
        ...
        // 2. get all awaiting events after expiration applies (the reaper has them all, not just the ones generated on the current runOnce)
        List<CompletableApplicationEvent> awaitingEvents = reapExpiredApplicationEvents(currentTimeMs);

        // 3. notify awaiting events about the metadata error
        if (metadataError != null) {
            awaitingEvents.forEach(e -> e.onMetadataError(metadataError));
        }

Would that work? I see that the main advantages would be to avoid the complexity of metadata future errors passed around to specific manager calls, and also it would be a solution applied consistently to all events (each event type then deciding if it should fail or not on topic metadata errors). onMetadataError, events could no-op by default, and some should override to simply do future.completeExceptionally, ex. CheckAndUpdatePositionsEvent, CommitEvent (these two seem to be the ones leading to the failed tests in the Authorizer file, we can get into details later about what others should consider the error).

I could be missing something but sharing in case it helps! Let me know.

@lianetm
Copy link
Contributor

lianetm commented Nov 21, 2024

I’m thinking that some test fail for methods like consumer.poll, which involve processBackgroundEvent, if a TopicAuthorizationException occurs, these two error handling mechanisms might conflict, leading to behavior that deviates from expectations.

Sorry I missed this comment before. Great point, the issue is that with this PR (no matter how we implement it) we end up failing api calls/events on metadata errors, but still also keeping the previous logic that generated an ErrorEvent for them.

We were propagating metadata errors via ErrorEvent thinking that it was only meant to be consumed from poll (which was a wrong assumption). If, with this PR, we introduce a mechanism to propagate it via the api events, I wonder if we should consider removing the redundant ErrorEvent for this case? (without ErrorEvent, poll would still fail as expected, because the CheckAndUpdatePositions would fail with the auth error)

@m1a2st
Copy link
Contributor Author

m1a2st commented Nov 22, 2024

Hello @lianetm, Sorry for the late response.

Would that work? I see that the main advantages would be to avoid the complexity of metadata future errors passed around to specific manager calls, and also it would be a solution applied consistently to all events (each event type then deciding if it should fail or not on topic metadata errors).

I think this approach is great significantly simplifies the system by eliminating the need to pass CompletedFuture around, which reduces complexity. Also, based on current testing, the failing tests are still just these few.

@lianetm
Copy link
Contributor

lianetm commented Nov 22, 2024

Hey @m1a2st , just FYI, we just enabled some auth tests that were marked as blocked on this issue, but are really not blocked on this (#17885). So just merge trunk latest changes, and we can have this PR addressing/enabling only what's really related to this fix. Thanks!

@github-actions github-actions bot added the small Small PRs label Nov 23, 2024
@m1a2st
Copy link
Contributor Author

m1a2st commented Nov 23, 2024

Hello @lianetm, Thanks for your review.

We were propagating metadata errors via ErrorEvent thinking that it was only meant to be consumed from poll (which was a wrong assumption). If, with this PR, we introduce a mechanism to propagate it via the api events, I wonder if we should consider removing the redundant ErrorEvent for this case? (without ErrorEvent, poll would still fail as expected, because the CheckAndUpdatePositions would fail with the auth error)

Based on this issue, the most straightforward solution I can think of at the moment is to add a new attribute in the event to determine whether the method call requires the use of the completedFuture for transmission. I have already drafted a version for this approach. WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clients consumer core Kafka Broker ctr Consumer Threading Refactor (KIP-848) KIP-848 The Next Generation of the Consumer Rebalance Protocol small Small PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants