KAFKA-17696 New consumer background operations unaware of metadata errors #17440

m1a2st · 2024-10-10T06:12:01Z

Jira: https://issues.apache.org/jira/browse/KAFKA-17696

When API calls that handle background events (e.g., poll, unsubscribe, close) encounter errors, the errors are only passed to the application thread via ErrorEvent.
Other API calls that do not process background events (e.g., position) are not notified of these errors, meaning that issues like unauthorized access to topics will go unnoticed by those operations.
Background operations are not aborted or notified when a metadata error occurs, such as an Unauthorized error, which can lead to situations where a call like position keeps waiting for an update, despite the Unauthorized error already happening.

Because applicationEventHandler.addAndGet(checkAndUpdatePositionsEvent); blocks, I think we should use processBackgroundEvents to wait for the result instead of addAndGet.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

kirktrue

Thanks @m1a2st!

This looks incomplete as is 🤔

I'd thought of one suggestion, but I'm not sure if it would work or if I even like it: add a new instance variable to store authorization exceptions (e.g. UnauthorizedTopicException), then update the catch block in processBackgroundEvents() to check for authorization errors and store them in that variable. Then add a maybeThrowAuthorizationException() that conditionally throws the error if it's non-null. We'd have to clear out the exception on subscribe() or assign(), but it might work.

Please take a look at @lianetm's comments on KAFKA-17696 again as I think she has some suggestions worth pursuing.

Thanks!
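The suggestion above (store the authorization error, rethrow it from calls that do not process background events, clear it on subscribe() or assign()) could be sketched roughly as follows. This is only an illustration: StoredAuthErrorSketch, AuthorizationFailure, onBackgroundError and the placeholder position() are hypothetical names, not code from AsyncKafkaConsumer, and the stand-in exception replaces Kafka's real authorization exceptions so the example compiles on its own.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the consumer; not the real AsyncKafkaConsumer.
class StoredAuthErrorSketch {

    // Hypothetical stand-in for Kafka's authorization exceptions.
    static class AuthorizationFailure extends RuntimeException {
        AuthorizationFailure(String message) {
            super(message);
        }
    }

    // Holds the last authorization error seen while processing background events.
    private final AtomicReference<AuthorizationFailure> storedAuthError = new AtomicReference<>();

    // Would be called from the catch block of the background event processing.
    void onBackgroundError(RuntimeException e) {
        if (e instanceof AuthorizationFailure) {
            storedAuthError.set((AuthorizationFailure) e);
        }
    }

    // Throws the stored error, if any; otherwise does nothing.
    void maybeThrowAuthorizationException() {
        AuthorizationFailure e = storedAuthError.get();
        if (e != null) {
            throw e;
        }
    }

    // A new subscription or assignment makes the old error stale, so clear it.
    void subscribe() {
        storedAuthError.set(null);
    }

    // An API call that does not process background events checks the stored error first.
    long position() {
        maybeThrowAuthorizationException();
        return 0L; // placeholder value; the real call would fetch the position
    }
}
```

With this shape, a call such as position() fails fast once an authorization error has been recorded, instead of waiting for an update that will never arrive.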

m1a2st · 2024-10-16T05:53:06Z

Sorry for the late reply.

but I'm not sure if it would work or if I even like it: add a new instance variable to store authorization exceptions (e.g. UnauthorizedTopicException), then update the catch block in processBackgroundEvents() to check for authorization errors and store them in that variable. Then add a maybeThrowAuthorizationException() that conditionally throws the error if it's non-null. We'd have to clear out the exception on subscribe() or assign(), but it might work.

Thanks for the suggestions, @kirktrue. I found a good way to resolve the close problem: TopicAuthorizationException [1] is similar to InvalidTopicException, so I think I can add an "or" check to this condition [2], which is clearer and simpler.
[1]

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

Line 1232 in 604564c

() -> releaseAssignmentAndLeaveGroup(closeTimer), firstException);

[2]

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

Line 1283 in 604564c

    
           processBackgroundEvents(unsubscribeEvent.future(), timer, e -> e instanceof InvalidTopicException);

Please take a look at @lianetm's comments on KAFKA-17696 again as I think she has some suggestions worth pursuing.

I will think about these comments more deeply.


kirktrue

Thanks for the PR, @m1a2st!

Is it sufficient to perform a single check of the background events before submitting the application event, or do we really need to perform multiple checks of the background events while we wait for the application event to complete?
Do we need to perform this same check in more places than just the handful in this PR?

kirktrue · 2024-10-17T17:33:26Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

+                applicationEventHandler.add(listOffsetsEvent);
+                offsetAndTimestampMap = processBackgroundEvents(
+                        listOffsetsEvent.future(),
+                        timer, __ -> false
+                );


As I understand it, we need to check for the errors from the background thread. But do we need to check repeatedly during the execution of the ListOffsetsEvent, or can we just check once beforehand?

Suggested change

-                applicationEventHandler.add(listOffsetsEvent);
-                offsetAndTimestampMap = processBackgroundEvents(
-                        listOffsetsEvent.future(),
-                        timer, __ -> false
-                );
+                processBackgroundEvents();
+                offsetAndTimestampMap = applicationEventHandler.addAndGet(listOffsetsEvent);

Is it sufficient to perform a single check of the background events before submitting the application event, or do we really need to perform multiple checks of the background events while we wait for the application event to complete?

Processing the background events only once before applicationEventHandler.addAndGet doesn't work. The tests fail if I change it to the code below, so I think a loop around processBackgroundEvents is necessary:

                processBackgroundEvents();
                offsetAndTimestampMap = applicationEventHandler.addAndGet(listOffsetsEvent);
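To illustrate why a single check beforehand can miss the error, here is a minimal sketch of a wait loop that re-checks a background error queue while waiting for the result. The names (WaitLoopSketch, awaitWhileCheckingErrors, the 10 ms slice) are assumptions made for this example; the real processBackgroundEvents in AsyncKafkaConsumer has a different signature and uses a Timer.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; only illustrates the "check repeatedly while waiting" idea.
class WaitLoopSketch {

    // Waits for the future, but rethrows any error the background thread
    // enqueues in the meantime instead of blocking until the timeout.
    static <T> T awaitWhileCheckingErrors(CompletableFuture<T> future,
                                          BlockingQueue<RuntimeException> backgroundErrors,
                                          long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            // An error may arrive after the first check, so poll on every iteration.
            RuntimeException error = backgroundErrors.poll();
            if (error != null) {
                throw error;
            }
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                throw new IllegalStateException("timed out waiting for the event to complete");
            }
            try {
                // Wait in short slices so the error queue is re-checked regularly.
                return future.get(Math.min(remaining, 10), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Not complete yet; loop and check the error queue again.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
    }
}
```

If the error is enqueued after the wait has started, a check performed only once before the blocking call would never see it, which matches the test failures described above.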

kirktrue · 2024-10-17T17:38:06Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

+            processBackgroundEvents(unsubscribeEvent.future(), timer,
+                    e -> e instanceof InvalidTopicException || e instanceof TopicAuthorizationException || e instanceof GroupAuthorizationException);


For readability, could you introduce a Predicate variable, a la:

Suggested change

-            processBackgroundEvents(unsubscribeEvent.future(), timer,
-                    e -> e instanceof InvalidTopicException || e instanceof TopicAuthorizationException || e instanceof GroupAuthorizationException);
+            final Predicate<Exception> ignoreExceptions = e ->
+                    e instanceof InvalidTopicException ||
+                    e instanceof TopicAuthorizationException ||
+                    e instanceof GroupAuthorizationException;
+            processBackgroundEvents(unsubscribeEvent.future(), timer, ignoreExceptions);

I think we need a similar fix for AsyncKafkaConsumer#unsubscribe as well.

Hmm, are we sure that swallowing TopicAuth and GroupAuth on close is the right thing to do? I could be missing something, but I believe it's not what the classic consumer does; see my comment on the other PR that is also attempting this: #17516 (comment)
Thoughts?

kirktrue · 2024-10-17T17:41:24Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

+            applicationEventHandler.add(checkAndUpdatePositionsEvent);
+            cachedSubscriptionHasAllFetchPositions = processBackgroundEvents(
+                    checkAndUpdatePositionsEvent.future(),
+                    timer, __ -> false
+            );


Suggested change

-            applicationEventHandler.add(checkAndUpdatePositionsEvent);
-            cachedSubscriptionHasAllFetchPositions = processBackgroundEvents(
-                    checkAndUpdatePositionsEvent.future(),
-                    timer, __ -> false
-            );
+            processBackgroundEvents();
+            cachedSubscriptionHasAllFetchPositions = applicationEventHandler.addAndGet(checkAndUpdatePositionsEvent);

Processing the background events only once before applicationEventHandler.addAndGet doesn't work. The tests fail if I change it to the code below, so I think a loop around processBackgroundEvents is necessary:

This section also fails if we only process once.

kirktrue · 2024-10-17T17:49:36Z

Also, this PR overlaps a lot with PR #17516, right?

kirktrue · 2024-10-17T17:52:20Z

I believe we need to have some filtering in the background event processing logic, because we don't want the checks to inadvertently execute the ConsumerRebalanceListenerCallbackNeededEvent if a rebalance was initiated in the background thread.
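The filtering concern above can be illustrated with a small sketch that surfaces only error events and leaves every other event in the queue for a later poll(). The event types here (BackgroundEvent, ErrorEvent, RebalanceCallbackNeededEvent) are simplified stand-ins defined for the example, and throwFirstError is a hypothetical helper, not a method of AsyncKafkaConsumer. The real background event queue is shared between threads, so a real implementation would need a thread-safe queue.

```java
import java.util.Iterator;
import java.util.Queue;

// Hypothetical sketch of filtered background event processing.
class FilteredDrainSketch {

    // Simplified stand-ins for the consumer's background event types.
    interface BackgroundEvent { }

    static final class ErrorEvent implements BackgroundEvent {
        final RuntimeException error;

        ErrorEvent(RuntimeException error) {
            this.error = error;
        }
    }

    static final class RebalanceCallbackNeededEvent implements BackgroundEvent { }

    // Removes and rethrows the first queued error. All other events
    // (e.g. rebalance callbacks) stay in the queue untouched.
    static void throwFirstError(Queue<BackgroundEvent> queue) {
        Iterator<BackgroundEvent> it = queue.iterator();
        while (it.hasNext()) {
            BackgroundEvent event = it.next();
            if (event instanceof ErrorEvent) {
                it.remove();
                throw ((ErrorEvent) event).error;
            }
        }
    }
}
```

With a filter like this, a call such as position() could check for errors without inadvertently running a rebalance listener callback.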

FrankYang0529

Hi @m1a2st, thanks for the PR. It looks like this PR is similar to #17516. I will close mine. Thanks.

FrankYang0529 · 2024-10-18T04:15:27Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

+            processBackgroundEvents(unsubscribeEvent.future(), timer,
+                    e -> e instanceof InvalidTopicException || e instanceof TopicAuthorizationException || e instanceof GroupAuthorizationException);


I think we need a similar fix for AsyncKafkaConsumer#unsubscribe as well.

m1a2st · 2024-10-18T12:27:12Z

Hello @FrankYang0529, feel free to reopen your PR. This PR can focus on processBackgroundEvents and applicationEventHandler.addAndGet. I only fixed the close method in this PR because I couldn't get the tests to pass otherwise.

m1a2st · 2024-10-18T16:18:52Z

Do we need to perform this same check in more places than just the handful in this PR?

If we want to process all background events from the backgroundEventQueue, I think we always need to call processBackgroundEvents before calling applicationEventHandler.addAndGet, to check that the backgroundEventQueue doesn't contain any error event.

kirktrue · 2024-10-30T22:12:52Z

@m1a2st—I tested this fix by merging the changes in this PR with the changes from my PR that sets the default group.protocol value to CONSUMER. Unfortunately, the integration tests that exercise authorization policies fail even with this fix 😢

If you want to test that your change works across our topic and consumer group authorization integration tests, do the following:

Change the default for group.protocol in ConsumerConfig
Run the integration tests which subclass EndToEndAuthorizationTest

m1a2st · 2024-10-31T12:18:05Z

@kirktrue, thanks for the reminder. I will take a look at these failing tests.

m1a2st · 2024-11-20T12:15:35Z

Hello @lianetm, @kirktrue. In this PR I will focus on NetworkClientDelegate#maybePropagateMetadataError, revert the CoordinatorRequestManager fatal error change, and mark the tests that fail because of CoordinatorRequestManager with getTestQuorumAndGroupProtocolParametersClassicGroupProtocolOnly_KAFKA_18034.
Does that make sense?

lianetm · 2024-11-20T15:28:31Z

In this PR I will focus on NetworkClientDelegate#maybePropagateMetadataError, revert the CoordinatorRequestManager fatal error change, and mark the tests that fail because of CoordinatorRequestManager with getTestQuorumAndGroupProtocolParametersClassicGroupProtocolOnly_KAFKA_18034.

Sounds great to me! So I expect we'll end up with this PR addressing how we propagate metadata errors within the background thread to fail requests that should be aware of the error (should unblock all auth tests expecting TopicAuth error in api calls). Another PR addressing how we propagate coordinator errors within the background to fail requests similarly (unblock tests expecting GroupAuthErrors in api calls)


m1a2st · 2024-11-20T17:40:50Z

don't yet understand the need for passing the metadata error around in a Future. And I'm also still wondering if this could be handled at a lower layer so that we don't have to have bespoke code in the request managers to deal with it.

The rationale behind this design is that when ConsumerNetworkThread#processApplicationEvents executes checkAndUpdatePositionsEvent, the TopicAuthorizationException doesn’t surface immediately. Instead, it may require several iterations of runOnce for the error to become apparent. Without using the future-based approach, it would be impossible to propagate this error from the background thread to the OffsetsRequestManager.

m1a2st · 2024-11-20T17:42:09Z

I think some tests fail for methods like consumer.poll, which involve processBackgroundEvent: if a TopicAuthorizationException occurs, the two error-handling mechanisms might conflict, leading to behavior that deviates from expectations.

lianetm · 2024-11-20T19:54:08Z

Hey @m1a2st, sharing a thought in case it helps. First, the problem we have is that api calls like position/endOffsets trigger events that should fail with topic metadata errors but they don't, and are left hanging until they time out. So, with that in mind, it occurred to me that we do have all the events that are awaiting responses in hand when ConsumerNetworkThread.runOnce happens, because we have them within the reaper, which keeps all the completableEvents so they can be expired eventually. Couldn't we take those events and let them know about the error when it happens? Then each event decides if it should fail on topic metadata error or not. I'm picturing something along these lines:

On ConsumerNetworkThread.runOnce:

        // 1. get metadata error that happens here
        networkClientDelegate.poll(pollWaitTimeMs, currentTimeMs);
        ...
        // 2. get all awaiting events after expiration applies (the reaper has them all, not just the ones generated on the current runOnce)
        List<CompletableApplicationEvent> awaitingEvents = reapExpiredApplicationEvents(currentTimeMs);

        // 3. notify awaiting events about the metadata error
        if (metadataError != null) {
            awaitingEvents.forEach(e -> e.onMetadataError(metadataError));
        }

Would that work? I see that the main advantages would be to avoid the complexity of metadata future errors passed around to specific manager calls, and also it would be a solution applied consistently to all events (each event type then deciding if it should fail or not on topic metadata errors). In onMetadataError, events could no-op by default, and some should override it to simply do future.completeExceptionally, e.g. CheckAndUpdatePositionsEvent and CommitEvent (these two seem to be the ones leading to the failed tests in the Authorizer file; we can get into details later about which others should consider the error).

I could be missing something but sharing in case it helps! Let me know.
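The event side of the proposal above could look roughly like the following sketch. The class names (MetadataErrorSketch, CompletableEvent, PositionsEvent, UnaffectedEvent) and the notifyAwaitingEvents helper are hypothetical and simplified for illustration; they are not the actual CompletableApplicationEvent hierarchy.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified model of events that can be told about a metadata error.
class MetadataErrorSketch {

    // Base event: by default a metadata error is ignored (no-op).
    static class CompletableEvent<T> {
        final CompletableFuture<T> future = new CompletableFuture<>();

        void onMetadataError(Exception metadataError) {
            // Default: this event type does not care about metadata errors.
        }
    }

    // An event that should fail as soon as a topic metadata error is known.
    static class PositionsEvent extends CompletableEvent<Boolean> {
        @Override
        void onMetadataError(Exception metadataError) {
            future.completeExceptionally(metadataError);
        }
    }

    // An event that keeps waiting even if a metadata error happens.
    static class UnaffectedEvent extends CompletableEvent<Void> { }

    // Would run in the background thread after the network poll:
    // every awaiting event is notified, and each one decides what to do.
    static void notifyAwaitingEvents(List<? extends CompletableEvent<?>> awaitingEvents,
                                     Exception metadataError) {
        if (metadataError != null) {
            awaitingEvents.forEach(e -> e.onMetadataError(metadataError));
        }
    }
}
```

The design choice is that the notification loop stays generic, while the decision to fail lives in each event type, so no request manager needs bespoke handling.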

lianetm · 2024-11-21T02:46:53Z

I think some tests fail for methods like consumer.poll, which involve processBackgroundEvent: if a TopicAuthorizationException occurs, the two error-handling mechanisms might conflict, leading to behavior that deviates from expectations.

Sorry I missed this comment before. Great point, the issue is that with this PR (no matter how we implement it) we end up failing api calls/events on metadata errors, but still also keeping the previous logic that generated an ErrorEvent for them.

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/NetworkClientDelegate.java

Line 157 in e73edce

backgroundEventHandler.add(new ErrorEvent(e));

We were propagating metadata errors via ErrorEvent thinking that it was only meant to be consumed from poll (which was a wrong assumption). If, with this PR, we introduce a mechanism to propagate it via the api events, I wonder if we should consider removing the redundant ErrorEvent for this case? (without ErrorEvent, poll would still fail as expected, because the CheckAndUpdatePositions would fail with the auth error)

m1a2st · 2024-11-22T13:18:29Z

Hello @lianetm, Sorry for the late response.

Would that work? I see that the main advantages would be to avoid the complexity of metadata future errors passed around to specific manager calls, and also it would be a solution applied consistently to all events (each event type then deciding if it should fail or not on topic metadata errors).

I think this approach is great: it significantly simplifies the system by eliminating the need to pass CompletedFuture around. Also, based on current testing, the failing tests are still just these few.

lianetm · 2024-11-22T21:39:48Z

Hey @m1a2st , just FYI, we just enabled some auth tests that were marked as blocked on this issue, but are really not blocked on this (#17885). So just merge trunk latest changes, and we can have this PR addressing/enabling only what's really related to this fix. Thanks!

m1a2st · 2024-11-23T08:48:26Z

Hello @lianetm, Thanks for your review.

We were propagating metadata errors via ErrorEvent thinking that it was only meant to be consumed from poll (which was a wrong assumption). If, with this PR, we introduce a mechanism to propagate it via the api events, I wonder if we should consider removing the redundant ErrorEvent for this case? (without ErrorEvent, poll would still fail as expected, because the CheckAndUpdatePositions would fail with the auth error)

Based on this issue, the most straightforward solution I can think of at the moment is to add a new attribute in the event to determine whether the method call requires the use of the completedFuture for transmission. I have already drafted a version for this approach. WDYT?

github-actions bot added consumer clients small Small PRs labels Oct 10, 2024


m1a2st requested review from lianetm and kirktrue October 10, 2024 06:57

kirktrue added KIP-848 The Next Generation of the Consumer Rebalance Protocol ctr Consumer Threading Refactor (KIP-848) labels Oct 14, 2024

kirktrue reviewed Oct 15, 2024


when closing add new check condition

e4fec8c

rough to solve this problem, but in close there are another problem

m1a2st force-pushed the KAFKA-17696 branch from 8a1f036 to e4fec8c Compare October 16, 2024 06:02

m1a2st added 2 commits October 16, 2024 22:28

fix all fail test by processBackgroundEvents

e845101

update consumer test from KAFKA-17337

498503c

github-actions bot added core Kafka Broker and removed small Small PRs labels Oct 17, 2024

m1a2st added 3 commits October 17, 2024 23:29

Merge branch 'trunk' into KAFKA-17696

83e79d4

# Conflicts: # core/src/test/scala/integration/kafka/api/AuthorizerIntegrationTest.scala

revert unused change

83d0ebf

revert unused change

bda4f20

m1a2st marked this pull request as ready for review October 17, 2024 15:34

kirktrue suggested changes Oct 17, 2024


kirktrue mentioned this pull request Oct 17, 2024

KAFKA-17648: AsyncKafkaConsumer#unsubscribe swallow TopicAuthorizationException and GroupAuthorizationException #17516

Merged

3 tasks

FrankYang0529 reviewed Oct 18, 2024


Merge branch 'trunk' into KAFKA-17696

4ae7faf

# Conflicts: # core/src/test/scala/integration/kafka/api/AuthorizerIntegrationTest.scala

remove AtomicReference

0ff7deb

m1a2st added 7 commits November 20, 2024 23:49

Revert "fix the error flow and addressed the comment"

1fea361

This reverts commit 6c4c53b

revert change

ef2e152

revert change

83fc619

revert change

a5ef8d9

revert change

00cc9a0

revert change

a8c642f

add test for getTestQuorumAndGroupProtocolParametersClassicGroupProto…

e73edce

…colOnly_KAFKA_18034

m1a2st added 4 commits November 22, 2024 22:25

addressed by comment

1016683

addressed by comment

ba61387

init metadata error

f311b13

Merge branch 'trunk' into KAFKA-17696

3178919

m1a2st added 2 commits November 23, 2024 09:29

temp

a814e9d

Merge branch 'trunk' into KAFKA-17696

cd7fc8b

github-actions bot added the small Small PRs label Nov 23, 2024

m1a2st added 5 commits November 23, 2024 12:07

Merge branch 'trunk' into KAFKA-17696

7a58504

update test

38f76aa

revert unused change

3e69e1b

revert unused change

3a79ae4

draft application thread to tell background thread which call the event

bb67c22

draft application thread to tell background thread which call the event

8b9458d
