Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Avoid being stuck when closing the broker with extensible load manager #22573

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Apr 24, 2024

Fixes #22569

Motivation

BrokerService#closeAsync calls unloadNamespaceBundlesGracefully to unload namespaces gracefully. With extensible load manager, it eventually calls TableViewLoadDataStoreImpl#validateProducer:

BrokerService#unloadNamespaceBundlesGracefully
  ExtensibleLoadManagerWrapper#disableBroker
    ExtensibleLoadManagerImpl#disableBroker
      ServiceUnitStateChannelImpl#cleanOwnerships
        ServiceUnitStateChannelImpl#doCleanup
          TableViewLoadDataStoreImpl#removeAsync
            TableViewLoadDataStoreImpl#validateProducer

In validateProducer, if the producer is not connected, it will recreate the producer synchronously. However, since the state of PulsarService has already been changed to Closing, all connect or lookup requests will fail with ServiceNotReady. Then the client will retry until timeout.

Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the loadbalancer-service-unit-state topic.

Modifications

The major fix:
Before changing PulsarService's state to Closing, call BrokerService#unloadNamespaceBundlesGracefully first to make the load manager complete the unload operations first.

Minor fixes:

  • Record the time when LoadManager#disableBroker is done.
  • Don't check if producer is disconnected because the producer could retry if it's disconnected.

Verifications

Add ExtensibleLoadManagerCloseTest to verify closing PulsarService won't take too much time. Here are some test results locally:

2024-04-24T19:43:38,851 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310]
2024-04-24T19:44:26,711 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298]
2024-04-24T19:46:16,791 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263]
2024-04-24T20:13:05,763 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299]
2024-04-24T20:13:43,979 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310]

As you can see, each broker takes only about 3 seconds to close due to OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS value added in #20315

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: BewareMyPower#31

@BewareMyPower BewareMyPower added type/bug The PR fixed a bug or issue reported a bug area/broker labels Apr 24, 2024
@BewareMyPower BewareMyPower added this to the 3.3.0 milestone Apr 24, 2024
@BewareMyPower BewareMyPower self-assigned this Apr 24, 2024
@BewareMyPower BewareMyPower changed the title [fix][broker] Avoid being stuck in 30+ seconds when closing the BrokerService [fix][broker] Avoid being stuck too long when closing the broker with extensible load manager Apr 24, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Apr 24, 2024
…rService

Fixes apache#22569

### Motivation

`BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to
unload namespaces gracefully. With extensible load manager, it
eventually calls `TableViewLoadDataStoreImpl#validateProducer`:

```
BrokerService#unloadNamespaceBundlesGracefully
  ExtensibleLoadManagerWrapper#disableBroker
    ExtensibleLoadManagerImpl#disableBroker
      ServiceUnitStateChannelImpl#cleanOwnerships
        ServiceUnitStateChannelImpl#doCleanup
          TableViewLoadDataStoreImpl#removeAsync
            TableViewLoadDataStoreImpl#validateProducer
```

In `validateProducer`, if the producer is not connected, it will
recreate the producer synchronously. However, since the state of
`PulsarService` has already been changed to `Closing`, all connect or
lookup requests will fail with `ServiceNotReady`. Then the client will
retry until timeout.

Besides, the unload operation could also trigger the reconnection
because the extensible load manager sends the unload event to the
`loadbalancer-service-unit-state` topic.

### Modifications

The major fix:
Before changing PulsarService's state to `Closing`, call
`BrokerService#unloadNamespaceBundlesGracefully` first to make the load
manager complete the unload operations first.

Minor fixes:
- Record the time when `LoadManager#disableBroker` is done.
- Don't check if producer is disconnected because the producer could
  retry if it's disconnected.

### Verifications

Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService`
won't take too much time. Here are some test results locally:

```
2024-04-24T19:43:38,851 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310]
2024-04-24T19:44:26,711 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298]
2024-04-24T19:46:16,791 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263]
2024-04-24T20:13:05,763 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299]
2024-04-24T20:13:43,979 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310]
```

As you can see, each broker takes only about 3 seconds to close due to
`OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in
apache#20315
@BewareMyPower BewareMyPower force-pushed the bewaremypower/extensible-lb-close-gracefully branch from cbf5ac0 to a29d3b9 Compare April 24, 2024 12:26
@BewareMyPower BewareMyPower changed the title [fix][broker] Avoid being stuck too long when closing the broker with extensible load manager [fix][broker] Avoid being stuck when closing the broker with extensible load manager Apr 24, 2024
@BewareMyPower BewareMyPower marked this pull request as draft April 25, 2024 03:53
@heesung-sn
Copy link
Contributor

OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS

I think we can actually remove this. This was added to wait for some time after bundles are unloaded, but I don't think it is necessary.

@BewareMyPower
Copy link
Contributor Author

This was added to wait for some time after bundles are unloaded, but I don't think it is necessary.

Agreed. We can remove it in another PR.

@BewareMyPower BewareMyPower marked this pull request as ready for review April 25, 2024 07:53
@lhotari lhotari merged commit f411e3c into apache:master Apr 26, 2024
54 of 56 checks passed
Technoboy- pushed a commit that referenced this pull request May 8, 2024
lhotari pushed a commit that referenced this pull request May 14, 2024
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request May 15, 2024
…le load manager (apache#22573)

(cherry picked from commit f411e3c)
(cherry picked from commit db43414)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request May 16, 2024
…le load manager (apache#22573)

(cherry picked from commit f411e3c)
(cherry picked from commit db43414)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Broker could take 30+ seconds to close with extensible load manager
6 participants