SolaceIO.Read: handle occasional cases when finalizeCheckpoint is not executed #32962

Merged: 4 commits merged into apache:master on Dec 2, 2024

Conversation

@bzablocki (Contributor) commented Oct 29, 2024

Addresses #32596

This PR fixes a bug where messages get stuck on the Solace queue because the checkpoint's finalize method, responsible for acknowledging message processing, isn't always called.

Revisited approach (deprecated approach below):

  1. In advance(), add received messages to the receivedMessages list.
  2. In the getCheckpointMark() method, all the receivedMessages are moved to the safeToAckMessages list. The safeToAckMessages list is passed by reference to the CheckpointMark object.
  3. Each CheckpointMark object created from a given Reader has a reference to the same safeToAckMessages list. On the finalizeCheckpoint() call, it acknowledges all messages available for acknowledgement.
  4. The proactive acknowledgement of safeToAckMessages also happens in the advance() and close() methods. This means that once getCheckpointMark() marks messages as ready for checkpoint, they can be acknowledged in any of three places: finalizeCheckpoint(), advance(), or close(). A sketch of this flow is shown below the list.
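
A minimal sketch of this flow (hypothetical, simplified names; the real classes are UnboundedSolaceReader and SolaceCheckpointMark, and the real code uses the Solace message type):

```java
// Hypothetical, simplified sketch of the revisited approach; not the actual implementation.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class ReaderSketch {
  // Received in advance() but not yet covered by a checkpoint.
  private final Queue<String> receivedMessages = new ConcurrentLinkedQueue<>();
  // Covered by a checkpoint and safe to acknowledge; shared by reference with
  // every checkpoint mark created by this reader, so whoever runs first acks them.
  private final Queue<String> safeToAckMessages = new ConcurrentLinkedQueue<>();

  boolean advance() {
    ackSafeMessages(); // opportunistic ack (place 2 of 3)
    String msg = receiveFromBroker();
    if (msg == null) {
      return false;
    }
    receivedMessages.add(msg);
    return true;
  }

  CheckpointMarkSketch getCheckpointMark() {
    // Everything received so far becomes eligible for acknowledgement.
    String msg;
    while ((msg = receivedMessages.poll()) != null) {
      safeToAckMessages.add(msg);
    }
    return new CheckpointMarkSketch(safeToAckMessages);
  }

  void close() {
    ackSafeMessages(); // last chance before the session goes away (place 3 of 3)
  }

  private void ackSafeMessages() {
    String msg;
    while ((msg = safeToAckMessages.poll()) != null) {
      ackOnBroker(msg);
    }
  }

  // Placeholders standing in for the Solace client calls.
  private String receiveFromBroker() { return null; }
  private void ackOnBroker(String msg) {}

  class CheckpointMarkSketch {
    private final Queue<String> safeToAck; // shared reference, not a copy

    CheckpointMarkSketch(Queue<String> safeToAck) {
      this.safeToAck = safeToAck;
    }

    void finalizeCheckpoint() {
      String msg;
      while ((msg = safeToAck.poll()) != null) {
        ackOnBroker(msg); // place 1 of 3
      }
    }
  }
}
```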

Deprecated approach:
The approach is similar to the one in PubSubIO. The flow there is the following:

  1. In advance(), we copy the message to the safeToAckIds list (source).
  2. When the runner requests a checkpoint, we create it with a copy of the safeToAckIds list (source).
  3. When the checkpoint is finalized, we add the messages to a queue of finalized messages (source).
  4. In the advance() method, we remove whatever is in the queue of finalized messages from safeToAckIds (source). A rough sketch of this flow follows the list.
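
For comparison, a rough sketch of that deprecated, PubSubIO-style flow (hypothetical names; the actual acknowledgement call is omitted, as it is handled elsewhere in that design):

```java
// Hypothetical sketch of the deprecated flow: the checkpoint holds a copy of the
// safe-to-ack ids and hands finalized ids back to the reader through a queue.
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class DeprecatedFlowSketch {
  private final List<String> safeToAckIds = new ArrayList<>();
  private final Queue<String> finalizedIds = new ConcurrentLinkedQueue<>();

  boolean advance() {
    // Step 4: drop whatever has already been finalized from the safe-to-ack list.
    String id;
    while ((id = finalizedIds.poll()) != null) {
      safeToAckIds.remove(id);
    }
    String msgId = receiveFromBroker();
    if (msgId == null) {
      return false;
    }
    safeToAckIds.add(msgId); // step 1: remember the message id
    return true;
  }

  CheckpointSketch getCheckpointMark() {
    return new CheckpointSketch(new ArrayList<>(safeToAckIds)); // step 2: a copy
  }

  class CheckpointSketch {
    private final List<String> ids;

    CheckpointSketch(List<String> ids) {
      this.ids = ids;
    }

    void finalizeCheckpoint() {
      finalizedIds.addAll(ids); // step 3: hand the finalized ids back to the reader
    }
  }

  private String receiveFromBroker() { return null; } // placeholder for the broker call
}
```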

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md


See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@bzablocki (Contributor Author) commented:

cc @ppawel - since I wasn't able to reproduce the issue you reported, would you mind testing it in your environment?

Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ppawel commented Nov 11, 2024

> cc @ppawel - since I wasn't able to reproduce the issue you reported, would you mind testing it in your environment?

Sorry for the late reply, I haven't had much time recently for this topic, but last week I set up a streaming job in our test environment with your branch and just left it running, consuming the same messages as the prod job. This job had the Google Streaming Engine enabled.

Today I checked the job and unfortunately there were over 2k messages spooled in the Solace queue. I haven't investigated deeper what exactly the checkpointing behavior is with the new branch; hopefully I will get more time for this again, as we really want to fully switch from JmsIO to SolaceIO, preferably with Streaming Engine enabled.

// It's possible for a checkpoint to be taken but never finalized.
// So we simply copy whatever safeToAckIds we currently have.
Map<Long, BytesXMLMessage> snapshotSafeToAckMessages = Maps.newHashMap(safeToAckMessages);
return new SolaceCheckpointMark(this::markAsAcked, active, snapshotSafeToAckMessages);
Contributor:

The checkpoint mark does not need to be aware of whether the reader is or isn't active. The call to close may be observed by the finalizing thread to first close the underlying reader before atomically setting the active flag to false. Put a try/catch statement around the acknowledgement loop in the checkpoint mark's finalizer, if it trips on IllegalStateException, then the reader has been closed and none of the queued elements can be acknowledged.

@bzablocki (Contributor Author), Nov 19, 2024:

Good point, but I added the try/catch only around the ackMessage method call. I think it could fail for reasons other than a closed reader.
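
A hedged sketch of that narrower try/catch, assuming the JCSMP BytesXMLMessage type (an illustration, not the exact code from this PR):

```java
import com.solacesystems.jcsmp.BytesXMLMessage;
import java.util.List;

class AckSketch {
  // Wrap only the ackMessage() call: if one ack fails because the underlying
  // session was already closed, skip that message (Solace will redeliver it)
  // instead of aborting the whole acknowledgement loop.
  static void ackAll(List<BytesXMLMessage> safeToAckMessages) {
    for (BytesXMLMessage msg : safeToAckMessages) {
      try {
        msg.ackMessage();
      } catch (IllegalStateException e) {
        // Reader/session closed; the unacknowledged message will be redelivered.
      }
    }
  }
}
```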

@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch 4 times, most recently from 42a0cbd to 740646d Compare November 19, 2024 15:26
…tion strategy and close services more eagerly.
@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch from 740646d to da95027 Compare November 19, 2024 15:33
@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch from fa98b30 to 48520ea Compare November 19, 2024 16:04
@bzablocki (Contributor Author) commented:

I believe that messages are left unread in the queue because some sessions aren't closed correctly. Solace documentation says that dropped sessions have their unacknowledged messages redelivered.
To fix this, I've added caching with an eviction policy. This should remove inactive sessions, forcing Solace to redeliver any unacknowledged messages. @sjvanrossum what do you think?

@bzablocki (Contributor Author) commented:

In the meantime, while we wait for a review, @ppawel would you be able to test this change?

@sjvanrossum (Contributor) commented Nov 21, 2024

@bzablocki I think this can be simplified by storing the message handles for the last emitted checkpoint mark in the reader and having both the reader and the checkpoint mark attempt to acknowledge those messages. The reader would attempt to do this at some point in advance (preferably by submitting a cleanup task to a background thread) and lastly in close before closing the reader. The checkpoint mark remains unmodified and would attempt to acknowledge messages in finalizeCheckpoint as it does today. Have a look at my comment about simplifying the safe-to-ack exchange between the reader and the checkpoint mark.

@bzablocki (Contributor Author) commented:

Done. We can consider delegating the ack to an async thread, but that can be added in a future PR.
Ready for a review @sjvanrossum and for a test run @ppawel :)

@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch from 036dd62 to 2d48c22 Compare November 21, 2024 14:18
Comment on lines +78 to +97
static {
  Duration cacheExpirationTimeout = Duration.ofMinutes(1);
  sessionServiceCache =
      CacheBuilder.newBuilder()
          .expireAfterAccess(cacheExpirationTimeout)
          .removalListener(
              (RemovalNotification<UUID, SessionService> notification) -> {
                LOG.info(
                    "SolaceIO.Read: Closing session for the reader with uuid {} as it has been idle for over {}.",
                    notification.getKey(),
                    cacheExpirationTimeout);
                SessionService sessionService = notification.getValue();
                if (sessionService != null) {
                  sessionService.close();
                }
              })
          .build();

  startCleanUpThread();
}
@sjvanrossum (Contributor), Nov 22, 2024:

Is this cache still necessary now that message acks are attempted whenever possible? Binding the lifetime of the session to a reader should be fine, since a reader's lifetime is generally quite long.

@bzablocki (Contributor Author):

The problem I have is that I can't reproduce the issue @ppawel was reporting, so I thought I would try this approach in conjunction with acking messages whenever possible.
Also, without this the session was only closed in the UnboundedSolaceReader#close method, which was never called in my experiments: it was not called when I drained or updated a pipeline, nor when advance() returned false. That's why I implemented this caching with a removal listener, to more eagerly close any open sessions.

@ppawel commented Nov 25, 2024

I will test it some time this week in our environment and come back with feedback, thanks for working on this.

@ppawel commented Nov 28, 2024

@bzablocki We have this PR running in our test environment and so far it looks suspiciously good ;) What we tested so far is:

  1. Consuming a large number of spooled messages (they spooled from the time the previous pipeline was shut down until the new one started with the fresh PR version) - that worked fine, they were consumed almost immediately.
  2. A peak in messages - with JmsIO we have the problem that processing and acking those messages during peak evening hours pretty much always gets behind even though there are enough workers to handle it, so we end up with a large spool. With this current SolaceIO version the spool is basically empty, which is our main goal with this particular pipeline.
  3. "Business as usual" during the day - this also looks good, the pipeline doesn't get behind and messages are not left in the queue unacked.

So overall it looks good, and we will definitely keep testing this version; especially for points (1) and (2) we need to make sure those are stable before we replace JmsIO.

The only thing I have so far is these logs right at the startup yesterday:

[screenshot of the startup logs]

I'm not sure if it's by design, but it just seems strange that there are messages in the queue (see point 1 above) when this new pipeline starts, yet for some reason the connections to Solace need to be reestablished after 1 minute of idle time before it starts consuming.

@bzablocki (Contributor Author) commented:

Thank you @ppawel for testing, great to hear it's looking good! Would you say it's ready to merge?

> seems strange that there are messages in the queue (see above point 1) when this new pipeline starts but for some reason the connections to Solace need to be reestablished after 1 minute of idle time for it to start consuming.

If it's only at startup, then it is probably because it takes some time from creating an UnboundedSolaceReader (when a client is created for the first time) to the client actually being used. If that takes more than 1 minute (the hard-coded max idle time), the client will be closed, removed, and recreated when needed. I think this is acceptable.

@ppawel commented Nov 28, 2024

> Thank you @ppawel for testing, great to hear it's looking good! Would you say it's ready to merge?

I haven't had time to look at the code, and I'm not a Beam internals expert, so it's hard to say, but as a user it looks good so far. We will be running it through the weekend at least, as the biggest peak in messages is on Saturdays.

Up to you if you want to (or can) wait until next week for more test reports; the alternative is that you merge it and I can open a separate issue if anything comes up from our tests.

Just a question regarding releases - is it fair to assume that it's going to be included in 2.62.0?

@bzablocki (Contributor Author) commented:

It will be available in 2.62.0, as long as we merge it before ~Dec 18. Let's see how it behaves over the weekend and if everything is fine we can merge it next week.

@bzablocki (Contributor Author) commented:

assign set of reviewers

Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @shunping for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@ppawel commented Dec 2, 2024

> It will be available in 2.62.0, as long as we merge it before ~Dec 18. Let's see how it behaves over the weekend and if everything is fine we can merge it next week.

Our tests look quite good after the weekend. We are now considering rolling out SolaceIO to prod ASAP due to additional new issues with JmsIO... do you think it is safe to apply your PR patch on top of Beam 2.61.0 (as we don't want to roll out a 2.62 snapshot version to prod)?

@bzablocki (Contributor Author) commented:

I think it should be fine, but I'd appreciate if @Abacn or @shunping reviewed it first :)

@shunping (Contributor) commented Dec 2, 2024

LGTM regarding io. Thanks!

@Abacn Abacn merged commit e279e55 into apache:master Dec 2, 2024
20 checks passed
@bzablocki (Contributor Author) commented:

I'm afraid this introduced flakiness in the integration test and potentially errors in the reader. I will revert this commit and work on a fix asap.
@ppawel please wait for the fix.

FYI, the integration tests sometimes crash on this line, as the client can be closed while it reads from the broker.
https://github.com/bzablocki/beam/blob/2d48c224d21601a64d65e9082cfaa49bb638672f/sdks/java/io/solace/src/main/java/org/apache/beam/sdk/io/solace/read/UnboundedSolaceReader.java#L145

bzablocki added a commit to bzablocki/beam that referenced this pull request Dec 2, 2024
Abacn pushed a commit that referenced this pull request Dec 2, 2024
@bzablocki (Contributor Author) commented:

Sorry for that, false alarm, but better safe than sorry. The flakiness was introduced in another PR that I was working on. This looks good. I'll create a new PR to reintroduce these changes.

bzablocki added a commit to bzablocki/beam that referenced this pull request Dec 3, 2024
Abacn pushed a commit that referenced this pull request Dec 11, 2024