SolaceIO.Read: handle occasional cases when finalizeCheckpoint is not executed #32962

Merged: 4 commits merged into apache:master on Dec 2, 2024

Conversation

@bzablocki (Contributor) commented Oct 29, 2024

Addresses #32596

This PR fixes a bug where messages get stuck on the Solace queue because the checkpoint's finalize method, responsible for acknowledging message processing, isn't always called.

Revisited approach (deprecated approach below):

  1. In advance(), add received messages to the receivedMessages list.
  2. In the getCheckpointMark() method, all the receivedMessages are moved to the safeToAckMessages list. The safeToAckMessages list is passed by reference to the CheckpointMark object.
  3. Each CheckpointMark object created from a given Reader has a reference to the same safeToAckMessages list. On the finalizeCheckpoint() call, it acknowledges all messages available for acknowledgement.
  4. The proactive acknowledgement of safeToAckMessages also happens in the advance() and close() methods. This means that once getCheckpointMark() marks messages as ready for checkpoint, they can be acknowledged in any of three places: finalizeCheckpoint(), advance(), or close(). A sketch of this flow is shown below the list.
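
A minimal sketch of this flow (hypothetical, simplified names; the real classes are UnboundedSolaceReader and SolaceCheckpointMark, and the real code uses the Solace message type):

```java
// Hypothetical, simplified sketch of the revisited approach; not the actual implementation.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class ReaderSketch {
  // Received in advance() but not yet covered by a checkpoint.
  private final Queue<String> receivedMessages = new ConcurrentLinkedQueue<>();
  // Covered by a checkpoint and safe to acknowledge; shared by reference with
  // every checkpoint mark created by this reader, so whoever runs first acks them.
  private final Queue<String> safeToAckMessages = new ConcurrentLinkedQueue<>();

  boolean advance() {
    ackSafeMessages(); // opportunistic ack (place 2 of 3)
    String msg = receiveFromBroker();
    if (msg == null) {
      return false;
    }
    receivedMessages.add(msg);
    return true;
  }

  CheckpointMarkSketch getCheckpointMark() {
    // Everything received so far becomes eligible for acknowledgement.
    String msg;
    while ((msg = receivedMessages.poll()) != null) {
      safeToAckMessages.add(msg);
    }
    return new CheckpointMarkSketch(safeToAckMessages);
  }

  void close() {
    ackSafeMessages(); // last chance before the session goes away (place 3 of 3)
  }

  private void ackSafeMessages() {
    String msg;
    while ((msg = safeToAckMessages.poll()) != null) {
      ackOnBroker(msg);
    }
  }

  // Placeholders standing in for the Solace client calls.
  private String receiveFromBroker() { return null; }
  private void ackOnBroker(String msg) {}

  class CheckpointMarkSketch {
    private final Queue<String> safeToAck; // shared reference, not a copy

    CheckpointMarkSketch(Queue<String> safeToAck) {
      this.safeToAck = safeToAck;
    }

    void finalizeCheckpoint() {
      String msg;
      while ((msg = safeToAck.poll()) != null) {
        ackOnBroker(msg); // place 1 of 3
      }
    }
  }
}
```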

Deprecated approach:
The approach is similar to the one in PubSubIO. The flow there is the following:

  1. In advance(), we copy the message to the safeToAckIds list (source).
  2. When the runner requests a checkpoint, we create it with a copy of the safeToAckIds list (source).
  3. When the checkpoint is finalized, we add the messages to a queue of finalized messages (source).
  4. In the advance() method, we remove whatever is in the queue of finalized messages from safeToAckIds (source). A rough sketch of this flow follows the list.
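
For comparison, a rough sketch of that deprecated, PubSubIO-style flow (hypothetical names; the actual acknowledgement call is omitted, as it is handled elsewhere in that design):

```java
// Hypothetical sketch of the deprecated flow: the checkpoint holds a copy of the
// safe-to-ack ids and hands finalized ids back to the reader through a queue.
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class DeprecatedFlowSketch {
  private final List<String> safeToAckIds = new ArrayList<>();
  private final Queue<String> finalizedIds = new ConcurrentLinkedQueue<>();

  boolean advance() {
    // Step 4: drop whatever has already been finalized from the safe-to-ack list.
    String id;
    while ((id = finalizedIds.poll()) != null) {
      safeToAckIds.remove(id);
    }
    String msgId = receiveFromBroker();
    if (msgId == null) {
      return false;
    }
    safeToAckIds.add(msgId); // step 1: remember the message id
    return true;
  }

  CheckpointSketch getCheckpointMark() {
    return new CheckpointSketch(new ArrayList<>(safeToAckIds)); // step 2: a copy
  }

  class CheckpointSketch {
    private final List<String> ids;

    CheckpointSketch(List<String> ids) {
      this.ids = ids;
    }

    void finalizeCheckpoint() {
      finalizedIds.addAll(ids); // step 3: hand the finalized ids back to the reader
    }
  }

  private String receiveFromBroker() { return null; } // placeholder for the broker call
}
```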

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md


See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@bzablocki (Contributor Author) commented:

cc @ppawel - since I wasn't able to reproduce the issue you reported, would you mind testing it in your environment?

Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ppawel commented Nov 11, 2024

> cc @ppawel - since I wasn't able to reproduce the issue you reported, would you mind testing it in your environment?

Sorry for the late reply, I haven't had much time recently for this topic, but last week I set up a streaming job in our test environment with your branch and just left it running, consuming the same messages as the prod job. This job had the Google Streaming Engine enabled.

Today I checked the job and unfortunately there were over 2k messages spooled in the Solace queue. I haven't investigated deeper what exactly the checkpointing behavior is with the new branch; hopefully I will get more time for this again, as we really want to fully switch from JmsIO to SolaceIO, preferably with Streaming Engine enabled.

// It's possible for a checkpoint to be taken but never finalized.
// So we simply copy whatever safeToAckIds we currently have.
Map<Long, BytesXMLMessage> snapshotSafeToAckMessages = Maps.newHashMap(safeToAckMessages);
return new SolaceCheckpointMark(this::markAsAcked, active, snapshotSafeToAckMessages);
Contributor:

The checkpoint mark does not need to be aware of whether the reader is or isn't active. The call to close may be observed by the finalizing thread to first close the underlying reader before atomically setting the active flag to false. Put a try/catch statement around the acknowledgement loop in the checkpoint mark's finalizer, if it trips on IllegalStateException, then the reader has been closed and none of the queued elements can be acknowledged.

@bzablocki (Contributor Author), Nov 19, 2024:

Good point, but I added the try/catch only around the ackMessage method call. I think it could fail for reasons other than a closed reader.
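
A hedged sketch of that narrower try/catch, assuming the JCSMP BytesXMLMessage type (an illustration, not the exact code from this PR):

```java
import com.solacesystems.jcsmp.BytesXMLMessage;
import java.util.List;

class AckSketch {
  // Wrap only the ackMessage() call: if one ack fails because the underlying
  // session was already closed, skip that message (Solace will redeliver it)
  // instead of aborting the whole acknowledgement loop.
  static void ackAll(List<BytesXMLMessage> safeToAckMessages) {
    for (BytesXMLMessage msg : safeToAckMessages) {
      try {
        msg.ackMessage();
      } catch (IllegalStateException e) {
        // Reader/session closed; the unacknowledged message will be redelivered.
      }
    }
  }
}
```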

@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch 4 times, most recently from 42a0cbd to 740646d Compare November 19, 2024 15:26
…tion strategy and close services more eagerly.
@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch from 740646d to da95027 Compare November 19, 2024 15:33
@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch from fa98b30 to 48520ea Compare November 19, 2024 16:04
@bzablocki (Contributor Author) commented:

I believe that messages are left unread in the queue because some sessions aren't closed correctly. Solace documentation says that dropped sessions have their unacknowledged messages redelivered.
To fix this, I've added caching with an eviction policy. This should remove inactive sessions, forcing Solace to redeliver any unacknowledged messages. @sjvanrossum what do you think?

@bzablocki (Contributor Author) commented:

In the meantime, while we wait for a review, @ppawel would you be able to test this change?

@sjvanrossum (Contributor) commented Nov 21, 2024

@bzablocki I think this can be simplified by storing the message handles for the last emitted checkpoint mark in the reader and having both the reader and the checkpoint mark attempt to acknowledge those messages. The reader would attempt to do this at some point in advance (preferably by submitting a cleanup task to a background thread) and lastly in close before closing the reader. The checkpoint mark remains unmodified and would attempt to acknowledge messages in finalizeCheckpoint as it does today. Have a look at my comment about simplifying the safe-to-ack exchange between the reader and the checkpoint mark.

@bzablocki (Contributor Author) commented:

Done. We can consider delegating the ack to an async thread, but that can be added in a future PR.
Ready for a review @sjvanrossum and for a test run @ppawel :)

@bzablocki bzablocki force-pushed the solace-checkpoint-finzalization branch from 036dd62 to 2d48c22 Compare November 21, 2024 14:18
Comment on lines +78 to +97
static {
  Duration cacheExpirationTimeout = Duration.ofMinutes(1);
  sessionServiceCache =
      CacheBuilder.newBuilder()
          .expireAfterAccess(cacheExpirationTimeout)
          .removalListener(
              (RemovalNotification<UUID, SessionService> notification) -> {
                LOG.info(
                    "SolaceIO.Read: Closing session for the reader with uuid {} as it has been idle for over {}.",
                    notification.getKey(),
                    cacheExpirationTimeout);
                SessionService sessionService = notification.getValue();
                if (sessionService != null) {
                  sessionService.close();
                }
              })
          .build();

  startCleanUpThread();
}
@sjvanrossum (Contributor), Nov 22, 2024:

Is this cache still necessary now that message acks are attempted whenever possible? Binding the lifetime of the session to a reader should be fine, since a reader's lifetime is generally quite long.

@bzablocki (Contributor Author):

The problem I have is that I can't reproduce the issue @ppawel was reporting, so I thought I would try this approach in conjunction with acking messages whenever possible.
Also, without this the session was only closed in the UnboundedSolaceReader#close method, which was never called in my experiments: it was not called when I drained or updated a pipeline, nor when advance() returned false. That's why I implemented this caching with a removal listener, to more eagerly close any open sessions.

@ppawel commented Nov 25, 2024

I will test it some time this week in our environment and come back with feedback, thanks for working on this.

@ppawel commented Nov 28, 2024

@bzablocki We have this PR running in our test environment and so far it looks suspiciously good ;) What we tested so far is:

  1. Consuming a large number of spooled messages (they spooled from the time the previous pipeline was shut down until the new one started with the fresh PR version) - that worked fine, they were consumed almost immediately.
  2. A peak in messages - with JmsIO we have the problem that processing and acking those messages during peak evening hours pretty much always gets behind even though there are enough workers to handle it, so we end up with a large spool. With this current SolaceIO version the spool is basically empty, which is our main goal with this particular pipeline.
  3. "Business as usual" during the day - this also looks good, the pipeline doesn't get behind and messages are not left in the queue unacked.

So overall it looks good, and we will definitely keep testing this version; especially for points (1) and (2) we need to make sure those are stable before we replace JmsIO.

The only thing I have so far is these logs right at the startup yesterday:

[screenshot of the startup logs]

I'm not sure if it's by design, but it just seems strange that there are messages in the queue (see point 1 above) when this new pipeline starts, yet for some reason the connections to Solace need to be reestablished after 1 minute of idle time before it starts consuming.

@bzablocki (Contributor Author) commented:

Thank you @ppawel for testing, great to hear it's looking good! Would you say it's ready to merge?

> seems strange that there are messages in the queue (see above point 1) when this new pipeline starts but for some reason the connections to Solace need to be reestablished after 1 minute of idle time for it to start consuming.

If it's only at startup, then it is probably because it takes some time from creating an UnboundedSolaceReader (when a client is created for the first time) to the client actually being used. If that takes more than 1 minute (the hard-coded max idle time), the client will be closed, removed, and recreated when needed. I think this is acceptable.

@ppawel commented Nov 28, 2024

> Thank you @ppawel for testing, great to hear it's looking good! Would you say it's ready to merge?

I haven't had time to look at the code, and I'm not a Beam internals expert, so it's hard to say, but as a user it looks good so far. We will be running it through the weekend at least, as the biggest peak in messages is on Saturdays.

Up to you if you want to (or can) wait until next week for more test reports; the alternative is that you merge it and I can open a separate issue if anything comes up from our tests.

Just a question regarding releases - is it fair to assume that it's going to be included in 2.62.0?

@bzablocki (Contributor Author) commented:

It will be available in 2.62.0, as long as we merge it before ~Dec 18. Let's see how it behaves over the weekend and if everything is fine we can merge it next week.

@bzablocki (Contributor Author) commented:

assign set of reviewers

Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @shunping for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@ppawel commented Dec 2, 2024

> It will be available in 2.62.0, as long as we merge it before ~Dec 18. Let's see how it behaves over the weekend and if everything is fine we can merge it next week.

Our tests look quite good after the weekend. We are now considering rolling out SolaceIO to prod ASAP due to additional new issues with JmsIO... do you think it is safe to apply your PR patch on top of Beam 2.61.0 (as we don't want to roll out a 2.62 snapshot version to prod)?

@bzablocki (Contributor Author) commented:

I think it should be fine, but I'd appreciate if @Abacn or @shunping reviewed it first :)

@shunping (Contributor) commented Dec 2, 2024

LGTM regarding io. Thanks!

@Abacn Abacn merged commit e279e55 into apache:master Dec 2, 2024
20 checks passed
@bzablocki (Contributor Author) commented:

I'm afraid this introduced flakiness in the integration test and potentially errors in the reader. I will revert this commit and work on a fix asap.
@ppawel please wait for the fix.

FYI, the integration tests sometimes crash on this line, as the client can be closed while it reads from the broker.
https://github.com/bzablocki/beam/blob/2d48c224d21601a64d65e9082cfaa49bb638672f/sdks/java/io/solace/src/main/java/org/apache/beam/sdk/io/solace/read/UnboundedSolaceReader.java#L145

bzablocki added a commit to bzablocki/beam that referenced this pull request Dec 2, 2024
Abacn pushed a commit that referenced this pull request Dec 2, 2024
@bzablocki (Contributor Author) commented:

Sorry for that, false alarm, but better safe than sorry. The flakiness was introduced in another PR that I was working on. This looks good. I'll create a new PR to reintroduce these changes.

bzablocki added a commit to bzablocki/beam that referenced this pull request Dec 3, 2024
Abacn pushed a commit that referenced this pull request Dec 11, 2024