Add OrderedListState support for SparkRunner #33212

twosom · 2024-11-25T11:19:49Z

fixes #33211
fixes #31724
fixes #31723

This PR contains the following changes:

Added OrderedListState to SparkStateInternals
Implemented equals and hashCode in FlinkOrderedListState
Added an OrderedListState test
Fixed the FlinkRunner state test

github-actions · 2024-11-25T12:10:47Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`

codecov · 2024-11-25T13:31:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.01%. Comparing base (e5bd69d) to head (74033fa).

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #33212      +/-   ##
============================================
+ Coverage     57.42%   59.01%   +1.58%     
- Complexity     1475     3162    +1687     
============================================
  Files           970     1136     +166     
  Lines        154525   175078   +20553     
  Branches       1076     3354    +2278     
============================================
+ Hits          88743   103324   +14581     
- Misses        63578    68406    +4828     
- Partials       2204     3348    +1144

Flag	Coverage Δ
java	`70.27% <ø> (+1.69%)`	⬆️

twosom · 2024-12-03T15:06:23Z

Run Flink Container PreCommit

liferoad · 2024-12-04T19:25:42Z

R: @shunping

github-actions · 2024-12-04T19:26:57Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment `assign set of reviewers`

liferoad · 2024-12-04T20:37:19Z

Did you rebase your branch?

twosom · 2024-12-05T03:53:24Z

> Did you rebase your branch?

Thanks! I have rebased now.

twosom · 2024-12-05T07:14:10Z

Run Java_Spark3_Versions PreCommit

liferoad · 2024-12-05T14:22:49Z

Please check this:
:runners:spark:3:analyzeClassesDependencies FAILED

twosom · 2024-12-05T14:55:44Z

@liferoad
Sorry, my bad. Fixed the import statement to use Beam's vendored Guava Lists instead of clearspring's.
Thanks

twosom · 2024-12-05T15:37:29Z

Run Java PreCommit

twosom · 2024-12-15T06:13:36Z

Run Java PreCommit

twosom · 2024-12-19T00:06:34Z

Run Java_GCP_IO_Direct PreCommit

twosom · 2024-12-19T00:06:59Z

Run Java PreCommit

kennknowles · 2024-12-19T13:36:56Z

.../org/apache/beam/runners/flink/translation/wrappers/streaming/state/FlinkStateInternals.java

+
+    @Override
+    public int hashCode() {
+      int result = namespace.hashCode();


Can use Objects.hashCode

twosom · 2024-12-20

@kennknowles
The current implementation follows the same hashCode semantics used in FlinkStateInternals. I'm unsure whether suggesting Objects.hashCode means I should replace the current implementation, or whether you mean that overriding hashCode is unnecessary altogether.

kennknowles · 2024-12-20

Overriding equals and hashCode should always occur together, so it is necessary to override hashCode.

I was just suggesting this pattern instead of doing your own math:

beam/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/SuccessOrFailure.java

Line 88 in 7c86bf3

return Objects.hashCode(isSuccess, site, throwable);
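
A multi-argument `Objects.hashCode(...)` call like the one quoted above matches the signature of Guava's `Objects.hashCode(Object...)`, not the single-argument `java.util.Objects.hashCode(Object)`. The closest JDK-only equivalent is `java.util.Objects.hash(Object...)`. Below is a minimal sketch of the suggested pattern, assuming a hypothetical `StateKey` class whose `namespace` and `stateId` fields are illustrative only and are not taken from FlinkOrderedListState:

```java
import java.util.Objects;

// Hypothetical value class used only to illustrate the pattern;
// the field names are assumptions, not the PR's actual fields.
final class StateKey {
  private final String namespace;
  private final String stateId;

  StateKey(String namespace, String stateId) {
    this.namespace = namespace;
    this.stateId = stateId;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof StateKey)) {
      return false;
    }
    StateKey that = (StateKey) o;
    // Compare exactly the fields that hashCode() hashes.
    return Objects.equals(namespace, that.namespace)
        && Objects.equals(stateId, that.stateId);
  }

  @Override
  public int hashCode() {
    // Library call instead of hand-written "31 * result + ..." arithmetic.
    return Objects.hash(namespace, stateId);
  }
}
```

Keeping the field list identical in `equals` and `hashCode` is what preserves the contract that equal objects produce equal hash codes.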

kennknowles · 2024-12-19T13:44:30Z

runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkStateInternals.java

+    }
+
+    private SortedMap<Instant, TimestampedValue<T>> readAsMap() {
+      final List<TimestampedValue<T>> listValues =


The reason for each additional kind of state is to efficiently offer a novel form of state access. The state access here has the same performance characteristics as ValueState. It is actually better for a runner to reject a pipeline than to run it with performance characteristics that don't match the expected performance contract.

Is there some underlying mechanism in Spark that could implement OrderedListState efficiently and scalably?

twosom · 2024-12-19

I agree with your point. Let me share my thoughts on why I chose this implementation.

I've noticed that ListState/OrderedListState is mostly used in situations where writes happen much more frequently than reads. That's why I went with ArrayList instead of SortedMap - it's simply better at handling these frequent writes.

When it comes to reading data, it usually happens in just a couple of scenarios - either during OnTimer execution or when the list hits a certain size. So even if the read performance takes a small hit, it's not really going to affect the overall performance much.

It's also worth mentioning that FlinkOrderedListState uses the same approach, which gives me confidence in this design choice.

That's why I think the current implementation makes more sense for real-world usage patterns.
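
The trade-off described above (cheap appends, sorting deferred to read time) can be sketched as follows. This is not the PR's code: `AppendThenSortList` is a hypothetical in-memory class, it uses `java.time.Instant` instead of the Joda `Instant` that Beam uses, and the range is assumed to be inclusive of the minimum and exclusive of the limit.

```java
import java.time.Instant;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "append on write, sort on read".
final class AppendThenSortList<T> {
  private final List<Map.Entry<Instant, T>> entries = new ArrayList<>();

  // Write path: amortized O(1) append, no ordering work.
  void add(Instant timestamp, T value) {
    entries.add(new SimpleImmutableEntry<>(timestamp, value));
  }

  // Read path: sorts a copy of the whole list (O(n log n)), then filters.
  // Assumed range semantics: [minTimestamp, limitTimestamp).
  List<T> readRange(Instant minTimestamp, Instant limitTimestamp) {
    List<Map.Entry<Instant, T>> sorted = new ArrayList<>(entries);
    // List.sort is stable, so equal timestamps keep insertion order.
    sorted.sort(Map.Entry.comparingByKey());
    List<T> result = new ArrayList<>();
    for (Map.Entry<Instant, T> e : sorted) {
      Instant ts = e.getKey();
      if (!ts.isBefore(minTimestamp) && ts.isBefore(limitTimestamp)) {
        result.add(e.getValue());
      }
    }
    return result;
  }
}
```

Every read touches and sorts the full list regardless of how narrow the requested range is, which is the cost the reviewer's question is about.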

kennknowles · 2024-12-20

I see. If Flink is implemented this way, then it is OK with me to follow that precedent. My point was that this does not actually add any capability beyond what ValueState provides. It is just a minor API wrapper adjustment: still useful, but not the main purpose.

So we can merge with this design. But if you think about following up, here is how we would really like this to behave:

add should call some native Spark API that writes the element without reading the list

readRange should only read the requested range, ideally seeking in near-constant time (aka without a scan or sort)

clearRange should also seek in near-constant time

isEmpty should not read the list
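
As a rough illustration of the access pattern listed above, here is an in-memory sketch backed by a `TreeMap`. It says nothing about which Spark API could provide this; `TreeBackedOrderedList` is a hypothetical class, it uses `java.time.Instant` rather than Beam's Joda `Instant`, the range is assumed to be inclusive of the minimum and exclusive of the limit, and seeks are logarithmic rather than truly near-constant.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a store with ordered range access.
final class TreeBackedOrderedList<T> {
  // Values sharing a timestamp are kept in insertion order.
  private final NavigableMap<Instant, List<T>> byTimestamp = new TreeMap<>();

  // Inserts one element without reading the other elements.
  void add(Instant timestamp, T value) {
    byTimestamp.computeIfAbsent(timestamp, k -> new ArrayList<>()).add(value);
  }

  // Seeks to the start of the range and iterates only over [min, limit).
  List<T> readRange(Instant min, Instant limit) {
    List<T> result = new ArrayList<>();
    for (List<T> values : byTimestamp.subMap(min, true, limit, false).values()) {
      result.addAll(values);
    }
    return result;
  }

  // Removes only the entries in [min, limit) through the range view.
  void clearRange(Instant min, Instant limit) {
    byTimestamp.subMap(min, true, limit, false).clear();
  }

  // Answers from map metadata without iterating the elements.
  boolean isEmpty() {
    return byTimestamp.isEmpty();
  }
}
```

A runner-level implementation would need the same properties from its underlying state store, which is the open question raised in this thread.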

twosom · 2024-12-20T15:22:04Z

Run Java PreCommit

github-actions bot added runners spark core flink labels Nov 25, 2024

twosom force-pushed the spark-ordered-list-state branch from f9ec51d to 2c06551 Compare December 5, 2024 03:52

kennknowles self-requested a review December 5, 2024 14:55

kennknowles requested changes Dec 19, 2024

twosom added 5 commits December 20, 2024 23:14

feat: add support OrderedListState for spark runner

59c7626

test: add ordered list state test

3db49e0

feat: implementation flink ordered list state equals and hashcode

81d1790

test: fix flink runner test

b938690

update CHANGES.md

9c5d9ba

twosom force-pushed the spark-ordered-list-state branch from bfea9e1 to 9c5d9ba Compare December 20, 2024 14:18
