Add OrderedListState support for SparkRunner #33212
base: master
Can use Objects.hashCode
@kennknowles
The current implementation follows the same hashCode semantics used in FlinkStateInternals. I'm a bit unclear whether suggesting Objects.hashCode means we should replace the current implementation, or whether you're suggesting that overriding hashCode is unnecessary altogether.
Overriding equals and hashCode should always occur together, so it is necessary to override hashCode. I was just suggesting this pattern instead of doing your own math:
beam/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/SuccessOrFailure.java
Line 88 in 7c86bf3
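The suggested pattern can be sketched as follows. This is an illustrative value class, not the PR's actual code; the field names are hypothetical. The point is to delegate the hash math to java.util.Objects.hash rather than hand-rolling a "31 * a + b" computation.

```java
import java.util.Objects;

// Illustrative value class (hypothetical fields, not the PR's code).
final class StateKey {
  private final String namespace;
  private final String id;

  StateKey(String namespace, String id) {
    this.namespace = namespace;
    this.id = id;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof StateKey)) {
      return false;
    }
    StateKey that = (StateKey) o;
    return namespace.equals(that.namespace) && id.equals(that.id);
  }

  @Override
  public int hashCode() {
    // Instead of hand-rolled arithmetic, let Objects.hash combine the fields.
    return Objects.hash(namespace, id);
  }
}
```

This keeps equals and hashCode consistent by construction: both methods are defined over the same field list.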
The reason for each additional kind of state is to efficiently offer a novel form of state access. The state access here has the same performance characteristics as ValueState. It is actually better for the runner to reject a pipeline than to run it with performance characteristics that don't match the expected performance contract.

Is there some underlying mechanism in Spark that could implement OrderedListState efficiently and scalably?
I agree with your point. Let me share my thoughts on why I chose this implementation.
I've noticed that ListState/OrderedListState is mostly used in situations where writes happen much more frequently than reads. That's why I went with ArrayList instead of SortedMap - it's simply better at handling these frequent writes.
When it comes to reading data, it usually happens in just a couple of scenarios - either during OnTimer execution or when the list hits a certain size. So even if the read performance takes a small hit, it's not really going to affect the overall performance much.
It's also worth mentioning that FlinkOrderedListState uses the same approach, which gives me confidence in this design choice.
That's why I think the current implementation makes more sense for real-world usage patterns.
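The write-heavy trade-off described above can be illustrated with a minimal sketch (this is a simplified local model, not the PR's or Flink's actual code): appends go into an ArrayList at O(1) amortized cost, and the O(n log n) sorting cost is only paid on the rare read path.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified illustration of a write-optimized ordered list:
// frequent add() calls stay cheap; ordering is established lazily on read().
final class WriteOptimizedOrderedList<T> {
  private final List<T> buffer = new ArrayList<>();
  private final Comparator<T> comparator;

  WriteOptimizedOrderedList(Comparator<T> comparator) {
    this.comparator = comparator;
  }

  void add(T value) {
    // Hot path: no sorting or lookup cost on write.
    buffer.add(value);
  }

  List<T> read() {
    // Cold path: sort a snapshot only when a read is actually requested.
    List<T> snapshot = new ArrayList<>(buffer);
    snapshot.sort(comparator);
    return snapshot;
  }
}
```

A SortedMap-backed variant would instead pay an O(log n) cost on every add, which is why the ArrayList layout favors the write-dominated usage pattern described in the comment.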
I see. If Flink is implemented this way then it is OK with me to follow that precedent. My point was that this does not actually add capability beyond what ValueState provides. It is just a minor API wrapper adjustment - still useful, but not the main purpose.

So we can merge with this design. But if you think about following up, here is how we would really like this to behave:
- add should call some native Spark API that writes the element without reading the list
- readRange should only read the requested range, ideally seeking in near-constant time (i.e. without a scan or sort)
- clearRange should also seek in near-constant time
- isEmpty should not read the list
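The desired access pattern above can be sketched with a sorted index such as a TreeMap (a local illustration only; it assumes one element per timestamp and is not a Spark API - a real implementation would need equivalent support from Spark's state backend):

```java
import java.time.Instant;
import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

// Local sketch of the desired OrderedListState behavior: a sorted index
// lets readRange/clearRange seek to the range boundaries in O(log n)
// instead of scanning and sorting the whole list, and isEmpty never
// touches the elements. Simplification: one element per timestamp.
final class SortedOrderedListSketch<T> {
  private final NavigableMap<Instant, T> index = new TreeMap<>();

  void add(Instant ts, T value) {
    // Blind write: never reads the existing list.
    index.put(ts, value);
  }

  Collection<T> readRange(Instant minInclusive, Instant maxExclusive) {
    // Seeks to the boundaries; no full scan or sort.
    return index.subMap(minInclusive, true, maxExclusive, false).values();
  }

  void clearRange(Instant minInclusive, Instant maxExclusive) {
    // Clears the submap view, again seeking rather than scanning.
    index.subMap(minInclusive, true, maxExclusive, false).clear();
  }

  boolean isEmpty() {
    // O(1); does not read any elements.
    return index.isEmpty();
  }
}
```

The gap the reviewer points at is that with an in-memory ArrayList none of these operations can avoid materializing the whole list, whereas a backend-supported sorted structure keeps each of them proportional to the range touched.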