Implement Cross-Bundle BatchElements #29175
Conversation
Run Python_Coverage PreCommit
Run Python_Dataframes PreCommit
Run Python_Examples PreCommit
Run Python_Transforms PreCommit
Codecov Report
@@ Coverage Diff @@
## master #29175 +/- ##
==========================================
- Coverage 38.38% 38.31% -0.08%
==========================================
Files 686 688 +2
Lines 101640 101907 +267
==========================================
+ Hits 39018 39041 +23
- Misses 61042 61286 +244
Partials 1580 1580
... and 20 files with indirect coverage changes
R: @damccorm
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Run Python_Coverage PreCommit
Just a couple comments (one non-blocking for the future), otherwise LGTM
```python
def expand(self, pcoll):
  if getattr(pcoll.pipeline.runner, 'is_streaming', False):
    raise NotImplementedError("Requires stateful processing (BEAM-2687)")
  elif self._max_batch_dur is not None:
    coder = coders.registry.get_coder(pcoll)
    return pcoll | WithKeys(0) | ParDo(
```
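To make the size-or-duration flush semantics concrete, here is a minimal, self-contained sketch of the accumulation logic a stateful, cross-bundle BatchElements needs. This is a toy model, not the PR's actual `DoFn` (the real implementation uses Beam state and timers); the class and parameter names are illustrative only.

```python
import time


class CrossBundleBatcher:
    """Toy model of cross-bundle batching: elements accumulate across calls
    (standing in for bundles) and a batch is emitted once it is full or has
    been open longer than the configured duration."""

    def __init__(self, max_batch_size=3, max_batch_dur=10.0, clock=time.time):
        self._max_batch_size = max_batch_size
        self._max_batch_dur = max_batch_dur  # seconds; analogous to the PR's duration cap
        self._clock = clock
        self._buffer = []
        self._batch_start = None

    def process(self, element):
        """Buffer one element; return a completed batch or None."""
        if not self._buffer:
            self._batch_start = self._clock()
        self._buffer.append(element)
        full = len(self._buffer) >= self._max_batch_size
        stale = self._clock() - self._batch_start >= self._max_batch_dur
        if full or stale:
            return self._flush()
        return None

    def _flush(self):
        """Emit whatever is buffered (the real DoFn does this from a timer)."""
        batch, self._buffer = self._buffer, []
        return batch
```

In Beam terms, the buffer corresponds to per-key state and the staleness check corresponds to an expiry timer, which is why the transform keys the collection before the `ParDo`.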
Non-blocking for this PR, but something we may want to consider; rather than using a single fixed key, does it make sense to try to have a single key per worker somehow? (one way to do this would be using multi_process_shared.py)
That way we're still batching per machine in a parallelizable way, but we get stateful batching across bundles.
The current implementation is likely still useful for many use cases where batching is not the expensive part (e.g. RunInference) or there are few workers.
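The per-worker-key suggestion could be sketched as follows. This is a hypothetical helper, not part of the PR or of Beam's API: it derives one stable key per worker process (pid plus a random suffix to guard against pid reuse), so stateful batching would still fan out across machines instead of funneling through a single key.

```python
import os
import uuid

# Cached per-process key; computed lazily on first use.
_WORKER_KEY = None


def per_worker_key(_element=None):
    """Return a key that is constant within this worker process but (with
    overwhelming probability) unique across workers.

    Accepts and ignores an element argument so it could be used as a keying
    function, e.g. WithKeys(per_worker_key) instead of WithKeys(0).
    """
    global _WORKER_KEY
    if _WORKER_KEY is None:
        _WORKER_KEY = '%d-%s' % (os.getpid(), uuid.uuid4().hex)
    return _WORKER_KEY
```

A real version would need care around fusion and worker lifecycle (which is presumably why `multi_process_shared.py` was mentioned), but it shows the shape of the idea: per-machine batching that remains parallelizable.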
At one point I suggested keying with worker IDs; it may be worth coming back to that idea.
Oh I missed that (or forgot), hopefully I wasn't against it initially?
Regardless, I like having this as our first pass, we can see how it performs and go from there
LGTM once checks pass
Implements a stateful version of BatchElements that works across bundles, allowing streaming pipelines to batch elements dynamically.