
Implement Cross-Bundle BatchElements #29175

Merged · 14 commits · Nov 1, 2023

Conversation

@jrmccluskey (Contributor):

Implements a stateful version of BatchElements that works across bundles, allowing for streaming pipelines that can batch elements in a dynamic fashion.
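To make the idea concrete, here is a minimal, Beam-free sketch of what cross-bundle batching means: elements accumulate in per-key state that outlives a single bundle, a batch is emitted when a size threshold is reached, and a timer flush drains any remainder. The names (`StatefulBatcher`, `max_batch_size`, `on_expiry`) are illustrative stand-ins, not the actual Beam API.

```python
class StatefulBatcher:
    """Toy model of stateful batching that survives bundle boundaries."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.state = []  # stands in for a Beam BagState that outlives bundles

    def process(self, element):
        """Buffer one element; return a full batch when the threshold is hit."""
        self.state.append(element)
        if len(self.state) >= self.max_batch_size:
            batch, self.state = self.state, []
            return batch
        return None

    def on_expiry(self):
        """Stand-in for a Beam timer firing: flush whatever is buffered."""
        batch, self.state = self.state, []
        return batch or None


batcher = StatefulBatcher(max_batch_size=3)
batches = [b for e in range(7) if (b := batcher.process(e)) is not None]
# two full batches of 3; the timer flush picks up the straggler
leftover = batcher.on_expiry()
```

In a streaming pipeline the timer is what keeps latency bounded: without it, a slow source could leave a partial batch buffered indefinitely.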



@jrmccluskey (Contributor, Author): Run Python_Coverage PreCommit

@jrmccluskey (Contributor, Author): Run Python_Dataframes PreCommit

@jrmccluskey (Contributor, Author): Run Python_Examples PreCommit

@jrmccluskey (Contributor, Author): Run Python_Transforms PreCommit

codecov bot commented Oct 30, 2023

Codecov Report

Merging #29175 (a58498c) into master (8911595) will decrease coverage by 0.08%.
Report is 172 commits behind head on master.
The diff coverage is 4.08%.

@@            Coverage Diff             @@
##           master   #29175      +/-   ##
==========================================
- Coverage   38.38%   38.31%   -0.08%     
==========================================
  Files         686      688       +2     
  Lines      101640   101907     +267     
==========================================
+ Hits        39018    39041      +23     
- Misses      61042    61286     +244     
  Partials     1580     1580              
Flag     Coverage Δ
python   29.91% <4.08%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
sdks/python/apache_beam/transforms/util.py 37.63% <4.08%> (-2.34%) ⬇️

... and 20 files with indirect coverage changes


@jrmccluskey jrmccluskey changed the title [WIP] Implement Cross-Bundle BatchElements Implement Cross-Bundle BatchElements Oct 31, 2023
@jrmccluskey (Contributor, Author): R: @damccorm


Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@jrmccluskey (Contributor, Author): Run Python_Coverage PreCommit

@jrmccluskey jrmccluskey added this to the 2.52.0 Release milestone Oct 31, 2023
@damccorm (Contributor) left a comment:

Just a couple comments (one non-blocking for the future), otherwise LGTM


def expand(self, pcoll):
  if getattr(pcoll.pipeline.runner, 'is_streaming', False):
    raise NotImplementedError("Requires stateful processing (BEAM-2687)")
  elif self._max_batch_dur is not None:
    coder = coders.registry.get_coder(pcoll)
    return pcoll | WithKeys(0) | ParDo(
@damccorm (Contributor):
Non-blocking for this PR, but something we may want to consider: rather than using a single fixed key, does it make sense to try to have a single key per worker somehow? (One way to do this would be using multi_process_shared.py.)

That way we're still batching per machine in a parallelizable way, but we get stateful batching across bundles.

The current implementation is likely still useful for many use cases where batching is not the expensive part (e.g. RunInference) or there are few workers.
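The per-worker-key idea above can be sketched without Beam: instead of `WithKeys(0)` funneling every element through one key (and one serialized state cell), each worker process lazily picks a stable key of its own, so stateful batching still spans bundles but parallelizes across machines. This is a hypothetical illustration; the function name `per_worker_key` and the key format are not from the PR.

```python
import os
import uuid

_worker_key = None  # lazily chosen once per worker process


def per_worker_key(element):
    """Return (key, element) with a key that is stable within this process."""
    global _worker_key
    if _worker_key is None:
        # Any process-stable value works; pid plus a random suffix avoids
        # collisions when worker processes are recycled and pids are reused.
        _worker_key = f"{os.getpid()}-{uuid.uuid4().hex[:8]}"
    return _worker_key, element


keyed = [per_worker_key(e) for e in ["a", "b", "c"]]
keys = {k for k, _ in keyed}
# within one process every element shares a key, so state is per-worker
```

The trade-off is that state and timers become per-worker rather than global, which is exactly what makes batching parallel again at the cost of smaller, machine-local batches.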

@jrmccluskey (Contributor, Author):

At one point I suggested keying with worker IDs; it may be worth coming back to that idea.

@damccorm (Contributor):
Oh I missed that (or forgot), hopefully I wasn't against it initially?

Regardless, I like having this as our first pass; we can see how it performs and go from there.

sdks/python/apache_beam/transforms/util.py (outdated; resolved)
@damccorm (Contributor) left a comment:

LGTM once checks pass

@damccorm damccorm merged commit 5d6c182 into apache:master Nov 1, 2023
73 of 74 checks passed