[Python]Enable state cache to 100 MB #28781
Codecov Report
@@ Coverage Diff @@
## master #28781 +/- ##
==========================================
- Coverage 38.36% 38.32% -0.05%
==========================================
Files 687 688 +1
Lines 101745 101833 +88
==========================================
- Hits 39037 39027 -10
- Misses 61129 61230 +101
+ Partials 1579 1576 -3
... and 8 files with indirect coverage changes
R: @tvalentyn
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Line 247 in f30f6c5
Java has a default of 100 MB. Do we want to have a similar flag to
It does, but it's not memory aware, just element sized. Technically, it's used for cross-bundle applications across the State API, which side inputs also use. Unless the cache value also applies to the Combiner Lifting cache, it wouldn't be a true "worker" cache, vs. a "state" cache.
return 0
if not state_cache_size:
  # to maintain backward compatibility
  for experiment in experiments:
pretty sure there is already a helper that does this parsing.
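A minimal sketch of what such a parsing helper could look like, written as plain Python so it runs without Beam installed. The names `lookup_experiment` and `legacy_state_cache_size_mb` are hypothetical and chosen for illustration; tolerating an `MB` suffix is an assumption based on the `--state_cache_size=<X>MB` form in the PR description, not confirmed behavior.

```python
from typing import Iterable, Optional


def lookup_experiment(
    experiments: Optional[Iterable[str]], key: str) -> Optional[str]:
  """Returns the value of a `key=value` experiment, or None if absent.

  Hypothetical helper for illustration only.
  """
  for experiment in experiments or ():
    name, sep, value = experiment.partition('=')
    if sep and name.strip() == key:
      return value.strip()
  return None


def legacy_state_cache_size_mb(
    experiments: Optional[Iterable[str]]) -> Optional[int]:
  """Reads the legacy `state_cache_size` experiment as a number of MB."""
  raw = lookup_experiment(experiments, 'state_cache_size')
  if raw is None:
    return None
  # Assumption: an optional "MB" suffix is accepted and ignored.
  if raw.upper().endswith('MB'):
    raw = raw[:-2]
  return int(raw)
```

Keeping the lookup separate from the unit handling means the backward-compatibility branch only has to deal with one place that knows the experiment's name.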
'--state_cache_size_mb',
dest='state_cache_size',
'--max_cache_memory_usage_mb',
dest='max_cache_memory_usage_mb',
type=int,
default=None,
Any concerns about defining the 100 MB default here?
Current flow: if it is None here, it gives us an opportunity to look in `--experiments` for `state_cache_size`.
If the value is defined here as 100 MB and the user passes `--experiments=state_cache_size`, we should override the 100 MB with the value from `--experiments=state_cache_size`.
I don't see any concerns with setting the default here, though some code might need to change.
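The precedence described in this thread can be sketched with plain `argparse`, assuming the option carries a 100 MB default and a `state_cache_size` experiment, when present, overrides it. This is an illustration rather than Beam's actual option class; `build_parser` and `effective_cache_mb` are hypothetical names.

```python
import argparse

DEFAULT_CACHE_MB = 100  # default discussed in this PR


def build_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--max_cache_memory_usage_mb',
      dest='max_cache_memory_usage_mb',
      type=int,
      default=DEFAULT_CACHE_MB)
  parser.add_argument(
      '--experiments', dest='experiments', action='append', default=None)
  return parser


def effective_cache_mb(argv) -> int:
  """Returns the cache size in MB, letting the legacy experiment win."""
  opts = build_parser().parse_args(argv)
  for experiment in opts.experiments or []:
    name, sep, value = experiment.partition('=')
    if sep and name == 'state_cache_size':
      # Backward compatibility: the experiment overrides the option value.
      return int(value)
  return opts.max_cache_memory_usage_mb
```

With a non-None default, the option alone can no longer signal "not set", which is why the experiment has to be checked first in this sketch.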
CHANGES.md
Outdated
@@ -74,6 +74,7 @@ should handle this. ([#25252](https://github.com/apache/beam/issues/25252)).
jobs using the DataStream API. By default the option is set to false, so the batch jobs are still executed
using the DataSet API.
* `upload_graph` as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK ([PR#28621](https://github.com/apache/beam/pull/28621).
* state cache has been enabled to a default of 100 MB. Use `--max_cache_memory_usage_mb=X` to provide cache size. (Python) ([#28770](https://github.com/apache/beam/issues/28770)). |
Could you add a short description of what this cache is and link to https://beam.apache.org/releases/pydoc/2.50.0/apache_beam.options.pipeline_options.html#module-apache_beam.options.pipeline_options?
Particularly, we should make it clear here that this impacts both user state and side inputs
Run Python_PVR_Flink PreCommit
type=int,
default=100,
help=(
'Size of the SdkHarness cache to store user state and side inputs '
nit: consider the following wording
'Size of the SDK Harness cache to store user state and side inputs '
'in MB. Default is 100MB. If the cache is full, least recently '
'used elements will be evicted. This cache is per '
'each SDK Harness instance. SDK Harness is a component responsible '
'for executing the user code and communicating with the runner. '
'Depending on the runner, '
'there may be more than one SDK Harness process running on the same worker node. '
'Increasing cache size might improve performance of some pipelines, but can lead to an increase '
'in memory consumption and OOM errors if workers are not appropriately provisioned.'
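To illustrate the eviction behavior this wording describes, here is a small size-bounded LRU cache. It is a sketch under stated assumptions, not the SDK harness implementation: the class name `LruByteCache` is hypothetical, callers pass entry sizes explicitly (falling back to `sys.getsizeof`, which is only a shallow estimate), and a budget of 0 MB is assumed to disable caching.

```python
import sys
from collections import OrderedDict


class LruByteCache:
  """Hypothetical memory-bounded cache with least-recently-used eviction."""

  def __init__(self, max_mb: int):
    self._max_bytes = max_mb << 20  # MB -> bytes
    self._used_bytes = 0
    self._entries = OrderedDict()  # key -> (value, size_in_bytes)

  def get(self, key):
    entry = self._entries.get(key)
    if entry is None:
      return None
    self._entries.move_to_end(key)  # mark as most recently used
    return entry[0]

  def put(self, key, value, size_bytes=None):
    size = sys.getsizeof(value) if size_bytes is None else size_bytes
    if key in self._entries:
      self._used_bytes -= self._entries.pop(key)[1]
    if size > self._max_bytes:
      return  # larger than the whole budget: do not cache
    self._entries[key] = (value, size)
    self._used_bytes += size
    # Evict from the least recently used end until within budget.
    while self._used_bytes > self._max_bytes:
      _, (_, evicted_size) = self._entries.popitem(last=False)
      self._used_bytes -= evicted_size

  def used_bytes(self) -> int:
    return self._used_bytes
```

Because each SDK Harness process would hold its own instance of such a cache, the memory cost on a worker scales with the number of harness processes, which is the provisioning concern raised in the suggested help text.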
Merging this since tests pass.
Enable state_cache_size = 100 MB for the Python SDK.
Fixes: #28770
state_cache_size can be enabled using `--state_cache_size=<X>MB`. state_cache_size should be given in megabytes.
EDIT: From the doc (https://docs.google.com/document/u/1/d/1gllYsIFqKt4TWAxQmXU_-sw7SLnur2Q69d__N0XBMdE/edit?usp=drive_open&ouid=102749919556839394679), the consensus is to add a pipeline option named `max_cache_memory_usage_mb` and to explain in the Beam docs and runner docs what this option is and how it works.