Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python]Enable state cache to 100 MB #28781

Merged
merged 13 commits into from
Oct 31, 2023

Conversation

AnandInguva
Copy link
Contributor

@AnandInguva AnandInguva commented Oct 3, 2023

Enable state_cache_size = 100 MB for python SDK.
Fixes: #28770

state_cache_size can be enabled using --state_cache_size=<X>MB. state_cache_size should be in terms of Megabytes.

EDIT:

From the doc - https://docs.google.com/document/u/1/d/1gllYsIFqKt4TWAxQmXU_-sw7SLnur2Q69d__N0XBMdE/edit?usp=drive_open&ouid=102749919556839394679, the consensus is to add a pipeline option named max_cache_memory_usage_mb and explain in Beam docs and runner docs on what this option is and how this options works.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@codecov
Copy link

codecov bot commented Oct 3, 2023

Codecov Report

Merging #28781 (aa94309) into master (a93fa51) will decrease coverage by 0.05%.
Report is 24 commits behind head on master.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master   #28781      +/-   ##
==========================================
- Coverage   38.36%   38.32%   -0.05%     
==========================================
  Files         687      688       +1     
  Lines      101745   101833      +88     
==========================================
- Hits        39037    39027      -10     
- Misses      61129    61230     +101     
+ Partials     1579     1576       -3     
Flag Coverage Δ
python 29.94% <85.71%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...dks/python/apache_beam/options/pipeline_options.py 64.91% <100.00%> (+0.07%) ⬆️
...apache_beam/runners/portability/portable_runner.py 28.27% <ø> (ø)
...thon/apache_beam/runners/worker/sdk_worker_main.py 66.66% <83.33%> (+0.77%) ⬆️

... and 8 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@AnandInguva AnandInguva changed the title [WIP][Python]Enable state cache to 100 MB [Python]Enable state cache to 100 MB Oct 3, 2023
@AnandInguva
Copy link
Contributor Author

R: @tvalentyn

@github-actions
Copy link
Contributor

github-actions bot commented Oct 3, 2023

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@AnandInguva
Copy link
Contributor Author

AnandInguva commented Oct 4, 2023

Java has default of 100 MB

Do we want to have a similar flag to workerCacheMB? state_cache_size to state_cache_size_mb or worker_cache_mb?

@AnandInguva AnandInguva marked this pull request as ready for review October 4, 2023 18:37
@tvalentyn
Copy link
Contributor

state is a technical term that's not very user-friendly; it is good to have consistent naming. @lostluck does Go have state cache as well?

@lostluck
Copy link
Contributor

lostluck commented Oct 4, 2023

state is a technical term that's not very user-friendly; it is good to have consistent naming. @lostluck does Go have state cache as well?

It does, but it's not memory aware, just element sized.

Technically, it's used for cross bundle applications across the State API which side inputs also use.

Unless the cache value also applies to the Combiner Lifting cache, it wouldn't be a true "worker" cache, vs a "state" cache.

sdks/python/apache_beam/runners/worker/sdk_worker_main.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/runners/worker/sdk_worker_main.py Outdated Show resolved Hide resolved
return 0
if not state_cache_size:
# to maintain backward compatibility
for experiment in experiments:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty sure there is already a helper that does this parsing.

sdks/python/apache_beam/runners/worker/sdk_worker_main.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/options/pipeline_options.py Outdated Show resolved Hide resolved
'--state_cache_size_mb',
dest='state_cache_size',
'--max_cache_memory_usage_mb',
dest='max_cache_memory_usage_mb',
type=int,
default=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any concerns to define the 100mb default here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current flow: If it is None here, it gives us an opportunity to look in --experiements for state_cache_size.

If the value is defined here as 100 MB, and if the user passes --experiments=state_cache_size, we should override 100 MB for the --experiments=state_cache_size.

I don't see any concerns of setting default here. might need to change some code though

sdks/python/apache_beam/runners/worker/sdk_worker_main.py Outdated Show resolved Hide resolved
CHANGES.md Outdated
@@ -74,6 +74,7 @@ should handle this. ([#25252](https://github.com/apache/beam/issues/25252)).
jobs using the DataStream API. By default the option is set to false, so the batch jobs are still executed
using the DataSet API.
* `upload_graph` as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK ([PR#28621](https://github.com/apache/beam/pull/28621).
* state cache has been enabled to a default of 100 MB. Use `--max_cache_memory_usage_mb=X` to provide cache size. (Python) ([#28770](https://github.com/apache/beam/issues/28770)).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Particularly, we should make it clear here that this impacts both user state and side inputs

sdks/python/apache_beam/options/pipeline_options.py Outdated Show resolved Hide resolved
@damccorm
Copy link
Contributor

Run Python_PVR_Flink PreCommit

type=int,
default=100,
help=(
'Size of the SdkHarness cache to store user state and side inputs '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider following wording

            'Size of the SDK Harness cache to store user state and side inputs '
            'in MB. Default is 100MB. If the cache is full, least recently '
            'used elements will be evicted.  This cache is per '
            'each SDK Harness instance. SDK Harness  is a component  responsible '
            'for executing the user code and communicating with the runner. ' 
            'Depending on the runner, '
            'there may be more than one SDK Harness process running on the same worker node. '
            'Increasing cache size  might improve performance of some pipelines, but can lead to an increase '
            'in memory consumption and OOM errors if workers are not appropriately provisioned.'

@AnandInguva
Copy link
Contributor Author

Merging this since tests pass.

@AnandInguva AnandInguva merged commit d329a7e into apache:master Oct 31, 2023
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request]: Enable statecache for Python SDK
4 participants