[Task]: Improve python statecache #28802

Open
16 tasks
AnandInguva opened this issue Oct 3, 2023 · 3 comments

Comments

@AnandInguva
Contributor

AnandInguva commented Oct 3, 2023

What needs to happen?

As a first step, the Python SDK will change the default state cache size from 0 MB to 100 MB, which enables the cache. Beyond that, several improvements could be made to the state cache. For example:

  • auto-size the cache based on the amount of memory dedicated to the Docker container
  • add pipeline options that let users control the percentage of memory used for the cache (with a default of 20%) instead of relying on an experiment
  • cache beyond the first page
  • have a write-through cache for user state
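
The first two ideas in the list above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Beam code: the helper name `cache_size_mb` and its parameters are hypothetical, and how the container's memory limit is discovered is deliberately left out.

```python
# Hypothetical helper, for illustration only; not part of the Beam SDK.
DEFAULT_CACHE_FRACTION = 0.20  # the 20% default proposed in the list above


def cache_size_mb(container_memory_bytes, cache_fraction=DEFAULT_CACHE_FRACTION):
    """Return a cache budget in whole MB for a given container memory limit."""
    if container_memory_bytes <= 0:
        raise ValueError("container_memory_bytes must be positive")
    if not 0.0 <= cache_fraction <= 1.0:
        raise ValueError("cache_fraction must be between 0 and 1")
    # Take the requested share of memory, then round down to whole MB.
    return int(container_memory_bytes * cache_fraction) // (1024 * 1024)
```

With these assumptions, a container limited to 4 GiB would get `cache_size_mb(4 * 1024**3) == 819` MB by default, and a user-supplied fraction would simply replace the 20% default.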

The Java implementation of the cache is in https://github.com/apache/beam/blob/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/Caches.java, and most of the caching complexity lives in https://github.com/apache/beam/blob/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/StateFetchingIterators.java. The views over these caches perform view-specific operations (e.g. merging the old view of the data with in-memory updates). In general, reading the code in https://github.com/apache/beam/tree/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state should answer most questions.
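
To make the idea of a write-through cache with a merging view more concrete, here is a small Python sketch. It is not a port of the Java code linked above and does not use any Beam API: `InMemoryBackend` is a stand-in for the runner's state service, and `WriteThroughBagState` is a hypothetical bag-state view invented for this example.

```python
class InMemoryBackend:
    """Hypothetical stand-in for the runner's state service."""

    def __init__(self):
        self.data = {}
        self.reads = 0  # counts round trips, to show what the cache saves

    def get(self, key):
        self.reads += 1
        return list(self.data.get(key, []))

    def append(self, key, values):
        self.data.setdefault(key, []).extend(values)


class WriteThroughBagState:
    """Illustrative bag-state view over a shared cache (not a Beam class)."""

    def __init__(self, backend, cache, key):
        self._backend = backend
        self._cache = cache  # shared dict: state key -> persisted values
        self._key = key
        self._pending = []  # uncommitted in-memory updates

    def add(self, value):
        self._pending.append(value)

    def read(self):
        if self._key not in self._cache:
            self._cache[self._key] = self._backend.get(self._key)
        # Merge the cached (persisted) view with the in-memory updates.
        return self._cache[self._key] + self._pending

    def commit(self):
        if not self._pending:
            return
        self._backend.append(self._key, self._pending)
        # Write through: refresh the cached copy instead of invalidating it.
        if self._key in self._cache:
            self._cache[self._key] = self._cache[self._key] + self._pending
        self._pending = []
```

In this sketch, a read after `commit()` is served from the cache without another backend round trip, which is the benefit a write-through cache would bring over invalidating the entry on every write.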

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@AnandInguva
Contributor Author

cc: @tvalentyn

@github-actions github-actions bot added the P3 label Oct 3, 2023
@AnandInguva
Contributor Author

AsIter view_fn

An iterable may be consumed one element at a time, so could this mean more reads against the side input cache in the GCS bucket?

AsList view_fn

A list is materialized, so we wouldn’t need too many reads from the side input cache in the GCS bucket?

For AsIter with state_cache_size=100 MB,

@tvalentyn
Contributor

State cache was enabled in #28770.
