apache / beam Public

[Task]: Improve python statecache #28802

Open

16 tasks

AnandInguva opened this issue Oct 3, 2023 · 3 comments

Labels

awaiting triage P3 task

AnandInguva commented Oct 3, 2023 (edited)

What needs to happen?

Initially, for the Python SDK, we will raise the state cache size from 0 MB to 100 MB. After that, there are some improvements that could be made to the state cache. For example:

- Auto-size the cache based on the amount of memory dedicated to the Docker container.
- Add pipeline options that let users control the percentage of memory used for the cache (with a default of 20%) instead of relying on an experiment.
- Cache beyond the first page.
- Have a write-through cache for user state.
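
As a rough illustration of the first two items, here is a minimal sketch of sizing the cache as a fraction of the container memory limit. The function names, the 100 MB fallback, and the choice of reading the cgroup limit files are assumptions made for this example; none of this is existing Beam code or a Beam pipeline option.

```python
from pathlib import Path
from typing import Optional

# Default fraction taken from the proposal above (20% of container memory).
DEFAULT_CACHE_FRACTION = 0.20

# Usual cgroup locations for a container memory limit (v2 first, then v1).
_CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def read_container_memory_limit_bytes() -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if it is not available."""
    for path in _CGROUP_LIMIT_FILES:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # File missing or unreadable: try the next location.
        if raw.isdigit():  # cgroup v2 reports "max" when no limit is set.
            return int(raw)
    return None


def state_cache_size_mb(
    memory_limit_bytes: Optional[int],
    fraction: float = DEFAULT_CACHE_FRACTION,
    fallback_mb: int = 100,
) -> int:
    """Hypothetical helper: cache size in MB as a fraction of the memory limit.

    Falls back to a fixed size when the limit is unknown.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if memory_limit_bytes is None:
        return fallback_mb
    return int(memory_limit_bytes * fraction) // (1024 * 1024)
```

A caller could combine the two as `state_cache_size_mb(read_container_memory_limit_bytes())`. A real implementation would also need to handle the very large sentinel value that cgroup v1 reports when no limit is set.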

The Java implementation of the cache is in: https://github.com/apache/beam/blob/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/Caches.java

Most of the caching complexity is within: https://github.com/apache/beam/blob/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/StateFetchingIterators.java, with the views over these caches performing view-specific operations (e.g. merging the old view of the data with in-memory updates).

Generally, understanding the code in https://github.com/apache/beam/tree/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state should provide most answers.
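
To make the "write-through cache for user state" idea and the "merge the old view with in-memory updates" step more concrete, here is a small, self-contained sketch. The class names and the `InMemoryBackend` stand-in for the runner's state service are hypothetical; this does not mirror the Java harness code or any Beam API.

```python
from typing import Dict, Hashable, Iterable, List


class InMemoryBackend:
    """Stand-in for the runner's state service; counts calls for illustration."""

    def __init__(self) -> None:
        self.data: Dict[Hashable, List[bytes]] = {}
        self.reads = 0
        self.writes = 0

    def get(self, key: Hashable) -> List[bytes]:
        self.reads += 1
        return list(self.data.get(key, []))

    def append(self, key: Hashable, values: List[bytes]) -> None:
        self.writes += 1
        self.data.setdefault(key, []).extend(values)

    def clear(self, key: Hashable) -> None:
        self.writes += 1
        self.data.pop(key, None)


class WriteThroughStateCache:
    """Hypothetical write-through cache over bag-like user state."""

    def __init__(self, backend: InMemoryBackend) -> None:
        self._backend = backend
        self._cache: Dict[Hashable, List[bytes]] = {}

    def get(self, key: Hashable) -> List[bytes]:
        # Read from the backend only on a cache miss.
        if key not in self._cache:
            self._cache[key] = self._backend.get(key)
        return list(self._cache[key])

    def append(self, key: Hashable, values: Iterable[bytes]) -> None:
        values = list(values)
        # Write through to the backend first, so it stays the source of truth.
        self._backend.append(key, values)
        # Merge the update into the cached view only if we already hold one;
        # otherwise the previous contents are unknown and must be fetched later.
        if key in self._cache:
            self._cache[key].extend(values)

    def clear(self, key: Hashable) -> None:
        self._backend.clear(key)
        # After a clear, the state is known to be empty without a read.
        self._cache[key] = []
```

This sketch never evicts entries, so it ignores the size limit discussed above; a bounded version would need an eviction policy on top.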

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner


AnandInguva added task awaiting triage labels

AnandInguva commented Oct 3, 2023


github-actions bot added the P3 label

AnandInguva commented Oct 3, 2023

AsIter view_fn

An iterable might be read one element at a time, which could mean more reads from the side input cache in the GCS bucket?

https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-10-03_13_02_55-13940213406894669659
The pipeline scaled up to ~47 workers for a simple job.

AsList view_fn
A list is materialized, so we wouldn't need as many reads from the side input cache in the GCS bucket?

https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-10-03_13_03_00-15374614790418733789
The pipeline completed as expected.

AsIter with state_cache_size=100 MB

https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-10-03_13_23_37-9771630946918171566
With the state cache enabled, this job completed as expected because the side input gets cached.
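
One possible reading of these three runs, sketched as a toy simulation: if an iterable side input is fetched page by page every time it is consumed, the number of fetches grows with the number of consumers, while a cache that keeps every page (not only the first) pays the cost once. The names, the page size, and the paging scheme below are assumptions for illustration and do not reflect how Dataflow or the Beam SDK actually serve side inputs.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

PAGE_SIZE = 2  # Hypothetical page size, chosen only for this example.

Page = Tuple[List[int], Optional[int]]


class PagedSideInputSource:
    """Stand-in for a remote side input store that serves fixed-size pages."""

    def __init__(self, elements: Iterable[int], page_size: int = PAGE_SIZE) -> None:
        self._elements = list(elements)
        self._page_size = page_size
        self.fetches = 0  # Number of remote reads performed so far.

    def fetch_page(self, token: int = 0) -> Page:
        """Return one page and the token of the next page (None on the last page)."""
        self.fetches += 1
        end = token + self._page_size
        next_token = end if end < len(self._elements) else None
        return self._elements[token:end], next_token


def iterate_uncached(source: PagedSideInputSource) -> Iterator[int]:
    """Iterate the side input, fetching every page again on each pass."""
    token: Optional[int] = 0
    while token is not None:
        page, token = source.fetch_page(token)
        yield from page


class CachedSideInput:
    """Iterable that remembers every page it fetched, not only the first one."""

    def __init__(self, source: PagedSideInputSource) -> None:
        self._source = source
        self._pages: Dict[int, Page] = {}

    def __iter__(self) -> Iterator[int]:
        token: Optional[int] = 0
        while token is not None:
            if token not in self._pages:
                self._pages[token] = self._source.fetch_page(token)
            page, token = self._pages[token]
            yield from page
```

With six elements (three pages) and four main-input elements that each iterate the side input, the uncached path performs twelve fetches and the cached path performs three.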

tvalentyn commented Dec 13, 2023

The state cache was enabled in #28770.


Assignees

No one assigned



2 participants
