
discussion: limit the number of (stateful) actors per CN #16092

Open
fuyufjh opened this issue Apr 2, 2024 · 3 comments

Comments

fuyufjh (Member) commented Apr 2, 2024

Motivation

Generally, when there are many (e.g. 10+) streaming jobs running in one RisingWave instance, it's no longer a good idea to use the full number of CPU cores as the parallelism for every fragment. This proposal tries to address that problem.

Recently, we found several issues related to the number of actors per CN:

  1. Local HummockUploader failed to make the new version in one checkpoint_interval, which caused barriers to pile up. Fixed by perf(storage): simplify table watermark index #15931.
  2. Actor-level metrics can sometimes be too heavy for Prometheus. See "Incredible many metrics exported from meta and compute nodes" #14821.
  3. For various reasons, each actor has a fixed memory footprint. The more actors there are, the less memory is left for our streaming cache. In the longevity test, the cached data is nearly zero.

Design

I think we could introduce a soft limit and a hard limit for each CN.

  • When the number of actors is below the soft limit, nothing happens. This should be the case for 90% of users.
  • When the number of actors is above the soft limit but below the hard limit, a notice message will be shown whenever users create new materialized views/sinks or scale an existing job, urging them to scale in some streaming queries.
  • When the number of actors is above the hard limit, RisingWave will refuse the request and return an error.

In the notice message, users are encouraged to use the ALTER command to set a smaller parallelism on existing streaming jobs.

Implementation

The implementation is trivial, but we need to carefully pick a default threshold. For example,

actors_soft_limit_per_core = 100
actors_hard_limit_per_core = 200
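
The three-tier behavior above can be sketched as follows. This is only an illustration of the proposal, not RisingWave's actual implementation; the constant names, the `ActorLimitCheck` type, and the assumption that limits scale linearly with the CN's core count are all hypothetical:

```rust
// Hypothetical defaults, taken from the example thresholds above.
const ACTORS_SOFT_LIMIT_PER_CORE: usize = 100;
const ACTORS_HARD_LIMIT_PER_CORE: usize = 200;

#[derive(Debug, PartialEq)]
enum ActorLimitCheck {
    /// Below the soft limit: proceed silently.
    Ok,
    /// Between the soft and hard limits: proceed, but show a notice.
    Notice(String),
    /// Above the hard limit: refuse the request with an error.
    Reject(String),
}

/// Check a CN's prospective actor count against per-core limits,
/// assuming the limits scale linearly with the number of cores.
fn check_actor_limit(actor_count: usize, cores: usize) -> ActorLimitCheck {
    let soft = ACTORS_SOFT_LIMIT_PER_CORE * cores;
    let hard = ACTORS_HARD_LIMIT_PER_CORE * cores;
    if actor_count > hard {
        ActorLimitCheck::Reject(format!(
            "{actor_count} actors exceeds the hard limit of {hard}"
        ))
    } else if actor_count > soft {
        ActorLimitCheck::Notice(format!(
            "{actor_count} actors exceeds the soft limit of {soft}; \
             consider lowering the parallelism of existing streaming jobs"
        ))
    } else {
        ActorLimitCheck::Ok
    }
}

fn main() {
    // A 4-core CN: soft limit 400, hard limit 800.
    println!("{:?}", check_actor_limit(300, 4)); // Ok
    println!("{:?}", check_actor_limit(500, 4)); // Notice(..)
    println!("{:?}", check_actor_limit(900, 4)); // Reject(..)
}
```

The check would naturally run in the meta node when handling create/scale requests, since that is where the target actor placement per CN is known.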

TBD

  • Should we limit only stateful actors, or all actors? I ask mainly because we now have multiple actors per table under the new design of DML. cc @st1page
@github-actions github-actions bot added this to the release-1.8 milestone Apr 2, 2024
@fuyufjh fuyufjh modified the milestones: release-1.8, release-1.9 Apr 8, 2024
@fuyufjh fuyufjh removed this from the release-1.9 milestone May 14, 2024
lmatz (Contributor) commented May 14, 2024

link: #15668
Not super related, but it builds on the same premise: "it's no longer a good idea to use full CPU cores for all fragments".

st1page (Contributor) commented May 15, 2024

How about directly limiting the number of physical state table instances?

github-actions bot commented Aug 1, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄
