
feat: introduce cluster limit #18383

Merged
merged 15 commits into from
Sep 6, 2024

Conversation

hzxa21 (Collaborator) commented Sep 3, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Related discussion: #16092

To ensure a smoother user experience when cluster resources are insufficient, this PR introduces a cluster limit in the kernel. In the first stage, the cluster limit only covers actor count per worker parallelism, and the default limits leave plenty of headroom. The behavior is:

  • The soft limit is configured via meta config (defaults to 100).
  • The hard limit is configured via meta config (defaults to 400).
  • The limits can be bypassed via a session config.

When the soft limit is hit, CREATE MV/TABLE/SINK succeeds with a notice message attached.
When the hard limit is hit, CREATE MV/TABLE/SINK fails with an error message attached.

Example:

dev=> create materialized view test_cluster_limit as select * from test;
CREATE_MATERIALIZED_VIEW

dev=> create materialized view test_cluster_limit_exceed_soft as select * from test;
NOTICE:  
- Actor count per parallelism exceeds the recommended limit.
- Depending on your workload, this may overload the cluster and cause performance/stability issues. Scaling the cluster is recommended.
- Contact us via slack or https://risingwave.com/contact-us/ for further enquiry.
- You can bypass this check via SQL `SET bypass_cluster_limits TO true`.
- You can check actor count distribution via SQL `SELECT * FROM rw_worker_actor_count`.
ActorCountPerParallelism { critical limit: 8, recommended limit: 7. worker_id_to_actor_count: ["1 -> WorkerActorCount { actor_count: 32, parallelism: 4 }"] }
CREATE_MATERIALIZED_VIEW

dev=> create materialized view test_cluster_limit_exceed_hard as select * from test;
ERROR:  Failed to run the query

Caused by:
  Protocol error: 
- Actor count per parallelism exceeds the critical limit.
- Depending on your workload, this may overload the cluster and cause performance/stability issues. Please scale the cluster before proceeding!
- Contact us via slack or https://risingwave.com/contact-us/ for further enquiry.
- You can bypass this check via SQL `SET bypass_cluster_limits TO true`.
- You can check actor count distribution via SQL `SELECT * FROM rw_worker_actor_count`.
ActorCountPerParallelism { critical limit: 7, recommended limit: 6. worker_id_to_actor_count: ["1 -> WorkerActorCount { actor_count: 32, parallelism: 4 }"] }
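The check shown in the examples above can be sketched roughly as follows. This is an illustrative Rust sketch with hypothetical names, not the actual RisingWave implementation: it computes the actor count per unit of parallelism for each worker and compares it against the soft (recommended) and hard (critical) limits.

```rust
/// Per-worker stats, mirroring the `WorkerActorCount` shown in the
/// notice/error output above (names are illustrative).
struct WorkerActorCount {
    actor_count: usize,
    parallelism: usize,
}

enum LimitCheck {
    Ok,
    SoftExceeded, // emit a SQL NOTICE, but let CREATE proceed
    HardExceeded, // reject CREATE with an error
}

/// Compare each worker's actors-per-parallelism ratio against the limits.
/// A hard violation on any worker wins over soft violations elsewhere.
fn check_actor_count(
    workers: &[WorkerActorCount],
    soft_limit: usize,
    hard_limit: usize,
) -> LimitCheck {
    let mut result = LimitCheck::Ok;
    for w in workers {
        // Ceiling division so a partially filled slot still counts.
        let p = w.parallelism.max(1);
        let per_parallelism = (w.actor_count + p - 1) / p;
        if per_parallelism > hard_limit {
            return LimitCheck::HardExceeded;
        }
        if per_parallelism > soft_limit {
            result = LimitCheck::SoftExceeded;
        }
    }
    result
}
```

For the worker in the example (32 actors, parallelism 4, i.e. 8 actors per parallelism), this returns `SoftExceeded` against limits (7, 400) and `HardExceeded` against limits (6, 7), matching the two transcripts above.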

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

A limit on actor count per parallelism (i.e. per CPU core) is added to warn users on CREATE MATERIALIZED VIEW/TABLE/SINK statements, ensuring a smoother user experience when cluster resources are insufficient. Both a soft and a hard limit are introduced:

  • meta_actor_cnt_per_worker_parallelism_soft_limit (defaults to 100): when this limit is exceeded, CREATE MATERIALIZED VIEW/TABLE/SINK can still succeed, but a SQL notice message will be included when the statement returns.
  • meta_actor_cnt_per_worker_parallelism_hard_limit (defaults to 400): when this limit is exceeded, CREATE MATERIALIZED VIEW/TABLE/SINK will fail, and an error message will be included when the statement returns.

Notes:

  • The limits do not affect database objects already created in the cluster. In other words, there will be no disruption to MVs/TABLEs/SINKs that already exist; only future CREATE MATERIALIZED VIEW/TABLE/SINK statements are affected.
  • Scaling the cluster or reducing the streaming job parallelism is the preferred way to stay within the limits.
  • The default limits are subject to change in future releases.
  • If users believe it is safe for the cluster to run with the limits exceeded, there are two ways to override the behavior:
    • Bypass the check via session variable: SET bypass_cluster_limits TO true.
    • Increase the limits via meta developer config:
    [meta.developer]
    meta_actor_cnt_per_worker_parallelism_soft_limit = 100
    meta_actor_cnt_per_worker_parallelism_hard_limit = 400
    Keep in mind, however, that the cluster can be overloaded, with potential degradation in stability/availability/performance, if the check is bypassed or the limits are manually increased. Please proceed with caution and at your own risk.

@fuyufjh fuyufjh requested a review from neverchanje September 4, 2024 03:00
hzxa21 (Collaborator, Author) commented Sep 4, 2024

Example for free-tier: No longer applicable

> create table t_free_tier_exceed_soft(a int);
NOTICE:  
- Actor count per parallelism exceeds the recommended limit.
- This may overload the cluster and cause performance/stability issues. Scaling the cluster is recommended.
- Contact us via https://risingwave.com/contact-us/ and consider upgrading your free-tier license to enhance performance/stability and user experience for production usage.
- You can check actor count distribution via SQL `SELECT * FROM rw_worker_actor_count`.
ActorCountPerParallelism { critical limit: 1, recommended limit: 0. worker_id_to_actor_count: ["1 -> WorkerActorCount { actor_count: 4, parallelism: 4 }"] }
CREATE_TABLE


> create table t_free_tier_exceed_hard(a int);
ERROR:  Failed to run the query

Caused by:
  Protocol error: 
- Actor count per parallelism exceeds the critical limit.
- This may overload the cluster and cause performance/stability issues. Please scale the cluster before proceeding.
- Contact us via https://risingwave.com/contact-us/ and consider upgrading your free-tier license to enhance performance/stability and user experience for production usage.
- You can check actor count distribution via SQL `SELECT * FROM rw_worker_actor_count`.
ActorCountPerParallelism { critical limit: 1, recommended limit: 0. worker_id_to_actor_count: ["1 -> WorkerActorCount { actor_count: 6, parallelism: 4 }"] }


neverchanje (Contributor):

Please remember to add a notice on the release note.
Open-source users may fail to create new materialized views when they have reached the limit.

Review threads (outdated, resolved) on:
  • proto/meta.proto
  • src/common/src/session_config/mod.rs
  • src/meta/service/src/cluster_limit_service.rs
  • src/frontend/src/session.rs
fuyufjh (Member) left a comment:

LGTM for the rest

message ClusterLimit {
  oneof limit {
    ActorCountPerParallelism actor_count = 1;
    // TODO: limit DDL using compaction pending bytes
  }
}
fuyufjh (Member): Are you sure they are in an either-or relation?

self.env.opts.actor_cnt_per_worker_parallelism_soft_limit,
self.env.opts.actor_cnt_per_worker_parallelism_hard_limit,
),
Ok(Tier::Free) => (
fuyufjh (Member) commented Sep 4, 2024:

It still seems a bit controversial to decide whether and how we should set a different limit for free users. Thus, I tend to use the same limit for both paid and free users for now, until we have clearer ideas in the future.

This also applies to bypass_cluster_limits, which free users are currently unable to modify.

hzxa21 (Collaborator, Author):

+1. I have updated the PR to unify the behavior for different tiers.

@hzxa21 hzxa21 added user-facing-changes Contains changes that are visible to users need-cherry-pick-release-2.0 labels Sep 6, 2024
@hzxa21 hzxa21 enabled auto-merge September 6, 2024 06:18
@hzxa21 hzxa21 added this pull request to the merge queue Sep 6, 2024
Merged via the queue into main with commit 4accc4a Sep 6, 2024
47 of 48 checks passed
@hzxa21 hzxa21 deleted the patrick/cluster-limit.pr branch September 6, 2024 07:16