Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): support compaction group split and merge #17898

Closed
wants to merge 52 commits into from

Conversation

Li0k
Copy link
Contributor

@Li0k Li0k commented Aug 1, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously, we implemented the concept of compaction groups for hummock to achieve write/configuration isolation and to create new compaction groups via the table-level move function.

Previously, we implemented the concept of compaction groups for hummock to achieve write/configuration isolation and to create new compaction groups via the table-level move function.

Since the sst key range generated by the move function is not contiguous, this makes it difficult to re-combine the resulting independent compaction groups (currently not supported). Because of this concern, we set the move threshold very high, fearing that the resulting compaciton group will not be able to be merged again, thus creating a large number of compaction groups.

This concern is contrary to our expectation that in scenarios with high write throughput (backfill, etc.), we would like to split the compaction to achieve parallel compaction, increase the overall compaction throughput, and make full use of the compactor node's scale capability.

Therefore, we redesigned hummock's scalability by implementing split / merge instead of move, in order to obtain the following capabilities.

  1. adopt more lenient split conditions to cope with high write/burst write traffic scenarios, to extend compaction capability horizontally.
  2. auto-merge compaction groups after write throughput is restored.
  3. refine the granularity of Split from table to key_range, to split huge compaction groups at a finer granularity
  4. Better O&M tools to manually split and merge compaction groups.

For the current PR, we expect to implement table-level Split/Merge capability (phase 1), and plan to further implement Key-level Split in the subsequent PRs, and make CN not aware of the concept of group_id. Here are the changes involved in this PR for review.

  1. add the concept of estimated_size to SstableInfo for more accurate Compaction policy calculation.
  2. based on table_id level Split function, compatible with the original Split interface.
  • Fix key_range/table_id/estimated size after sst split.
  • Use (table_id, vnode = 0, epoch = 0) as split key for splitting.
  1. merge compaction group
  2. schedule capability for automatic split/merge.
  3. compactor iterator support key_range filter

I'll summarise the changes to the file, for your code reivew:

  1. src/meta/src/hummock/manager/commit_epoch.rs adapt the logic of commit_epoch so that all NewTableFragmentInfo creates a new compaction group
  2. src/meta/src/hummock/manager/compaction/compaction_group_schedule.rs introduce helper functions that facilitate splitting, refactor the split interface based on the new implementation, and provide a simple auto split/merge strategy. (The auto split strategy is based on the previous design).
  3. src/meta/src/hummock/manager/timer_task.rs periodically call the try function provided by schedule to implement auto split/merge.
  4. src/storage/hummock_sdk/src/compaction_group/hummock_version_ext.rs reimplemented split logic based on split_key.
  5. src/storage/hummock_sdk/src/compaction_group/mod.rs provides a wrapper for the split_key based sst / level split implementation.
  6. src/storage/hummock_sdk/src/version.rs removed cg2 / cg3 special assumptions.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@@ -28,6 +28,7 @@ message SstableInfo {
uint64 uncompressed_file_size = 11;
uint64 range_tombstone_count = 12;
BloomFilterType bloom_filter_kind = 13;
uint64 estimated_sst_size = 14; // for compaction
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need more detailed documentation to explain why we need this new field and the difference between estimated_sst_size and file_size

parent_group_id,
table_ids,
table_ids[0],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we assume the provided table_ids always contain 1 element? If yes, we should change the method signature instead of making assumption here.

let split_key = {
let mut vnode_index = vnode.to_index();

if vnode == VirtualNode::MAX {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an issue in this PR but just a reminder: not all state table will exhaust the vnode space [0, VirutalNode::MAX]. IIRC, some executor will only use vnode 0 in its state table.

proto/hummock.proto Show resolved Hide resolved
Comment on lines 656 to 673
let member_table_ids = versioning
.current_version
.state_table_info
.compaction_group_member_table_ids(parent_group_id);

if !member_table_ids.contains(&TableId::new(table_id)) {
return Err(Error::CompactionGroup(format!(
"table {} doesn't in group {}",
table_id, parent_group_id
)));
}

let table_ids = member_table_ids
.iter()
.map(|table_id| table_id.table_id)
.collect_vec();

let pos = table_ids.partition_point(|&id| id <= table_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: member_table_ids is a BTreeSet. I think we don't need to turn it into a vec?

Comment on lines 675 to 680
if pos == 0 {
return Err(Error::CompactionGroup(format!(
"Not to split ({}) all tables from group {}",
table_id, parent_group_id
)));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check is duplicated with the check in L661.

Comment on lines 850 to 882
let member_table_ids_1 = state_table_info
.compaction_group_member_table_ids(group_1)
.iter()
.cloned()
.collect_vec();

let member_table_ids_2 = state_table_info
.compaction_group_member_table_ids(group_2)
.iter()
.cloned()
.collect_vec();

let mut combine_1 = member_table_ids_1.clone();
combine_1.extend_from_slice(&member_table_ids_2);

let mut combine_2 = member_table_ids_2;
combine_2.extend_from_slice(&member_table_ids_1);

if !combine_1.is_sorted() && !combine_2.is_sorted() {
return Err(Error::CompactionGroup(format!(
"invalid merge group_1 {} group_2 {}",
group_1, group_2
)));
}

let mut left_group_id = group_1;
let mut right_group_id = group_2;
let combine_member_table_ids = if combine_1.is_sorted() {
combine_1
} else {
std::mem::swap(&mut left_group_id, &mut right_group_id);
combine_2
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compaction_group_member_table_ids returns a BTreeSet. The logic here can be simplified to:

let member_table_ids_1 = state_table_info.compaction_group_member_table_ids(group_1);
let member_table_ids_2 = state_table_info.compaction_group_member_table_ids(group_2);
assert!(!member_table_ids_1.is_empty());
assert!(!member_table_ids_2.is_empty());

let (left_group, right_group) = if member_table_ids_1.first().unwrap() < member_table_ids_2.first().unwrap() {
    ((group_1, member_table_ids_1), (group_2, member_table_ids_2))
} else {
    ((group_2, member_table_ids_2), (group_1, member_table_ids_1))
}

if left_group.1.last().unwrap() > right_group.1.first().unwrap() {
    return Err(Error::CompactionGroup(format!(
                "invalid merge group_1 {} group_2 {}",
                left_group, right_group
            )));
}

let (left_group_id, right_group_id) = (left_group.0, right_group.0);
let combine_member_table_ids = left_group.1.iter().chain(right_group.1.iter()).cloned().collect_vec();

Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost finish the review. Let me post the comments I have so far first before I continue to add more.

src/common/src/config.rs Outdated Show resolved Hide resolved
src/storage/hummock_sdk/src/compaction_group/mod.rs Outdated Show resolved Hide resolved
src/storage/hummock_sdk/src/compaction_group/mod.rs Outdated Show resolved Hide resolved
let sst_split_type = need_to_split(sst, split_key.clone());
match sst_split_type {
SstSplitType::Left => {
left_sst.push(sst.clone());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking whether we should use filter_map here when iterating level.table_infos to avoid cloning SSTableInfo when the sst doesn't need to be split. We have seen performance issue when cloning SSTableInfo. Given that apply_version_delta happens in the critical path of meta and CN operation, please be careful about SSTableInfo clone in your PR and try to avoid it as much as possible.

// for backward-compatibility of previous hummock version delta
BTreeSet::from_iter(group_construct.table_ids.clone())
};

self.init_with_parent_group(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove these codes, I think we may break backward compatibility if there is an old group delta for split constructed in old version (no split_key) but not yet applied. In this case, compaction group id in state_table_info will be modified but actually the split doesn't happen.

@hzxa21
Copy link
Collaborator

hzxa21 commented Sep 13, 2024

Btw, I think the PR is larger than I expected. To reduce the risk of this PR, is it possible to further breaking it into small ones so that we can merge small PRs one by one.

@Li0k
Copy link
Contributor Author

Li0k commented Sep 13, 2024

Btw, I think the PR is larger than I expected. To reduce the risk of this PR, is it possible to further breaking it into small ones so that we can merge small PRs one by one.

Thanks, let's review the code framework first before "ready to review"

Copy link
Contributor

This PR has been open for 60 days with no activity.

If it's blocked by code review, feel free to ping a reviewer or ask someone else to review it.

If you think it is still relevant today, and have time to work on it in the near future, you can comment to update the status, or just manually remove the no-pr-activity label.

You can also confidently close this PR to keep our backlog clean. (If no further action taken, the PR will be automatically closed after 7 days. Sorry! 🙏)
Don't worry if you think the PR is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

Copy link
Contributor

Close this PR as there's no further actions taken after it is marked as stale for 7 days. Sorry! 🙏

You can reopen it when you have time to continue working on it.

@github-actions github-actions bot closed this Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants