-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(storage): support compaction group split and merge #17898
Conversation
…nto li0k/storage_split
…nto li0k/storage_split
proto/hummock.proto
Outdated
@@ -28,6 +28,7 @@ message SstableInfo { | |||
uint64 uncompressed_file_size = 11; | |||
uint64 range_tombstone_count = 12; | |||
BloomFilterType bloom_filter_kind = 13; | |||
uint64 estimated_sst_size = 14; // for compaction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need more detailed documentation to explain why we need this new field and the difference between estimated_sst_size
and file_size
src/frontend/src/catalog/system_catalog/rw_catalog/rw_hummock_version.rs
Outdated
Show resolved
Hide resolved
parent_group_id, | ||
table_ids, | ||
table_ids[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we assume the provided table_ids
always contain 1 element? If yes, we should change the method signature instead of making assumption here.
let split_key = { | ||
let mut vnode_index = vnode.to_index(); | ||
|
||
if vnode == VirtualNode::MAX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an issue in this PR but just a reminder: not all state table will exhaust the vnode space [0, VirutalNode::MAX]. IIRC, some executor will only use vnode 0 in its state table.
let member_table_ids = versioning | ||
.current_version | ||
.state_table_info | ||
.compaction_group_member_table_ids(parent_group_id); | ||
|
||
if !member_table_ids.contains(&TableId::new(table_id)) { | ||
return Err(Error::CompactionGroup(format!( | ||
"table {} doesn't in group {}", | ||
table_id, parent_group_id | ||
))); | ||
} | ||
|
||
let table_ids = member_table_ids | ||
.iter() | ||
.map(|table_id| table_id.table_id) | ||
.collect_vec(); | ||
|
||
let pos = table_ids.partition_point(|&id| id <= table_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits: member_table_ids
is a BTreeSet
. I think we don't need to turn it into a vec?
if pos == 0 { | ||
return Err(Error::CompactionGroup(format!( | ||
"Not to split ({}) all tables from group {}", | ||
table_id, parent_group_id | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this check is duplicated with the check in L661.
let member_table_ids_1 = state_table_info | ||
.compaction_group_member_table_ids(group_1) | ||
.iter() | ||
.cloned() | ||
.collect_vec(); | ||
|
||
let member_table_ids_2 = state_table_info | ||
.compaction_group_member_table_ids(group_2) | ||
.iter() | ||
.cloned() | ||
.collect_vec(); | ||
|
||
let mut combine_1 = member_table_ids_1.clone(); | ||
combine_1.extend_from_slice(&member_table_ids_2); | ||
|
||
let mut combine_2 = member_table_ids_2; | ||
combine_2.extend_from_slice(&member_table_ids_1); | ||
|
||
if !combine_1.is_sorted() && !combine_2.is_sorted() { | ||
return Err(Error::CompactionGroup(format!( | ||
"invalid merge group_1 {} group_2 {}", | ||
group_1, group_2 | ||
))); | ||
} | ||
|
||
let mut left_group_id = group_1; | ||
let mut right_group_id = group_2; | ||
let combine_member_table_ids = if combine_1.is_sorted() { | ||
combine_1 | ||
} else { | ||
std::mem::swap(&mut left_group_id, &mut right_group_id); | ||
combine_2 | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compaction_group_member_table_ids
returns a BTreeSet. The logic here can be simplified to:
let member_table_ids_1 = state_table_info.compaction_group_member_table_ids(group_1);
let member_table_ids_2 = state_table_info.compaction_group_member_table_ids(group_2);
assert!(!member_table_ids_1.is_empty());
assert!(!member_table_ids_2.is_empty());
let (left_group, right_group) = if member_table_ids_1.first().unwrap() < member_table_ids_2.first().unwrap() {
((group_1, member_table_ids_1), (group_2, member_table_ids_2))
} else {
((group_2, member_table_ids_2), (group_1, member_table_ids_1))
}
if left_group.1.last().unwrap() > right_group.1.first().unwrap() {
return Err(Error::CompactionGroup(format!(
"invalid merge group_1 {} group_2 {}",
left_group, right_group
)));
}
let (left_group_id, right_group_id) = (left_group.0, right_group.0);
let combine_member_table_ids = left_group.1.iter().chain(right_group.1.iter()).cloned().collect_vec();
…nto li0k/storage_split
…nto li0k/storage_split
…nto li0k/storage_split
…nto li0k/storage_split
…nto li0k/storage_split
…nto li0k/storage_split
…nto li0k/storage_split
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost finish the review. Let me post the comments I have so far first before I continue to add more.
src/frontend/src/catalog/system_catalog/rw_catalog/rw_hummock_version.rs
Outdated
Show resolved
Hide resolved
src/meta/src/hummock/manager/compaction/compaction_group_manager.rs
Outdated
Show resolved
Hide resolved
src/meta/src/hummock/manager/compaction/compaction_group_schedule.rs
Outdated
Show resolved
Hide resolved
src/storage/hummock_sdk/src/compaction_group/hummock_version_ext.rs
Outdated
Show resolved
Hide resolved
let sst_split_type = need_to_split(sst, split_key.clone()); | ||
match sst_split_type { | ||
SstSplitType::Left => { | ||
left_sst.push(sst.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking whether we should use filter_map
here when iterating level.table_infos
to avoid cloning SSTableInfo when the sst doesn't need to be split. We have seen performance issue when cloning SSTableInfo. Given that apply_version_delta
happens in the critical path of meta and CN operation, please be careful about SSTableInfo
clone in your PR and try to avoid it as much as possible.
// for backward-compatibility of previous hummock version delta | ||
BTreeSet::from_iter(group_construct.table_ids.clone()) | ||
}; | ||
|
||
self.init_with_parent_group( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove these codes, I think we may break backward compatibility if there is an old group delta for split constructed in old version (no split_key) but not yet applied. In this case, compaction group id in state_table_info
will be modified but actually the split doesn't happen.
Btw, I think the PR is larger than I expected. To reduce the risk of this PR, is it possible to further breaking it into small ones so that we can merge small PRs one by one. |
Thanks, let's review the code framework first before "ready to review" |
This PR has been open for 60 days with no activity. If it's blocked by code review, feel free to ping a reviewer or ask someone else to review it. If you think it is still relevant today, and have time to work on it in the near future, you can comment to update the status, or just manually remove the You can also confidently close this PR to keep our backlog clean. (If no further action taken, the PR will be automatically closed after 7 days. Sorry! 🙏) |
Close this PR as there's no further actions taken after it is marked as stale for 7 days. Sorry! 🙏 You can reopen it when you have time to continue working on it. |
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Previously, we implemented the concept of compaction groups for hummock to achieve write/configuration isolation and to create new compaction groups via the table-level move function.
Previously, we implemented the concept of compaction groups for hummock to achieve write/configuration isolation and to create new compaction groups via the table-level move function.
Since the sst key range generated by the move function is not contiguous, this makes it difficult to re-combine the resulting independent compaction groups (currently not supported). Because of this concern, we set the move threshold very high, fearing that the resulting compaciton group will not be able to be merged again, thus creating a large number of compaction groups.
This concern is contrary to our expectation that in scenarios with high write throughput (backfill, etc.), we would like to split the compaction to achieve parallel compaction, increase the overall compaction throughput, and make full use of the compactor node's scale capability.
Therefore, we redesigned hummock's scalability by implementing split / merge instead of move, in order to obtain the following capabilities.
For the current PR, we expect to implement table-level Split/Merge capability (phase 1), and plan to further implement Key-level Split in the subsequent PRs, and make CN not aware of the concept of group_id. Here are the changes involved in this PR for review.
I'll summarise the changes to the file, for your code reivew:
Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.