
feat: target build optimization for merge into #14066

Merged

Conversation

JackTan25
Contributor

@JackTan25 JackTan25 commented Dec 18, 2023

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR waits for #13950.
It introduces a new hash table with block info (BlockInfoIndex) for target-build merge into, which brings two benefits:

    1. We can reduce I/O when we run merge into operations.
    2. When multiple source rows match the same target row, we can return an error early (see the sketch after this summary).

Old implementation: the original PR embeds a pipeline diagram here; with the new approach we remove the row-id load path highlighted in red in that diagram.
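To illustrate benefit 2, here is a minimal, hypothetical sketch (not the PR's actual code) of an early multi-match check: if the probe side tracks how many source rows have hit each target row, a second hit can abort the merge immediately instead of waiting for the whole join to finish. The record_match function and the matched_counts slice are assumptions made for this sketch.

// Hypothetical sketch: detect "multiple rows matched" early while probing.
// `matched_counts[i]` counts how many source rows have matched target row `i`
// so far; the real PR tracks this conflict inside its hash-join path.
fn record_match(matched_counts: &mut [u32], target_row: usize) -> Result<(), String> {
    matched_counts[target_row] += 1;
    if matched_counts[target_row] > 1 {
        // Fail fast: MERGE INTO semantics forbid one target row matching
        // multiple source rows, so we can abort before the join completes.
        return Err(format!(
            "merge into: target row {target_row} matched by multiple source rows"
        ));
    }
    Ok(())
}

fn main() {
    let mut counts = vec![0u32; 4];
    assert!(record_match(&mut counts, 2).is_ok());
    // A second match on the same target row is rejected immediately.
    assert!(record_match(&mut counts, 2).is_err());
}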

The design and usage:
This index can only be used for target-build merge into (both standalone and distributed mode, but this PR supports standalone mode only).

Advantages:

  • Reduces redundant I/O operations, enhancing performance.
  • Lowers the maintenance overhead of deduplicating row_id (but in the distributed design we still need to provide the row_id).
  • Allows the subsequent mutation pipeline to be scheduled entirely for not-matched append operations.

Disadvantages:

  • This solution is likely to be a one-time approach (especially when not-matched insert operations are involved),
    potentially leaving the target table unsuitable for use as a build table in the future.
  • Requires a significant amount of memory to be efficient and currently does not support spill operations.
  • For now we only support SQL of the form:
    merge into t using source on xxx when matched then update xxx when not matched then insert xxx.

Future Enhancement

Other cases we still need to enhance:

  • if there are multiple matched clauses, we need to push up the conditions;
  • if there is only one matched update clause but it has a condition;
  • distributed mode: merging the offsets of unmatched and partially modified blocks, and constructing the pipeline.

Pr Design

  • We introduce a new component called BlockInfoIndex for the hash table. If spill is not triggered, we put all data in
    chunks. When the target table is the build side, we read all data blocks from the target table, and several blocks are merged into a larger chunk, so we get a layout like the one below:
/// the segment blocks are not sequential, because we build the hash table in parallel.
/// the block layout in chunks could be like below:
/// segment0_block1 |
/// segment1_block0 |  chunk0
/// segment0_block0 |
///
/// segment0_block3 |
/// segment1_block1 |  chunk1
/// segment2_block0 |
///
/// .........

So we use BlockInfoIndex to maintain an index for the block info in chunks.

pub struct BlockInfoIndex {
    // the intervals will be like below:
    // (0,10)(11,29),(30,38). it's ordered.
    intervals: Vec<Interval>,
    prefixs: Vec<u64>,
    length: usize,
}
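To make the structure more concrete, here is a hedged, self-contained sketch of how such an index could be populated and queried. The Interval alias, the BlockInfoIndexSketch name, and the insert_block_offsets/block_of methods are assumptions for illustration, not the PR's actual API.

// Hypothetical mirror of the BlockInfoIndex idea, for illustration only.
type Interval = (u64, u64); // inclusive [start, end] row offsets inside the chunks

#[derive(Default)]
struct BlockInfoIndexSketch {
    // ordered, non-overlapping intervals, e.g. (0,10)(11,29)(30,38)
    intervals: Vec<Interval>,
    // one prefix per interval, identifying the original (segment, block)
    prefixes: Vec<u64>,
}

impl BlockInfoIndexSketch {
    // Append one block's row range as it is copied into the chunks.
    fn insert_block_offsets(&mut self, interval: Interval, prefix: u64) {
        // offsets are assigned in increasing order, so the intervals stay sorted
        debug_assert!(self.intervals.last().map_or(true, |last| last.1 < interval.0));
        self.intervals.push(interval);
        self.prefixes.push(prefix);
    }

    // Binary search: which original block does a global chunk offset belong to?
    fn block_of(&self, offset: u64) -> Option<u64> {
        let idx = self.intervals.partition_point(|&(_, end)| end < offset);
        self.intervals
            .get(idx)
            .filter(|&&(start, end)| start <= offset && offset <= end)
            .map(|_| self.prefixes[idx])
    }
}

With an index like this, a matched position in the chunks can be mapped back to its original block without carrying a row_id column through the join, which is where the reduced I/O described above comes from.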

And we can use BlockInfoIndex to quickly find the partially modified blocks:

    /// we do a binary search to get the partially modified offsets.
    /// we return the Interval and prefix. For example:
    /// intervals: (0,10)(11,22),(23,40)(41,55)
    /// interval: (8,27)
    /// we will give (8,10),(23,27); we don't give (11,22), because it is fully updated.
    /// case1: |-----|------|------|
    ///            |-----------|
    /// case2: |-----|------|------|
    ///              |------|
    /// case3: |-----|------|------|
    ///                |--|
    /// case4: |-----|------|------|
    ///              |--------|    
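The four cases above reduce to interval intersection. Below is a hedged sketch of that logic; the function name, signature, and linear scan are assumptions for illustration (the PR's comment describes a binary search). For each block interval that only partially overlaps the modified range we return the clipped range together with the block's prefix, and we skip blocks that are fully covered.

type Interval = (u64, u64); // inclusive [start, end]

// Hypothetical illustration of "gather partial modified offsets": for a
// modified range `hit`, return (clipped range, prefix) for every block
// interval that is only partially covered; fully covered blocks are skipped
// because they are fully updated and need no partial handling.
fn gather_partial_modified(
    intervals: &[Interval],
    prefixes: &[u64],
    hit: Interval,
) -> Vec<(Interval, u64)> {
    let mut result = Vec::new();
    for (i, &(start, end)) in intervals.iter().enumerate() {
        // no overlap with this block
        if end < hit.0 || start > hit.1 {
            continue;
        }
        // fully covered block: nothing partial to report
        if hit.0 <= start && end <= hit.1 {
            continue;
        }
        // partial overlap: clip the modified range to this block
        result.push(((hit.0.max(start), hit.1.min(end)), prefixes[i]));
    }
    result
}

fn main() {
    // intervals: (0,10)(11,22)(23,40)(41,55); modified range: (8,27)
    let intervals = [(0, 10), (11, 22), (23, 40), (41, 55)];
    let prefixes = [0, 1, 2, 3];
    // (11,22) is fully covered, so only (8,10) and (23,27) are returned.
    assert_eq!(
        gather_partial_modified(&intervals, &prefixes, (8, 27)),
        vec![((8, 10), 0), ((23, 27), 2)]
    );
}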

Optimizations tracking: #12595

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Dec 18, 2023
@JackTan25 JackTan25 marked this pull request as draft December 18, 2023 12:36
@JackTan25 JackTan25 changed the title to feat: new hashtable with blkinfo for target build merge into Dec 18, 2023
@JackTan25 JackTan25 closed this Dec 31, 2023
@JackTan25 JackTan25 force-pushed the new_hashtable_with_blkinfo_for_target_build branch from 8ad47c7 to 16b0e53 Compare December 31, 2023 18:04
@JackTan25 JackTan25 reopened this Jan 2, 2024
@Dousir9 Dousir9 self-requested a review January 9, 2024 11:04
@JackTan25 JackTan25 requested a review from Dousir9 January 17, 2024 09:00
@Dousir9
Copy link
Member

Dousir9 commented Jan 17, 2024

The part about hash join looks good to me.

@JackTan25 JackTan25 added this pull request to the merge queue Jan 18, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Jan 18, 2024
@BohuTANG BohuTANG merged commit f49de87 into databendlabs:main Jan 18, 2024
71 checks passed
Xuanwo pushed a commit to Xuanwo/databend that referenced this pull request Jan 19, 2024
* init blockinfo hashtable

* add some comments

* add more comments for hash_table interface

* add merge_into_join_type info and block_info index

* add block info hashtable basic implementation

* fix typos

* add RowPrefix for native_deserialize and parquet_deserialize

* fix lint

* add gather_partial_modified and reduce_false_matched

* refactor: remove block info hashtable and build blockinfo index outside, add check multirows conflict

* fix blockinfo index

* gather partial modified blocks and fix lint

* remove rowid when use target table as build side

* support target_build_optimization for merge into pipeline in standalone mode

* add more tests, enhance explain merge into, and fix adding merge status when target table build optimization is triggered

* add probe done output logic and add more tests

* add one chunk ut test for block_info_index

* fix test result

* add more comments for merge into strategies, and fix rowid read

* fix test

* fix split

* fix block_info_index init, matched offsets update and add target_table_schema for partial unmodified blocks to append directly, add probe attach for target_build_optimization, fix merge into explain update order

* fix all matched delete for target build optimization

* fix test

* add info log

* add logs

* add debug logs

* add debug logs

* fix lint

* forbid native engine for target build optimization

* add logs

* add more log

* add debug log

* fix multi chunks start offset and add skip chunk ut test

* support receiving duplicated blocks for matched_mutator

* move logic code

* fix flaky matched and fix offset for pointer (chunk_offsets shouldn't be decremented by one)

* add merge_state

* refactor code

* add more comments

* refactor code, split merge into optimization code into other files

* remove a.txt

* fix check

* chore: modify function name

* rename variables with merge_into prefix

* rename function

* move merge_into_try_build_block_info_index to front
Labels
  • ci-benchmark: Benchmark: run all test
  • ci-cloud: Build docker image for cloud test
  • pr-feature: this PR introduces a new feature to the codebase