feat: target build optimization for merge into (databendlabs#14066)
* init blockinfo hashtable

* add some comments

* add more comments for hash_table interface

* add merge_into_join_type info and block_info index

* add block info hashtable basic implementation

* fix typos

* add RowPrefix for native_deserialize and parquet_deserialize

* fix lint

* add gather_partial_modified and reduce_false_matched

* refactor: remove block info hashtable and build blockinfo index outside, add multi-row conflict check

* fix blockinfo index

* gather partial modified blocks and fix lint

* remove rowid when use target table as build side

* support target_build_optimization for merge into pipeline in standalone mode

* add more tests, enhance explain merge into, and fix adding merge status when target table build optimization is triggered

* add probe done output logic and add more tests

* add one chunk ut test for block_info_index

* fix test result

* add more comments for merge into strategies, and fix rowid read

* fix test

* fix split

* fix block_info_index init and matched offsets update, add target_table_schema so partial unmodified blocks can be appended directly, add probe attach for target_build_optimization, fix merge into explain update order

* fix all matched delete for target build optimization

* fix test

* add info log

* add logs

* add debug logs

* add debug logs

* fix lint

* forbid native engine for target build optimization

* add logs

* add more log

* add debug log

* fix multi chunks start offset and add skip chunk ut test

* support receiving duplicated blocks for matched_mutator

* move logic code

* fix flaky matched and fix offset for pointer (chunk_offsets shouldn't subtract one)

* add merge_state

* refactor codes

* add more comments

* refactor code, split merge into optimization code into other files

* remove a.txt

* fix check

* chore: modify function name

* rename variables with merge_into prefix

* rename function

* move merge_into_try_build_block_info_index to front
JackTan25 authored and Xuanwo committed Jan 19, 2024
1 parent 5b24fb0 commit dd26541
Showing 58 changed files with 1,875 additions and 109 deletions.
10 changes: 6 additions & 4 deletions src/common/hashtable/src/lib.rs
@@ -23,18 +23,18 @@
extern crate core;

mod container;
mod dictionary_string_hashtable;

mod hashjoin_hashtable;
mod hashjoin_string_hashtable;
mod hashtable;
mod keys_ref;
mod lookup_hashtable;
mod stack_hashtable;
mod table0;

mod dictionary_string_hashtable;
mod partitioned_hashtable;
mod short_string_hashtable;
mod stack_hashtable;
mod string_hashtable;
mod table0;
#[allow(dead_code)]
mod table1;
mod table_empty;
@@ -113,3 +113,5 @@ pub use partitioned_hashtable::hash2bucket;
pub type HashJoinHashMap<K> = hashjoin_hashtable::HashJoinHashTable<K>;
pub type BinaryHashJoinHashMap = hashjoin_string_hashtable::HashJoinStringHashTable;
pub use traits::HashJoinHashtableLike;
pub use utils::Interval;
pub use utils::MergeIntoBlockInfoIndex;
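The new `Interval` and `MergeIntoBlockInfoIndex` exports point at a structure that maps global row offsets back to the blocks they came from, which is what lets the commit gather partially modified blocks and append unmodified ones directly. A minimal sketch of that idea follows; `BlockInfoIndex`, `insert_block`, and `find_block` are hypothetical names for illustration, not the crate's actual API:

```rust
// Hypothetical sketch: map global row offsets back to their source block.
// Blocks are registered in order, so the interval list is sorted by start
// and a binary search suffices. Assumes each block has at least one row.
type Interval = (u64, u64); // inclusive (start, end) over global row ids

struct BlockInfoIndex {
    intervals: Vec<Interval>, // one interval per block, sorted by start
    block_ids: Vec<u64>,
}

impl BlockInfoIndex {
    fn new() -> Self {
        Self { intervals: Vec::new(), block_ids: Vec::new() }
    }

    /// Register the next block's row range; `rows` is the block's row count (> 0).
    fn insert_block(&mut self, block_id: u64, rows: u64) {
        let start = self.intervals.last().map_or(0, |&(_, end)| end + 1);
        self.intervals.push((start, start + rows - 1));
        self.block_ids.push(block_id);
    }

    /// Find which block a global row offset belongs to.
    fn find_block(&self, row: u64) -> Option<u64> {
        // Index of the last interval whose start is <= row.
        let idx = self
            .intervals
            .partition_point(|&(start, _)| start <= row)
            .checked_sub(1)?;
        let (start, end) = self.intervals[idx];
        (start <= row && row <= end).then(|| self.block_ids[idx])
    }
}
```

With matched row ids looked up this way, a block whose whole interval is matched can be deleted outright, while a partially matched one is rewritten.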
24 changes: 22 additions & 2 deletions src/common/hashtable/src/traits.rs
@@ -13,9 +13,8 @@
// limitations under the License.

// To avoid RUSTFLAGS="-C target-feature=+sse4.2" warning.
#![allow(unused_imports)]

use std::hash::BuildHasher;
use std::hash::Hasher;
use std::iter::TrustedLen;
use std::mem::MaybeUninit;
use std::num::NonZeroU64;
@@ -508,21 +507,42 @@ pub trait HashJoinHashtableLike {
type Key: ?Sized;

// Uses hashes to probe the hash table, converting them in-place to pointers for memory reuse.
// Same as `early_filtering_probe`, but without the early filter.
fn probe(&self, hashes: &mut [u64], bitmap: Option<Bitmap>) -> usize;

// Uses hashes to probe the hash table, converting them in-place to pointers for memory reuse.
// 1. Same as `early_filtering_probe_with_selection`, but without a selection to preserve the
// unfiltered indexes; filtered hashes are simply set to zero.
// 2. Returns the unfiltered count.
fn early_filtering_probe(&self, hashes: &mut [u64], bitmap: Option<Bitmap>) -> usize;

// Uses hashes to probe the hash table, converting them in-place to pointers for memory reuse.
// `early_filtering_probe_with_selection` performs the first-round probe.
// 1. `hashes` holds the hash values of the probe block's rows, used for early
// filtering. If a row (at idx) cannot be early-filtered, the bucket's pointer is
// assigned to hashes[idx] to reuse the memory.
// 2. `selection` preserves the indexes which can't be early-filtered.
// 3. Returns the count of preserved indexes in `selection`.
fn early_filtering_probe_with_selection(
&self,
hashes: &mut [u64],
valids: Option<Bitmap>,
selection: &mut [u32],
) -> usize;
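As a rough illustration of the selection contract just described, here is a simplified standalone sketch. The `may_contain` predicate stands in for the table's real early filter (which consults its internal bitmaps), and the `valids` bitmap argument is omitted; in the real method the surviving slots are overwritten with entry pointers rather than left as hashes:

```rust
// Simplified model of `early_filtering_probe_with_selection`: rows whose hash
// fails the early filter get their hash zeroed; surviving row indexes are
// written to the front of `selection`, and the survivor count is returned.
fn early_filtering_probe_with_selection(
    hashes: &mut [u64],
    selection: &mut [u32],
    may_contain: impl Fn(u64) -> bool, // stand-in for the table's early filter
) -> usize {
    let mut count = 0;
    for (idx, hash) in hashes.iter_mut().enumerate() {
        if may_contain(*hash) {
            // In the real table, hashes[idx] is overwritten here with the
            // bucket's entry pointer to reuse the slot's memory.
            selection[count] = idx as u32;
            count += 1;
        } else {
            *hash = 0;
        }
    }
    count
}
```

The caller then probes only the `count` rows named in `selection`, skipping the rest of the block entirely.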

// `next_contains` checks whether a matched row can be found in the chain.
// `ptr` is the chain's head.

/// 1. `key` is the serialized build key from one row.
/// 2. `ptr` points to the `RawEntry` of the bucket correlated to `key`, so a round of
///    probing must be done before calling this method. If `ptr` is zero, there is no
///    correlated bucket for `key`.
/// 3. `vec_ptr` is a `RowPtr` array used to record the matched rows in chunks.
/// 4. `occupied` is the length of `vec_ptr`.
/// 5. `capacity` is the capacity of `vec_ptr`.
/// 6. Returns the matched row count and the next ptr to test in the future:
///    if the capacity is enough, the next ptr is zero; otherwise it is valid.
fn next_probe(
&self,
key: &Self::Key,
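The `next_probe` contract described above — walk a bucket's chain, append matches to a bounded buffer, and hand back a resume pointer when the buffer fills — can be sketched as a standalone model. `Entry`, `ChainTable`, and the `next` link encoding are illustrative assumptions, not the crate's actual layout:

```rust
// Simplified model of the `next_probe` contract: follow a bucket's chain from
// `ptr`, append matching rows to a capacity-bounded buffer, and return the
// chain position to resume from if the buffer fills up.
struct Entry {
    key: u64,
    row: u64,    // stand-in for a RowPtr
    next: usize, // 0 = end of chain; otherwise index + 1 into `entries`
}

struct ChainTable {
    entries: Vec<Entry>,
}

impl ChainTable {
    /// Returns (matched rows appended, resume pointer; 0 when the chain is done).
    fn next_probe(
        &self,
        key: u64,
        mut ptr: usize,
        out: &mut Vec<u64>,
        capacity: usize,
    ) -> (usize, usize) {
        let mut matched = 0;
        while ptr != 0 {
            let entry = &self.entries[ptr - 1];
            if entry.key == key {
                if out.len() == capacity {
                    // Buffer full: return a non-zero pointer so the caller can
                    // resume the walk later, as the trait doc specifies.
                    return (matched, ptr);
                }
                out.push(entry.row);
                matched += 1;
            }
            ptr = entry.next;
        }
        (matched, 0) // chain exhausted: resume pointer is zero
    }
}
```

The caller loops, flushing the buffer between calls, until the returned pointer is zero.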
