feat(query): new filter execution framework #13846

Merged on Dec 29, 2023 (88 commits)

@Dousir9 Dousir9 commented Nov 29, 2023

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

How Databend Filter was implemented in the past

Previously, Databend executed Filter through the Evaluator. For the SQL select * from t where a = 3 and (b < 7.2 or b > 12.6) and c < 6;, the Evaluator processes each predicate (a = 3, b < 7.2, b > 12.6, c < 6) and generates a bitmap for each one. These bitmaps are then combined pairwise with & and | according to the AND and OR operators, producing a final bitmap called filter. Finally, the Evaluator passes filter to filter_boolean_value of the DataBlock to generate the filtered DataBlock.
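To make the bitmap-per-predicate flow concrete, here is a minimal sketch in plain Rust. It is not the Evaluator's code: `eval_predicate`, `and`, `or`, `build_filter`, and `apply_filter` are hypothetical names, `Vec<bool>` stands in for a real bitmap, `apply_filter` only plays the role of filter_boolean_value, and nullable columns are ignored.

```rust
/// Evaluate one predicate over a whole column, allocating a fresh "bitmap".
/// (Hypothetical helper; a Vec<bool> stands in for a packed bitmap.)
fn eval_predicate<T>(col: &[T], pred: impl Fn(&T) -> bool) -> Vec<bool> {
    col.iter().map(pred).collect()
}

/// Combine two bitmaps with `&`, allocating another bitmap for the result.
fn and(x: &[bool], y: &[bool]) -> Vec<bool> {
    x.iter().zip(y).map(|(p, q)| *p & *q).collect()
}

/// Combine two bitmaps with `|`, allocating another bitmap for the result.
fn or(x: &[bool], y: &[bool]) -> Vec<bool> {
    x.iter().zip(y).map(|(p, q)| *p | *q).collect()
}

/// where a = 3 and (b < 7.2 or b > 12.6) and c < 6
/// Every predicate scans all rows, even rows another predicate already rejected.
fn build_filter(a: &[i64], b: &[f64], c: &[i64]) -> Vec<bool> {
    let a_eq = eval_predicate(a, |v| *v == 3);
    let b_lt = eval_predicate(b, |v| *v < 7.2);
    let b_gt = eval_predicate(b, |v| *v > 12.6);
    let c_lt = eval_predicate(c, |v| *v < 6);
    let b_or = or(&b_lt, &b_gt);
    and(&and(&a_eq, &b_or), &c_lt)
}

/// Stand-in for applying the final filter bitmap to one column of the block.
fn apply_filter<T: Copy>(col: &[T], filter: &[bool]) -> Vec<T> {
    col.iter()
        .zip(filter)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

Even in this simplified form, one block produces four predicate bitmaps plus three combined ones, all of which are dropped as soon as the filtered block has been built.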

Disadvantages of the old implementation

  1. Frequent construction and destruction of bitmaps leads to significant memory fragmentation. Each comparison operator, such as a = 3, generates a bitmap for its result, and if the column is nullable, an additional bitmap is generated for validity. For a where condition like a = 3 and (b < 7.2 or b > 12.6) and c < 6, the Evaluator therefore generates up to 8 bitmaps (4 predicates, each with a result bitmap and a validity bitmap) while executing a single DataBlock.

  2. Executing the filter predicates independently is inefficient: if a row in a DataBlock has already been filtered out by one predicate, the other predicates still evaluate it again, which wastes CPU.

  3. Using the filter bitmap to invoke filter_boolean_value on the DataBlock is not always the best way to generate the filtered DataBlock.

New Filter Execution Framework

We introduce a new concept that we call the "Immutable Index". By combining the Immutable Index with the SelectStrategy, we have addressed the drawbacks of the DuckDB Filter! 🚀 The Immutable Index lets us avoid generating a temporary selection buffer when encountering AND and OR operations. This not only reduces memory fragmentation but also eliminates the repeated copying from the temporary selection to the final selection.
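The general idea of filtering through reusable selection buffers can be sketched as follows. This is only an illustration under stated assumptions, not the PR's implementation: `select_rows` is a hypothetical function, the buffers are plain `Vec<u32>` row-index lists, the Immutable Index and SelectStrategy themselves are not modeled, and the final sort is just one simple way to restore row order after an OR.

```rust
/// where a = 3 and (b < 7.2 or b > 12.6) and c < 6, evaluated with two
/// caller-owned selection buffers that are reused across blocks.
/// On return, `true_sel` holds the surviving row indices in ascending order.
fn select_rows(
    a: &[i64],
    b: &[f64],
    c: &[i64],
    true_sel: &mut Vec<u32>,
    false_sel: &mut Vec<u32>,
) {
    // Start with every row selected; clear() keeps the allocated capacity.
    true_sel.clear();
    true_sel.extend(0..a.len() as u32);

    // AND: narrow the selection in place, so later predicates see fewer rows.
    true_sel.retain(|&r| a[r as usize] == 3);

    // OR, first branch: rows that pass stay in true_sel (compacted in place),
    // rows that fail are parked in false_sel for the second branch.
    false_sel.clear();
    let mut kept = 0;
    for i in 0..true_sel.len() {
        let r = true_sel[i];
        if b[r as usize] < 7.2 {
            true_sel[kept] = r;
            kept += 1;
        } else {
            false_sel.push(r);
        }
    }
    true_sel.truncate(kept);

    // OR, second branch: only the rows that failed the first branch are tested.
    for &r in false_sel.iter() {
        if b[r as usize] > 12.6 {
            true_sel.push(r);
        }
    }
    // Assumption of this sketch: restore ascending row order by sorting.
    true_sel.sort_unstable();

    // AND: the last predicate only touches rows that are still selected.
    true_sel.retain(|&r| c[r as usize] < 6);
}
```

In this sketch each row is tested by a predicate only while it is still a candidate, and calling `select_rows` again with the same two buffers on a block of the same size does not allocate.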

  1. The new filter execution framework avoids memory allocation by using reusable true_selection and false_selection buffers instead of bitmaps (false_selection is only generated when OR predicates are present).

  2. The execution of predicates is dynamically linked through true_selection and false_selection, ensuring that each row in the DataBlock is filtered only once. This significantly improves performance (TPC-H Q12, Q18) and handles complex predicates effectively.

  3. A heuristic strategy dynamically chooses between take and take_range to generate the DataBlock, which is more efficient than using the filter bitmap to invoke filter_boolean_value on the DataBlock.
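Point 3 can be illustrated with a small sketch of choosing between an index gather and contiguous range copies. The PR description does not state the actual heuristic, so everything here is an assumption made for illustration: the function names, the `MIN_AVG_RUN` threshold of 4, and the rule of comparing the average run length against it. The selection is assumed to be sorted in ascending order.

```rust
/// Hypothetical threshold: prefer range copies once runs average 4+ rows.
const MIN_AVG_RUN: usize = 4;

/// Collapse an ascending selection into (start, len) runs of adjacent rows.
fn selection_to_ranges(sel: &[u32]) -> Vec<(usize, usize)> {
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    for &row in sel {
        let row = row as usize;
        if let Some(last) = ranges.last_mut() {
            if last.0 + last.1 == row {
                // The row directly follows the current run: extend it.
                last.1 += 1;
                continue;
            }
        }
        ranges.push((row, 1));
    }
    ranges
}

/// Gather rows one index at a time (good for scattered selections).
fn take<T: Copy>(col: &[T], sel: &[u32]) -> Vec<T> {
    sel.iter().map(|&r| col[r as usize]).collect()
}

/// Copy whole runs with slice copies (good for long contiguous selections).
fn take_ranges<T: Copy>(col: &[T], ranges: &[(usize, usize)]) -> Vec<T> {
    let total: usize = ranges.iter().map(|r| r.1).sum();
    let mut out = Vec::with_capacity(total);
    for &(start, len) in ranges {
        out.extend_from_slice(&col[start..start + len]);
    }
    out
}

/// Pick a strategy from the shape of the selection; both give the same rows.
fn materialize<T: Copy>(col: &[T], sel: &[u32]) -> Vec<T> {
    let ranges = selection_to_ranges(sel);
    if !ranges.is_empty() && sel.len() / ranges.len() >= MIN_AVG_RUN {
        take_ranges(col, &ranges)
    } else {
        take(col, sel)
    }
}
```

Both paths return identical data; the choice only affects how the copy is performed, which is why such a decision can be made per block from the selection alone.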

Benchmark

Q19 External Parquet: 14.5s (before) -> 9.7s (after)

  • Closes #issue

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Nov 29, 2023
@BohuTANG BohuTANG removed the ci-benchmark Benchmark: run all test label Dec 29, 2023
@BohuTANG

Conflicting files
src/query/storages/fuse/src/operations/read/native_data_source_deserializer.rs

@BohuTANG BohuTANG merged commit 5ee08a3 into databendlabs:main Dec 29, 2023
70 checks passed
@Dousir9 Dousir9 deleted the improve_filter_execution branch December 29, 2023 11:59
5 participants