[GLUTEN-7028][CH][Part-1] Using PushingPipelineExecutor to write merge tree #7029

Merged
baibaichen merged 41 commits into apache:main from the feature/one_pipeline branch on Sep 6, 2024

baibaichen
Contributor

@baibaichen baibaichen commented Aug 27, 2024

What changes were proposed in this pull request?

This PR refactors SparkMergeTreeWriter to write MergeTree through PushingPipelineExecutor instead of hand-written code. Previously, SparkMergeTreeWriter performed four different tasks:

  1. Using DB::Squashing to merge blocks into bigger blocks. This is now done by PlanSquashingTransform and ApplySquashingTransform.
  2. Serialization. This is now done by SparkMergeTreeDataWriter.
  3. Merging smaller parts into bigger parts and finally copying them to remote storage. This is now done by SinkHelper and its derived classes.
  4. Connecting the steps above together. This is now done by SparkMergeTreeSink and PushingPipelineExecutor.
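To make the division of labor above concrete, here is a minimal, self-contained sketch of a push-style write path. It is not the code from this PR: `Block`, `Stage`, `SquashingStage`, `CollectingSink`, `PushingExecutor` and `write_blocks` are hypothetical stand-ins, and the row-count threshold is an assumption chosen only for illustration. The real implementation operates on ClickHouse blocks and transforms.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a block of rows (hypothetical; the real code uses ClickHouse blocks).
using Block = std::vector<int>;

// Hypothetical stage interface: a stage consumes blocks pushed into it.
class Stage
{
public:
    virtual ~Stage() = default;
    virtual void consume(Block block) = 0;
    virtual void finish() = 0;
};

// Last stage: each received block stands for one written part.
class CollectingSink : public Stage
{
public:
    void consume(Block block) override { parts.push_back(std::move(block)); }
    void finish() override {}
    std::vector<Block> parts;
};

// Squashing stage: buffers small blocks and forwards one bigger block
// once at least min_rows rows have accumulated (assumed policy).
class SquashingStage : public Stage
{
public:
    SquashingStage(std::size_t min_rows_, Stage & next_) : min_rows(min_rows_), next(next_)
    {
        assert(min_rows > 0);
    }

    void consume(Block block) override
    {
        pending.insert(pending.end(), block.begin(), block.end());
        if (pending.size() >= min_rows)
            flush();
    }

    void finish() override
    {
        if (!pending.empty())
            flush(); // emit the remaining rows as a final, smaller block
        next.finish();
    }

private:
    void flush()
    {
        Block out;
        out.swap(pending); // pending is left empty and reusable
        next.consume(std::move(out));
    }

    std::size_t min_rows;
    Stage & next;
    Block pending;
};

// The caller pushes blocks into the head of the chain instead of pulling from a source.
class PushingExecutor
{
public:
    explicit PushingExecutor(Stage & head_) : head(head_) {}
    void push(Block block) { head.consume(std::move(block)); }
    void finish() { head.finish(); }

private:
    Stage & head;
};

// Wires squashing -> sink, pushes all input blocks, and returns the row count of each part.
std::vector<std::size_t> write_blocks(const std::vector<Block> & input, std::size_t min_rows)
{
    CollectingSink sink;
    SquashingStage squashing(min_rows, sink);
    PushingExecutor executor(squashing);
    for (const auto & block : input)
        executor.push(block);
    executor.finish();

    std::vector<std::size_t> sizes;
    for (const auto & part : sink.parts)
        sizes.push_back(part.size());
    return sizes;
}
```

In this sketch the writer only pushes blocks and calls `finish()`; squashing and the final write are separate stages, which mirrors the split described in the list above.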

The current workflow looks like this:

[image: workflow diagram]

We did this work for two reasons:

  1. Spark 3.4 introduced WriteFilesExec, so we can now write Parquet and ORC in one native pipeline without modifying Spark source code; see [GLUTEN-6067][CH] [Part 3-2] Basic support for Native Write in Spark 3.5 #6586.
  2. In Spark 3.2 and 3.3, writing Parquet and ORC also uses PushingPipelineExecutor.

After this PR, we can unify writing of all formats across Spark 3.2, 3.3 and 3.5.

Other refactoring:

  1. Fix a typo: rename Storages/Mergetree to Storages/MergeTree
  2. Move MergeTree-related code into Storages/MergeTree
  3. Introduce GlutenSettings.h to simplify reading settings
  4. Rename CustomStorageMergeTree.h/.cpp to SparkStorageMergeTree.h/.cpp
    1. Rename CustomStorageMergeTree to SparkStorageMergeTree
    2. Add SparkWriteStorageMergeTree and implement its write method to create SparkMergeTreeSink
  5. Rename MergeTreeTool.h/.cpp to SparkMergeTreeMeta.h/.cpp
    1. Create MergeTreeTableInstance, which inherits from MergeTreeTable
    2. Move metadata-related code from MergeTreeRelParser into these two classes
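The relationship between the storage classes in item 4 can be pictured roughly as follows. This is only a sketch under stated assumptions: the `...Sketch` class names, the `Sink` interface, the `name()` method and the behavior of the base class's `write()` (returning no sink) are all hypothetical and are not taken from the PR's code. The only part grounded in the description is that the write-capable storage derives from the base storage and that its `write` method creates the sink.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal sink interface, used only for this illustration.
class Sink
{
public:
    virtual ~Sink() = default;
    virtual std::string name() const = 0;
};

// Stand-in for SparkMergeTreeSink: remembers which table it writes to.
class SparkMergeTreeSinkSketch : public Sink
{
public:
    explicit SparkMergeTreeSinkSketch(std::string table_) : table(std::move(table_)) {}
    std::string name() const override { return "SparkMergeTreeSink(" + table + ")"; }

private:
    std::string table;
};

// Stand-in for SparkStorageMergeTree. Assumption: the base storage does not
// provide a sink, which is modeled here by returning nullptr.
class SparkStorageMergeTreeSketch
{
public:
    explicit SparkStorageMergeTreeSketch(std::string table_) : table_name(std::move(table_))
    {
        assert(!table_name.empty());
    }
    virtual ~SparkStorageMergeTreeSketch() = default;

    virtual std::unique_ptr<Sink> write() const { return nullptr; }

protected:
    std::string table_name;
};

// Stand-in for SparkWriteStorageMergeTree: overrides write() to create the sink.
class SparkWriteStorageMergeTreeSketch : public SparkStorageMergeTreeSketch
{
public:
    using SparkStorageMergeTreeSketch::SparkStorageMergeTreeSketch;

    std::unique_ptr<Sink> write() const override
    {
        return std::make_unique<SparkMergeTreeSinkSketch>(table_name);
    }
};
```

Keeping sink creation in a dedicated subclass means code that only reads from the storage never needs to know about the write path.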

(Fixes: #7028)

How was this patch tested?

Using existing tests.

#7028

Run Gluten Clickhouse CI

@lgbo-ustc
Contributor

LGTM

github-actions bot commented Sep 2, 2024

Run Gluten Clickhouse CI


@loneylee
Member

loneylee commented Sep 6, 2024

LGTM

@baibaichen baibaichen merged commit 05f54f1 into apache:main Sep 6, 2024
9 checks passed
@baibaichen baibaichen deleted the feature/one_pipeline branch September 6, 2024 03:08
baibaichen added a commit to Kyligence/gluten that referenced this pull request Sep 6, 2024
baibaichen added a commit that referenced this pull request Sep 6, 2024
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240906)

* Fix build due to ClickHouse/ClickHouse#65832

* Fix UT due to ClickHouse/ClickHouse#65832

* Fix conflict with #7122

* Fix conflict with #7029

* Run GlutenClickHouseMergeTreeCacheDataSSuite locally

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang Chen <[email protected]>
dcoliversun pushed a commit to dcoliversun/gluten that referenced this pull request Sep 11, 2024
…rge tree (apache#7029)

* 1. Rename Storages/Mergetree to Storages/MergeTree
2. Move MergeTreeTool.cpp/.h from Common to Storages/MergeTree
3. Move CustomStorageMergeTree.cpp/.h and StorageMergeTreeFactory.cpp/.h to MergeTree folder
4. Add CustomMergeTreeDataWriter
5. Remove TempStorageFreer
6. Add SubstraitParserUtils

* Make query_map_ as QueryContextManager member

* EMBEDDED_PLAN and create_plan_and_executor

* minor refactor

* tmp

* SparkStorageMergeTree
CustomMergeTreeDataWriter => SparkMergeTreeDataWriter

* Add SparkMergeTreeSink

* use SparkStorageMergeTree and SparkMergeTreeSink

* Introduce GlutenSettings.h

* GlutenMergeTreeWriteSettings

* Fix Test Build

* typo

* ContextPtr => const ContextPtr &

* minor refactor

* fix style

* using GlutenMergeTreeWriteSettings

* [TMP] GlutenMergeTreeWriteSettings refactor

* [TMP] StorageMergeTreeWrapper

* [TMP] StorageMergeTreeWrapper::commitPartToRemoteStorageIfNeeded

* [TMP] StorageMergeTreeWrapper::saveMetadata

* move thread pool

* tmp

* rename

* move to sparkmergetreesink.h/cpp

* MergeTreeTableInstance

* sameStructWith => sameTable

* parseStorageAndRestore => restoreStorage
parseStorage => getStorage

* Sink with MergeTreeTable table;

* remove SparkMergeTreeWriter::writeTempPartAndFinalize

* refactor SinkHelper::writeTempPart

* Remove write_setting of SparkMergeTreeWriter

* SparkMergeTreeWriter using PushingPipelineExecutor

* SparkMergeTreeWriteSettings

* tmp

* GlutenMergeTreeWriteSettings => SparkMergeTreeWriteSettings

* make CustomStorageMergeTree constructor protected

* MergeTreeTool.cpp/.h => SparkMergeTreeMeta.cpp/.h

* CustomStorageMergeTree.cpp/.h => SparkStorageMergeTree.cpp/.h

* CustomStorageMergeTree => SparkStorageMergeTree
SparkStorageMergeTree => SparkWriteStorageMergeTree

* Refactor move codes from MergeTreeRelParser to MergeTreeTable and MergeTreeTableInstance

* Refactor Make static member to normal member
dcoliversun pushed a commit to dcoliversun/gluten that referenced this pull request Sep 11, 2024
)

sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
…rge tree (apache#7029)

sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
)

Successfully merging this pull request may close these issues.

[CH] Fully Support writing parquet and mergetree in spark 3.5.x with delta protocol