Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: new impl for loading CSV. #14645

Merged
merged 10 commits into from
Feb 20, 2024
Merged

Conversation

youngsofun
Copy link
Member

@youngsofun youngsofun commented Feb 8, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

the original InputContex is intended to include all kinds of file formats, and both streaming load and copy, which turn to be too complicated to maintain, and there are many differences in nature between parquet and text files.
this pr introduce new pipeline for text, row-based formats, and force on copy.
the code is mainly in databend-common-storages-stage::read::row_based
CSV format is refactored with this new pipeline firstly.

logically this is a refactor, but at code level many codes are rewritten (some details are refined by the way). So Add new setting enable_new_copy_for_text_formats.
if there are some problem with new impl, user can fallback to the old one.
because there are already many tests for CSV, I set enable_new_copy_for_text_formats=1 by default.

  • Fixes #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Feb 8, 2024
@youngsofun youngsofun force-pushed the text branch 2 times, most recently from 19a2806 to cd0b3f6 Compare February 8, 2024 02:03
@youngsofun youngsofun marked this pull request as draft February 8, 2024 02:10
@youngsofun youngsofun force-pushed the text branch 3 times, most recently from 63aa4d6 to a0a696d Compare February 8, 2024 03:02
@youngsofun youngsofun marked this pull request as ready for review February 8, 2024 03:27
@youngsofun
Copy link
Member Author

youngsofun commented Feb 18, 2024

@b41sh thank you very much! please review again (take your time, this PR is not in hurry).

@sundy-li
Copy link
Member

sundy-li commented Feb 19, 2024

Do you have any performance results with the old impl ?

You can use copy into table with null engine to test the performance.

@youngsofun
Copy link
Member Author

Do you have any performance results with the old impl ?

You can use copy into table with null engine to test the performance.

ok, I will try it later.

as a refactor, the only place that may lead to diff in performance is the reading of file data (less batch buffered).
If there really is a difference in this aspect, it would only become apparent when running on slower storage.

@youngsofun
Copy link
Member Author

youngsofun commented Feb 20, 2024

@sundy-li

A preliminary test indicates negligible performance difference (less than 1%).

Settings:

A single CSV file of 700MB with two integer columns
Tested in both compressed and uncompressed formats.

--

this test is too cpu bounded, add another:

  • one column with a 100-char string
  • add max_threads = 1 and 10

not diff either.

@BohuTANG BohuTANG merged commit 346a955 into databendlabs:main Feb 20, 2024
71 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-refactor this PR changes the code base without new features or bugfix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants