Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: use binary for non-utf8 string type #14235

Merged
merged 36 commits into from
Jan 11, 2024

Conversation

andylokandy
Copy link
Collaborator

@andylokandy andylokandy commented Jan 4, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  1. Aggregate state: String -> Binary
  2. Bloom index state: String -> Binary
  3. Group by hash: String -> Binary
  4. Column to arrow: String: LargeBinary -> LargeUtf8
  5. to_base64(String) -> String -> to_base64(Binary) -> String
  6. from_base64(String) -> String -> from_base64(String) -> Binary
  7. to_hex(String) -> String -> to_hex(Binary) -> String
  8. from_hex(String) -> String -> from_hex(String) -> Binary
  9. Add utf8 check when: (a) {Binary, LargeBinary, FixLength}Array -> Column::String (2) native format read DataType::String

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Jan 4, 2024
@andylokandy andylokandy marked this pull request as ready for review January 8, 2024 11:43
@sundy-li
Copy link
Member

sundy-li commented Jan 9, 2024

tests failed in py-binding

@andylokandy
Copy link
Collaborator Author

andylokandy commented Jan 9, 2024

https://github.com/datafuselabs/databend/blob/3709d5f1a01a7c580d732da82934b34a0bd6cd6d/tests/sqllogictests/suites/stage/formats/parquet/parquet_field_types.test#L174-L204

Are these appropriate behaviors? @b41sh @sundy-li

PS:

mysql> select * from infer_schema(location=>'@data/parquet/json_string.parquet');
+-------------+--------+----------+----------+
| column_name | type   | nullable | order_id |
+-------------+--------+----------+----------+
| a           | INT    |        1 |        0 |
| b           | BINARY |        1 |        1 |
+-------------+--------+----------+----------+
2 rows in set (0.02 sec)

mysql> select * from @data/parquet/json_string.parquet;
+------+----------------------------------------------------+
| a    | b                                                  |
+------+----------------------------------------------------+
|    1 | 7B226B31223A227631222C226B32223A227632227D         |
|    2 | 7B226B223A7B226B31223A3132332C226B32223A3435367D7D |
|    3 | 5B312C322C335D                                     |
|    4 | 5B2261222C2262225D                                 |
+------+----------------------------------------------------+

mysql> select typeof(b) from @data/parquet/json_string.parquet;
+-------------+
| typeof(b)   |
+-------------+
| BINARY NULL |
| BINARY NULL |
| BINARY NULL |
| BINARY NULL |
+-------------+

@sundy-li
Copy link
Member

sundy-li commented Jan 9, 2024

select * from @data/parquet/json_string.parquet;

Schema is binary because parquet did not have variant datatype, so we use binary to store variant.
Since we are reading from parquet, it's correct to show as binary.

But if the parquet file is inside fuse engine, it'll show as JSON since we have variant datatype.

@andylokandy
Copy link
Collaborator Author

It's ready for review

@BohuTANG BohuTANG requested a review from sundy-li January 10, 2024 14:28
@sundy-li sundy-li added this pull request to the merge queue Jan 11, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 11, 2024
@BohuTANG BohuTANG merged commit 8791af1 into databendlabs:main Jan 11, 2024
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-refactor this PR changes the code base without new features or bugfix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants