
fix: Only maps FIXED_LEN_BYTE_ARRAY to String for uuid type #238

Merged: 2 commits into apache:main on Apr 4, 2024

Conversation

huaxingao
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

UUID is not a SQL type in Spark, but Iceberg supports it and maps it to UTF8String. To support UUID, we mapped FIXED_LEN_BYTE_ARRAY to String, but that mapping is only valid when the FIXED_LEN_BYTE_ARRAY carries the UUID logical type. This PR adds that check, so FIXED_LEN_BYTE_ARRAY is mapped to String only for the UUID type.
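For readers unfamiliar with the representation: the Parquet UUID logical type annotates a 16-byte FIXED_LEN_BYTE_ARRAY stored in big-endian order. The sketch below shows how such a value corresponds to the usual string form. `UuidBytes` and `toUuidString` are hypothetical names used only for illustration; this is not the code path Comet uses.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical helper (not part of Comet): decodes a 16-byte UUID value to its string form. */
final class UuidBytes {
  private UuidBytes() {}

  /** Converts a 16-byte big-endian value into the canonical 36-character UUID string. */
  static String toUuidString(byte[] fixedLenBytes) {
    if (fixedLenBytes.length != 16) {
      throw new IllegalArgumentException(
          "UUID requires exactly 16 bytes, got " + fixedLenBytes.length);
    }
    // ByteBuffer reads in big-endian order by default, matching the Parquet layout.
    ByteBuffer buf = ByteBuffer.wrap(fixedLenBytes);
    long mostSigBits = buf.getLong();
    long leastSigBits = buf.getLong();
    return new UUID(mostSigBits, leastSigBits).toString();
  }

  public static void main(String[] args) {
    byte[] bytes = new byte[16];
    for (int i = 0; i < 16; i++) {
      bytes[i] = (byte) i;
    }
    // prints "00010203-0405-0607-0809-0a0b0c0d0e0f"
    System.out.println(toUuidString(bytes));
  }
}
```

A FIXED_LEN_BYTE_ARRAY of any other length, or one without the UUID annotation, has no such string interpretation, which is why the mapping needs the guard.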

What changes are included in this PR?

How are these changes tested?

@huaxingao
Contributor Author

cc @viirya @snmvaughan

-        || sparkType == DataTypes.StringType) {
+        || sparkType == DataTypes.StringType
+            && descriptor.getPrimitiveType().getLogicalTypeAnnotation()
+                instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation) {
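Because `&&` binds more tightly than `||` in Java, the new UUID test guards only the StringType branch; whatever condition precedes the `||` is unaffected. A minimal model of that logic follows, assuming plain booleans in place of the real Spark and Parquet types. `UuidMappingCheck`, `accepts`, and its parameter names are hypothetical, and `precedingCondition` stands in for the part of the expression not shown in the diff.

```java
/** Hypothetical model of the changed condition; the real check uses Spark and Parquet types. */
final class UuidMappingCheck {
  private UuidMappingCheck() {}

  /**
   * Mirrors the shape of the new expression: A || B && C.
   * Java parses this as A || (B && C), so the UUID requirement
   * applies only when the Spark type is StringType.
   */
  static boolean accepts(
      boolean precedingCondition, boolean isStringType, boolean isUuidAnnotated) {
    return precedingCondition || isStringType && isUuidAnnotated;
  }

  public static void main(String[] args) {
    // StringType without the UUID annotation is no longer accepted.
    System.out.println(accepts(false, true, false)); // prints "false"
    // StringType with the UUID annotation is accepted.
    System.out.println(accepts(false, true, true)); // prints "true"
    // The preceding branch still matches regardless of the annotation.
    System.out.println(accepts(true, false, false)); // prints "true"
  }
}
```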
Member

How can we test this?

Another question: this is the Parquet UUID logical type, so why is it Iceberg-specific?

+1

Contributor Author

> How can we test this?

I made the change locally and used the local build to test UUID columns in an Iceberg table.

> this is parquet uuid logical type, why it is Iceberg specified?

I don't think Spark supports a UUID type. If the code reaches this line, the type has to be an Iceberg UUID. If we want to be absolutely sure, we can add a flag when creating the ColumnReader in Iceberg.

Contributor

Hmm, one related question: is the Iceberg reader supported in Comet yet?

It seems that Comet doesn't support the Iceberg reader yet. Once it's added, can we test this then?

Contributor Author

Currently, this can only be tested locally on my machine.

Contributor Author

@viirya If the Spark type is StringType and the LogicalTypeAnnotation is UUID, then this must be an Iceberg UUID column, because only Iceberg maps UUID to Spark's StringType. I feel the change is safe. Alternatively, we can add an extra parameter to getColumnReader to indicate whether the ColumnReader is an Iceberg ColumnReader.

Member

Okay.

@codecov-commenter

codecov-commenter commented Mar 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.48%. Comparing base (aa6ddc5) to head (98ceaab).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main     #238   +/-   ##
=========================================
  Coverage     33.48%   33.48%           
  Complexity      776      776           
=========================================
  Files           108      108           
  Lines         37178    37178           
  Branches       8146     8146           
=========================================
  Hits          12448    12448           
  Misses        22107    22107           
  Partials       2623     2623           


@viirya viirya merged commit e0b8db1 into apache:main Apr 4, 2024
28 checks passed
@viirya
Member

viirya commented Apr 4, 2024

Merged. Thanks.

@huaxingao
Contributor Author

Thanks, everyone, for reviewing.

@huaxingao huaxingao deleted the uuid branch April 4, 2024 15:42
snmvaughan pushed a commit to snmvaughan/arrow-datafusion-comet that referenced this pull request Apr 4, 2024
* Only maps FIXED_LEN_BYTE_ARRAY to String for uuid type

* remove redundant code

---------

Co-authored-by: Huaxin Gao <[email protected]>
6 participants