fix: Only maps FIXED_LEN_BYTE_ARRAY to String for uuid type #238

huaxingao · 2024-03-29T22:27:56Z

Which issue does this PR close?

Closes #.

Rationale for this change

UUID is not a SQL type in Spark, but is supported by Iceberg and is mapped to UTF8String. In order to support UUID, we have mapped FIXED_LEN_BYTE_ARRAY to String, but we can only do this if the FIXED_LEN_BYTE_ARRAY is UUID. This PR adds the check so we only map FIXED_LEN_BYTE_ARRAY to String for UUID type.
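To illustrate why the mapping only makes sense for UUID columns, here is a minimal sketch of turning a UUID stored as a FIXED_LEN_BYTE_ARRAY into a string. It assumes the Parquet UUID logical type layout of 16 big-endian bytes. The class and method names are hypothetical, and this is not the conversion code used by Comet or Iceberg.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, for illustration only.
class UuidBytesSketch {
    // Converts a 16-byte, big-endian UUID value into its canonical string form.
    static String uuidBytesToString(byte[] bytes) {
        if (bytes == null || bytes.length != 16) {
            // Any other length cannot be a UUID, so a String mapping would be meaningless.
            throw new IllegalArgumentException("UUID values must be exactly 16 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes); // ByteBuffer is big-endian by default
        long mostSigBits = buf.getLong();
        long leastSigBits = buf.getLong();
        return new UUID(mostSigBits, leastSigBits).toString();
    }
}
```

With the bytes 0x00 through 0x0f as input, this sketch returns `00010203-0405-0607-0809-0a0b0c0d0e0f`. A FIXED_LEN_BYTE_ARRAY of any other length, or one carrying a different logical type such as a decimal, has no such string form, which is the case the added check guards against.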

What changes are included in this PR?

How are these changes tested?

huaxingao · 2024-03-29T22:56:06Z

cc @viirya @snmvaughan

viirya · 2024-03-29T23:05:16Z

common/src/main/java/org/apache/comet/parquet/TypeUtil.java

-            || sparkType == DataTypes.StringType) {
+            || sparkType == DataTypes.StringType
+                && descriptor.getPrimitiveType().getLogicalTypeAnnotation()
+                    instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation) {
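The shape of the new condition can be sketched in isolation. The example below uses stand-in enums rather than the real Spark and Parquet classes, and `earlierBranchMatches` is a hypothetical placeholder for whatever conditions precede this clause in `TypeUtil.java`, which the diff does not show. It relies on `&&` binding tighter than `||` in Java, so the UUID check applies only to the StringType branch.

```java
// Stand-in model of the check; not the actual TypeUtil code.
class FixedLenCheckSketch {
    enum SparkKind { STRING, OTHER }          // stand-in for Spark DataTypes
    enum LogicalKind { NONE, UUID, DECIMAL }  // stand-in for LogicalTypeAnnotation

    // earlierBranchMatches: hypothetical placeholder for the conditions before this clause.
    static boolean acceptsFixedLenByteArray(
            boolean earlierBranchMatches, SparkKind sparkType, LogicalKind annotation) {
        // Parsed as: earlierBranchMatches || (sparkType == STRING && annotation == UUID)
        return earlierBranchMatches
            || sparkType == SparkKind.STRING
                && annotation == LogicalKind.UUID;
    }
}
```

In this model a StringType column is accepted only when the column is annotated as UUID, while the earlier branches are unaffected by the extra check.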


How can we test this?

Another question: this is the Parquet UUID logical type, so why is it Iceberg-specific?

How can we test this?

I made the change locally and used the local build to test UUID in an Iceberg table.

this is the Parquet UUID logical type, so why is it Iceberg-specific?

I don't think Spark supports a UUID type. If the code reaches this line, the type has to be an Iceberg UUID. If we want to be absolutely sure, we can add a flag when creating the ColumnReader in Iceberg.

Hmm, one related question: is the Iceberg reader supported in Comet yet?

It seems that Comet doesn't support the Iceberg reader yet. Once it's added, can we test this then?

Currently, this can only be tested in my local environment.

@viirya If the Spark type is StringType and the LogicalTypeAnnotation is UUID, then this must be an Iceberg UUID column, because only Iceberg maps UUID to Spark StringType. I think the change is safe. Alternatively, we can add an extra parameter to getColumnReader to indicate whether the ColumnReader is an Iceberg ColumnReader.
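The alternative mentioned here, an explicit flag, could look roughly like the sketch below. The names are hypothetical and the booleans stand in for the real type checks. It does not show the actual signature of getColumnReader, only how an extra parameter would tighten the condition.

```java
// Hypothetical sketch of the "extra parameter" alternative; not Comet's API.
class IcebergFlagSketch {
    // isIcebergReader: assumed flag the caller would pass when creating the reader.
    static boolean mapsFixedLenToString(
            boolean sparkTypeIsString, boolean hasUuidAnnotation, boolean isIcebergReader) {
        // The merged change relies on the first two conditions only;
        // the flag would make the Iceberg assumption explicit.
        return sparkTypeIsString && hasUuidAnnotation && isIcebergReader;
    }
}
```

The trade-off is a wider method signature in exchange for not depending on the observation that only Iceberg maps UUID to StringType.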

codecov-commenter · 2024-03-29T23:12:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.48%. Comparing base (aa6ddc5) to head (98ceaab).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##               main     #238   +/-   ##
=========================================
  Coverage     33.48%   33.48%           
  Complexity      776      776           
=========================================
  Files           108      108           
  Lines         37178    37178           
  Branches       8146     8146           
=========================================
  Hits          12448    12448           
  Misses        22107    22107           
  Partials       2623     2623


viirya · 2024-04-04T15:25:55Z

Merged. Thanks.

huaxingao · 2024-04-04T15:42:40Z

Thanks, everyone, for reviewing.

* Only maps FIXED_LEN_BYTE_ARRAY to String for uuid type
* remove redundant code

---------

Co-authored-by: Huaxin Gao <[email protected]>

Only maps FIXED_LEN_BYTE_ARRAY to String for uuid type

243c736

viirya reviewed Mar 29, 2024


viirya approved these changes Apr 4, 2024


remove redundant code

98ceaab

snmvaughan approved these changes Apr 4, 2024


viirya merged commit e0b8db1 into apache:main Apr 4, 2024
28 checks passed

huaxingao deleted the uuid branch April 4, 2024 15:42


viirya Mar 29, 2024

singhpk234 Mar 29, 2024

huaxingao Mar 29, 2024

advancedxy Apr 1, 2024

huaxingao Apr 1, 2024

huaxingao Apr 4, 2024

viirya Apr 4, 2024
