fix: Overflow when reading Timestamp from parquet file #542
Which issue does this PR close?
Closes #481.
Rationale for this change
When Spark reads and writes timestamps in Parquet files it uses the following code: https://github.com/apache/spark/blob/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L48-L66 to convert between the Long microsecond timestamp and the Julian day + nanosecond format used in Parquet. Because of how that logic is implemented, dates like
290000-12-31T01:00:00+02:00
lead to overflow both when encoding and when decoding the value. Because the overflow happens on both the write and the read path, it "cancels out" and the round trip still gives the expected result. It does mean, however, that a date in year 290000 is stored with a negative Julian day offset, which to me is a bit unexpected.
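For illustration, here is a minimal Rust sketch of that round trip, mirroring the Spark logic linked above (the constants match DateTimeUtils; the function names are illustrative and not the actual Comet code):

```rust
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const MICROS_PER_DAY: i64 = 86_400_000_000;
const NANOS_PER_MICRO: i64 = 1_000;

/// Write path (Spark): microseconds since epoch -> (Julian day, nanos within the day).
/// For timestamps far enough in the future (like the year-290000 example above)
/// the addition exceeds i64::MAX, so the stored Julian day ends up negative.
fn to_julian_day(micros: i64) -> (i32, i64) {
    let julian_us = micros.wrapping_add(JULIAN_DAY_OF_EPOCH * MICROS_PER_DAY);
    let day = julian_us / MICROS_PER_DAY;
    let us = julian_us % MICROS_PER_DAY;
    (day as i32, us * NANOS_PER_MICRO)
}

/// Read path (Comet): (Julian day, nanos within the day) -> microseconds since epoch.
/// The multiplication wraps again for the negative day, so the two overflows
/// cancel out and the original microsecond value comes back.
fn from_julian_day(day: i32, nanos: i64) -> i64 {
    (day as i64 - JULIAN_DAY_OF_EPOCH)
        .wrapping_mul(MICROS_PER_DAY)
        .wrapping_add(nanos / NANOS_PER_MICRO)
}
```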
What changes are included in this PR?
This changes the Comet code to use wrapping_mul/wrapping_add to make it explicit that wrapping on overflow is expected and needed to match Spark's behavior.
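To make the motivation concrete, here is a small self-contained example (the day value is illustrative, roughly the magnitude that ends up in the file for a year-290000 timestamp). With plain arithmetic the intermediate product no longer fits in an i64, so a debug build, or any build with overflow checks enabled, would panic; wrapping_mul spells out the two's-complement wrap that the JVM performs silently on the Spark side.

```rust
fn main() {
    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
    const MICROS_PER_DAY: i64 = 86_400_000_000;

    // Illustrative negative Julian day, roughly what the write-side overflow
    // produces for a year-290000 timestamp.
    let day: i64 = -105_900_000;

    // checked_mul returns None here: the product falls below i64::MIN, so a
    // plain `*` would panic when overflow checks are enabled.
    assert_eq!((day - JULIAN_DAY_OF_EPOCH).checked_mul(MICROS_PER_DAY), None);

    // wrapping_mul makes the wrap explicit and matches the JVM's silent
    // two's-complement behavior.
    let micros = (day - JULIAN_DAY_OF_EPOCH).wrapping_mul(MICROS_PER_DAY);
    println!("wrapped microseconds: {micros}");
}
```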
How are these changes tested?
Existing and new unit tests in CometCastSuite.