Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Overflow when reading Timestamp from parquet file #542

Merged
merged 2 commits into from
Jun 12, 2024

Conversation

eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Jun 8, 2024

Which issue does this PR close?

Closes #481.

Rationale for this change

When spark reads and writes timestamps in parquet file it using the following code: https://github.com/apache/spark/blob/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L48-L66 to convert from the Long microsecond timestamp into the julian date + nano second format used in parquet. Because of the logic is implement dates like 290000-12-31T01:00:00+02:00 will lead to overflow both when encoding and decoding the value. Since it for both the reading and writing it "cancels out" and still gives the expected results. This means the date in year 290000 is stored with a negative day offset, which to me is a bit unexpected.

What changes are included in this PR?

This changes the comet code to use wrapping_mul/add to make it explicit that wrapping overflow is expected and needed to match Spark behavior.

How are these changes tested?

Existing and new unit test in CometCastSuite.

@eejbyfeldt eejbyfeldt marked this pull request as ready for review June 8, 2024 18:11
Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @eejbyfeldt

@eejbyfeldt eejbyfeldt force-pushed the i481-parquet-overflow branch from da1521d to a4be042 Compare June 11, 2024 15:08
@andygrove andygrove merged commit baad197 into apache:main Jun 12, 2024
43 checks passed
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* fix: Overflow when reading Timestamp from parquet file

* Add helper method
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: attempt to multiply with overflow in cast string to date
4 participants