Update timestamps.pyx #60624

Open. Wants to merge 1 commit into base: main.


@johnasiano johnasiano commented Dec 30, 2024

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Proposing a change to fix issue #60583. My understanding is that the bug comes from the code logic in timestamps.pyx.

The first option I considered was modifying normalize in timestamps.pyx. The second was modifying normalize_i8_stamp() in timestamps.pyx and updating its declaration in timestamps.pxd to match the new signature.

datetimes.py would not be modified because my understanding is that this overflow issue is not present in datetimes.

Member

@rhshadrach rhshadrach left a comment

Thanks for the PR! Can you give this PR a more descriptive title? Please follow the guide here:

https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#making-a-pull-request

Also, when changing behavior, always add tests.

_Timestamp ts

# Check for potential overflow before normalization
if local_val < INT64_MIN or local_val > INT64_MAX:
Member

local_val is an int64, so don't these conditions always evaluate to false?
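
The point can be illustrated outside Cython. The following is a minimal Python sketch, not pandas code: it uses ctypes.c_int64 to mimic a 64-bit signed slot, and the helper names as_int64 and range_check_fires are hypothetical. Once a value has been stored as int64 it has already wrapped into range, so the comparison against INT64_MIN and INT64_MAX cannot fire.

```python
import ctypes

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def as_int64(value: int) -> int:
    """Store a Python int in a 64-bit signed slot, wrapping like two's complement."""
    return ctypes.c_int64(value).value


def range_check_fires(local_val: int) -> bool:
    """The condition from the diff, applied to a value already held as int64."""
    return local_val < INT64_MIN or local_val > INT64_MAX


# Includes inputs one step beyond each end of the int64 range.
samples = [0, INT64_MIN, INT64_MAX, INT64_MIN - 1, INT64_MAX + 1]
stored = [as_int64(v) for v in samples]
fired = [range_check_fires(v) for v in stored]
print(fired)  # every entry is False
```

The two out-of-range inputs wrap to the opposite end of the range before the check ever sees them, which is why the check has to happen before the arithmetic rather than after it.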

Author

@johnasiano johnasiano Dec 30, 2024

Yes, that's right.

Since pd.Timestamp.min is set to the smallest possible value in two's complement, subtracting even a small positive number from it cannot make it any "more negative". So would it be fair to say that what was described in the initial issue isn't necessarily a bug in pandas, but rather a constraint of the two's complement arithmetic that pandas uses?

Member

What is the purpose of adding a check that always evaluates to False regardless of the value of local_val?

Member

I think if you declare local_val as int64_t here, you can just add the Cython @cython.overflowcheck(True) decorator to this function. That should greatly simplify what you are trying to do here while being much more performant.

Author

To clarify, @rhshadrach, I meant to say that you were correct in pointing that out, and I agree that the line "if local_val < INT64_MIN or local_val > INT64_MAX:" is not a good way to check for the overflow.

@WillAyd, are you proposing a change to the normalize function, or to int64_t normalize_i8_stamp? I ask because, in the return statement of int64_t normalize_i8_stamp, I think the subtraction of any positive value from local_val, when local_val is the Timestamp.min, is what is causing the wrap around.

Also, what should be the expected behavior if overflow occurs during normalization? For example, should the code raise an exception, return the original timestamp, return NaT, or do something else entirely?
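
The wraparound described in this comment can be sketched in plain Python, where integers do not overflow and the exact result stays visible. This is an illustration under stated assumptions, not the pandas implementation: it assumes normalization floors a nanosecond value to the start of its day via local_val - (local_val % ppd), and it takes the Timestamp.min value to be one above the int64 minimum. The names wrap_int64 and floor_to_day_exact are hypothetical.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_DAY = 24 * 60 * 60 * 10**9  # assumed "periods per day" at nanosecond resolution


def wrap_int64(value: int) -> int:
    """Reduce an exact integer to what a 64-bit two's complement slot would hold."""
    return (value + 2**63) % 2**64 - 2**63


def floor_to_day_exact(local_val: int, ppd: int = NANOS_PER_DAY) -> int:
    """Assumed normalization step: subtract the time-of-day remainder."""
    # Python's % returns a non-negative remainder for a positive divisor.
    return local_val - (local_val % ppd)


near_min = INT64_MIN + 1  # assumption: the value behind pd.Timestamp.min
exact = floor_to_day_exact(near_min)
wrapped = wrap_int64(exact)

print(exact < INT64_MIN)  # True: the exact midnight lies below the int64 range
print(wrapped > 0)        # True: stored as int64, it reappears as a large positive value
```

Under these assumptions the subtracted remainder is about 12 minutes and 43 seconds of nanoseconds, which is more than the single nanosecond of headroom left below the starting value.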

Member

@WillAyd WillAyd Dec 31, 2024

> the subtraction of any positive value from local_val, when local_val is the Timestamp.min, is what is causing the wrap around.

Wherever this is happening, you can use the decorator.

> Also, what should be the expected behavior if overflow occurs during normalization? For example, should the code raise an exception, return the original timestamp, return NaT, or do something else entirely?

It should raise an error. Signed overflow is undefined behavior - we can't do anything about it but raise in advance of that happening.
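
As an illustration of "raise in advance", here is a minimal Python sketch of a subtraction that tests its operands before computing the result. It is not the Cython overflowcheck directive suggested above and not pandas code. The names checked_sub_int64 and normalize_checked are hypothetical, and the floor-to-day formula is an assumption about what normalization does.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def checked_sub_int64(a: int, b: int) -> int:
    """Return a - b, raising before any result could leave the int64 range."""
    # Each guard only evaluates expressions that themselves stay within int64,
    # so the same tests would be safe to write in C.
    if b > 0 and a < INT64_MIN + b:
        raise OverflowError("int64 subtraction would underflow")
    if b < 0 and a > INT64_MAX + b:
        raise OverflowError("int64 subtraction would overflow")
    return a - b


def normalize_checked(local_val: int, ppd: int) -> int:
    """Assumed floor-to-day normalization, using the checked subtraction."""
    return checked_sub_int64(local_val, local_val % ppd)


print(checked_sub_int64(5, 3))  # 2

try:
    normalize_checked(INT64_MIN + 1, 24 * 60 * 60 * 10**9)
except OverflowError as exc:
    print("raised:", exc)
```

Unlike the range check in the diff, these guards run on the operands, so they still mean something when every variable involved is a 64-bit integer.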

@rhshadrach added the Bug and Timestamp (pd.Timestamp and associated methods) labels on Dec 30, 2024
3 participants