Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: support non-nanos Timedelta objects in Python C API #55213

Conversation

BenjaminHelyer
Copy link

…ich encapsulates the bug. Outlined a couple more tests.
…sted two test cases for ms and added several test cases for us and ns.
… bound of pytimedeltas. Moved constants in timedeltas.pyx to non-magic number constants at top of file.
# exposing C-API functions for testing purposes only
# consistent behavior is NOT guaranteed
return PyDateTime_DELTA_GET_MICROSECONDS(self)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would appreciate any ideas for alternatives on how to expose these C-API functions for pytest. It doesn't seem that there is a direct way to access cdef functions in Pytest unless they are wrapped in a Python function, and I saw this done similarly in the codebase with how _from_value_and_reso exposes the cdef function _timedelta_from_value_and_reso.

Ultimately, I fear that exposing these here will just make the problem worse, as perhaps someone will find this and start trying to use these wrapped C-APIs directly in their Python code. I hope that the comment makes it clear that this usage won't be supported, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, could we just use pyarrow to indirectly test that this works? e.g. a unit tests that tests apache/arrow#37291 (comment)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, so long as there aren't any objections to importing pyarrow in the tests! I see some other test files do use it, so I'll go ahead and go for this method.

Copy link
Author

@BenjaminHelyer BenjaminHelyer Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the suggestion @mroeschke . After looking at it further, one hesitancy I have with using pyarrow for these test cases is that we expect the pyarrow side to be fixed at some point to use the Pandas interface rather than the C-APIs, per the comment from @jorisvandenbossche in the issue on the pyarrow side.

In other words, in using pyarrow for tests, we'd be testing pyarrow's interface, not the C-APIs. So long as this issue is regarding the C-APIs more generally, I suggest we test these directly (unless we foresee a danger to exposing these functions, in which case I agree pyarrow is a better option).

Edit: FWIW I also get some errors at the extremes from the pyarrow side, e.g., pa.scalar(timedelta(seconds=86399999913600)) raises an error while that timedelta is a valid pytimedelta. So it seems this fix is more general than pyarrow, but less general than the entirety of pytimedelta inputs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Generally we don't have any testing with Pythons C APIs, so I'm not sure how much pandas can guarantee compatibility. I just suggested the test case in the PyArrow issue to test a pragmatic real world use case

@BenjaminHelyer
Copy link
Author

If someone can confirm that the CI docs/mypy failures are due to my changes, let me know and I'll address. Right now I can't tell how these are related, and the same two failures appear to be failing on other recent PRs, so I'm going to pause on spending time root causing them until someone can confirm otherwise.

# whereas -999999999 days, -86399 seconds is not in the range

# upper bound for unit seconds: 1000000000 days * 86400 s/day
cdef int64_t PYTIMEDELTA_UPPER_S = 86400000000000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these importable from Python.h? Or where did these come from?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out -- found the constant in the Python datetime module today and just made a new commit that uses it. Left comments and examples as they are as I think it's illustrative to show how large these bounds are. Let me know if you have further thoughts on how this was done.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realized I'm doing this in the wrong way...trying a couple of alternatives now to import from CPython headers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally realized that the constant is not defined in datetime.h, but in some other c file in CPython. Defined it in the exact same way in this pandas file, but we can always squash this revision to just use magic numbers as in the first set of commits.

If there's a way I could import the constant from that c file in CPython, do let me know! I just can't think of or find a clear cut way to do that.

if value < PYTIMEDELTA_UPPER_MS and value >= PYTIMEDELTA_LOWER_MS:
td_base = _Timedelta.__new__(cls, milliseconds=int(value))
else:
td_base = _Timedelta.__new__(cls, milliseconds=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we just want to raise before sending to the C-API for pytimedeltas? Using 0 here is a bit arbitrary

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I left it as zero thinking that might help with backwards-compatibility, since that was the previous behavior. I think (though I may be wrong, please correct me) that pandas supports a wider range of timedelta values than python, so we don't want to break existing code that leverages that feature of pandas in an attempt to make these python C-APIs work.

Perhaps we raise a warning rather than an error? Quietly sending a zero was part of the problem with the previous implementation, so perhaps we should at least raise a warning showing that the zero behavior is intentional.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I coded up a few possibilities with a warning, then ended up scratching them all due to the potential confusion to users. So how about this simple implementation? Or if this sparks any different ideas for you, I'm all ears!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I still don't really understand the point of setting to 0 - wouldn't we want to just let these raise? If you do so you can really simplify your test cases as well; just set up scenarios that we know should and shouldn't exceed the bounds and test the constructor against those

@jorisvandenbossche for thoughts in case I am overlooking something

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to let them raise given the original comment in the code, which reads:

# For millisecond and second resos, we cannot actually pass int(value) because 
#  many cases would fall outside of the pytimedelta implementation bounds. 

@jorisvandenbossche also alluded to cases in which pd.Timedelta allows for values outside of the range allowed by pytimedelta, but I've been unable to verify if any of these actually exist.

Agree that we need further clarity -- if there's no supported case in pandas in which a pytimedelta error would be raised, then we should just let the pytimedelta errors raise. But the comment in the code and Joris' original description make me concerned that there might be such cases, in which case we want to allow Pandas to proceed on without pytimedelta raising an error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (though I may be wrong, please correct me) that pandas supports a wider range of timedelta values than python

Correct. Non-nanosecond support for timestamps/timedeltas means we can create Timestamps/Timedeltas that the stdlib datetime/timedelta cannot handle. Passing zero here was the best idea I had, but I'm open to other options. I don't think "just raise" is compatible with current non-nano support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our timedelta still inherits from the CPython timedelta right? If so, I think we need to first figure out a way to decouple that if we plan on supporting things outside of the range as a precursor. My main concern is that we'd be fighting the inheritance tree otherwise

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timedelta subclasses pytimedelta yes. But the C-API does not respect subclasses, so we aren't doing anything wrong inheritance tree-wise


# the relevant constant is defined in CPython _datetimemodule.c,
# rather than datetime.h, so we have to define it again here
cdef int64_t MAX_DELTA_DAYS = 999999999
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so I see these are located in "_datetimemodule.c" so its not part of the CPython API. I don't think we should be doing this at all then, as we are mixing into CPython internals.

It looks like CPython anyways already does bounds checking when creating a timedelta, so we should just rely on that.

https://github.com/python/cpython/blob/d4cea794a7b9b745817d2bd982d35412aef04710/Modules/_datetimemodule.c#L1101

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. The key here is to avoid breaking cases in which pd.Timedelta allows for values outside the bounds of pytimedelta, as referenced by the original code comment. I’m thinking we try to pass the value, and pass zero upon an OverflowError from CPython (possibly raising a warning). Is that along the lines of what you're thinking?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loosely. I'm still probably missing the bigger picture, but I think at least catching the OverflowError is headed in the right direction. So worth trying that instead and see where that takes us

@jbrockmendel
Copy link
Member

Catching up a bit, im skeptical that it is possible to make the python C API work with non-nanosecond Timedeltas. (same issues will apply to Timestamps).

@BenjaminHelyer
Copy link
Author

Catching up a bit, im skeptical that it is possible to make the python C API work with non-nanosecond Timedeltas. (same issues will apply to Timestamps).

Are you saying that we can make the C-API work with non-nanosecond Timedeltas, or that we can't make it work?

If it's the former, I'm all ears for a broader way to do this. If it's the latter, I may be misunderstanding the issue. Some of my test cases (specifically the ones in test_timedelta_as_unit_conversion) do successfully transform a non-nanosecond unit into one that is within the bounds of the C-APIs. The C-APIs are then accessed to confirm that the Timedelta was successfully transformed (and not just zero). This was to address the concern of the original bug; I also tested it with pyarrow to the same effect. Is this different from the behavior that we're concerned about?

@jbrockmendel
Copy link
Member

Are you saying that we can make the C-API work with non-nanosecond Timedeltas, or that we can't make it work?

I believe we can not make the C-API work in the general case. Downstream libraries will need to use the python API.

@mroeschke
Copy link
Member

I think this PR should handle the "best effort" compatibility with the C-API for now.

… as well as using a Timedelta to assert on the last two tests.
@BenjaminHelyer
Copy link
Author

BenjaminHelyer commented Sep 25, 2023

Forgive me for asking a potentially dumb question here, but what if we only support large, low-resolution values within the bounds of pytimedeltas, while supporting small, higher-resolution values outside the bounds of pytimedeltas (i.e., nanoseconds and smaller)? Just because we can create larger low-resolution values doesn't necessarily mean we should support them.

I ask this mostly because I foresee many use cases for nanosecond and sub-nanosecond timedeltas (I myself have run into them "in the wild"), but I don't foresee very many use cases for large, low-resolution timedeltas greater than the order of 2 million years. Perhaps in fields like astrophysics and geology this would be important, but at that point we might as well consider whether it's worth it to support even lower resolution timedeltas which are as long as the age of the universe.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 27, 2023
@BenjaminHelyer
Copy link
Author

Bumping this issue awake again: do we want to continue pressing forward on a solution to #54682 with this PR? Or do we want the conclusion to be that, since downstream libraries need to use the python API, we won't support the C-APIs for timedeltas?

Would appreciate any thoughts from @jbrockmendel and @mroeschke . If we don't want to move forward with a fix in Pandas, we should probably close #54682 and move further discussion over to the PyArrow side.

@jbrockmendel
Copy link
Member

AFAICT what is being asked is impossible and this should be closed.

@BenjaminHelyer
Copy link
Author

AFAICT what is being asked is impossible and this should be closed.

That aligns with my interpretation + learnings from the code as well. I think we can fix this specific issue with PyArrow, but the general problem would still remain.

@mroeschke unless you have significant objections, I'll close this PR and then the issue on the Pandas side can be closed. The fix can then be continued on the PyArrow side.

@mroeschke
Copy link
Member

I don't have objection to deem this out of scope for pandas. Even though Timstamp/Timedelta subclass the associated Python objects, we have no compatibility testing with C-APIs. IMO it would still be nice to have compatibility, but I will defer to @jbrockmendel in terms of feasibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: support non-nanos Timedelta objects in Python C API
4 participants