BUG: support non-nanos Timedelta objects in Python C API #55213

BenjaminHelyer · 2023-09-20T04:18:32Z

closes BUG: support non-nanos Timedelta objects in Python C API #54682
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

BenjaminHelyer · 2023-09-20T04:34:05Z

pandas/_libs/tslibs/timedeltas.pyx

+        # exposing C-API functions for testing purposes only
+        # consistent behavior is NOT guaranteed
+        return PyDateTime_DELTA_GET_MICROSECONDS(self)
+


Would appreciate any ideas for alternatives on how to expose these C-API functions for pytest. It doesn't seem that there is a direct way to access cdef functions in Pytest unless they are wrapped in a Python function, and I saw this done similarly in the codebase with how _from_value_and_reso exposes the cdef function _timedelta_from_value_and_reso.

Ultimately, I fear that exposing these here will just make the problem worse, as perhaps someone will find this and start trying to use these wrapped C-APIs directly in their Python code. I hope that the comment makes it clear that this usage won't be supported, though.

Instead, could we just use pyarrow to indirectly test that this works? e.g. a unit test that covers apache/arrow#37291 (comment)

Sure, so long as there aren't any objections to importing pyarrow in the tests! I see some other test files do use it, so I'll go ahead and go for this method.

Thanks again for the suggestion @mroeschke . After looking at it further, one hesitancy I have with using pyarrow for these test cases is that we expect the pyarrow side to be fixed at some point to use the Pandas interface rather than the C-APIs, per the comment from @jorisvandenbossche in the issue on the pyarrow side.

In other words, in using pyarrow for tests, we'd be testing pyarrow's interface, not the C-APIs. So long as this issue is regarding the C-APIs more generally, I suggest we test these directly (unless we foresee a danger to exposing these functions, in which case I agree pyarrow is a better option).

Edit: FWIW I also get some errors at the extremes from the pyarrow side, e.g., pa.scalar(timedelta(seconds=86399999913600)) raises an error while that timedelta is a valid pytimedelta. So it seems this fix is more general than pyarrow, but less general than the entirety of pytimedelta inputs.
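
For reference, a quick stdlib-only check of the boundary mentioned above (a sketch for illustration, not code from this PR; the variable names are made up here). It shows that 86399999913600 seconds is exactly 999999999 whole days, which is still inside the pytimedelta range, and that one more day overflows:

```python
from datetime import timedelta

SECONDS_PER_DAY = 86_400

# The value from the comment above: exactly 999999999 whole days.
td = timedelta(seconds=86_399_999_913_600)
is_whole_days = (td.days, td.seconds, td.microseconds) == (999_999_999, 0, 0)

# It is still below the largest representable pytimedelta.
below_max = td < timedelta.max

# One more day pushes the day count past 999999999, and CPython raises.
try:
    timedelta(seconds=86_399_999_913_600 + SECONDS_PER_DAY)
    overflowed = False
except OverflowError:
    overflowed = True
```

So the stdlib accepts this input without complaint; any error for it has to come from somewhere other than the pytimedelta constructor.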

I see. Generally we don't have any testing against Python's C APIs, so I'm not sure how much compatibility pandas can guarantee. I just suggested the test case from the PyArrow issue as a way to test a pragmatic, real-world use case.

BenjaminHelyer · 2023-09-20T13:14:49Z

If someone can confirm that the CI docs/mypy failures are due to my changes, let me know and I'll address. Right now I can't tell how these are related, and the same two failures appear to be failing on other recent PRs, so I'm going to pause on spending time root causing them until someone can confirm otherwise.

WillAyd · 2023-09-20T20:00:49Z

pandas/_libs/tslibs/timedeltas.pyx

+# whereas -999999999 days, -86399 seconds is not in the range
+
+# upper bound for unit seconds: 1000000000 days * 86400 s/day
+cdef int64_t PYTIMEDELTA_UPPER_S = 86400000000000


Are these importable from Python.h? If not, where did they come from?

Thanks for pointing that out -- found the constant in the Python datetime module today and just made a new commit that uses it. Left comments and examples as they are as I think it's illustrative to show how large these bounds are. Let me know if you have further thoughts on how this was done.

Realized I'm doing this in the wrong way...trying a couple of alternatives now to import from CPython headers.

Finally realized that the constant is not defined in datetime.h, but in another C file in CPython. Defined it in the exact same way in this pandas file, but we can always squash this revision to just use magic numbers as in the first set of commits.

If there's a way I could import the constant from that c file in CPython, do let me know! I just can't think of or find a clear cut way to do that.
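
One possible way to see where the numbers come from without touching CPython internals is to derive them from the public `datetime.timedelta.min` / `datetime.timedelta.max` attributes. The sketch below is only an illustration under that assumption; `pytimedelta_bounds`, `UNITS_PER_DAY`, and the other names are hypothetical and do not appear in the PR:

```python
from datetime import timedelta

# timedelta.max is 999999999 days, 23:59:59.999999, so the exclusive
# upper bound is 1_000_000_000 whole days.
UPPER_DAYS_EXCLUSIVE = timedelta.max.days + 1
# timedelta.min is exactly -999999999 days, so the lower bound is inclusive.
LOWER_DAYS_INCLUSIVE = timedelta.min.days

UNITS_PER_DAY = {"s": 86_400, "ms": 86_400_000, "us": 86_400_000_000}


def pytimedelta_bounds(unit):
    """Return (inclusive lower, exclusive upper) as integer counts of `unit`."""
    per_day = UNITS_PER_DAY[unit]
    return LOWER_DAYS_INCLUSIVE * per_day, UPPER_DAYS_EXCLUSIVE * per_day


lower_s, upper_s = pytimedelta_bounds("s")
```

Here `upper_s` comes out equal to the `PYTIMEDELTA_UPPER_S` value in the diff above. This only works at the Python level, though; it does not make the constant available as a C-level import.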

WillAyd · 2023-09-20T20:01:54Z

pandas/_libs/tslibs/timedeltas.pyx

+        if value < PYTIMEDELTA_UPPER_MS and value >= PYTIMEDELTA_LOWER_MS:
+            td_base = _Timedelta.__new__(cls, milliseconds=int(value))
+        else:
+            td_base = _Timedelta.__new__(cls, milliseconds=0)


Wouldn't we just want to raise before sending to the C-API for pytimedeltas? Using 0 here is a bit arbitrary

Good question. I left it as zero thinking that might help with backwards-compatibility, since that was the previous behavior. I think (though I may be wrong, please correct me) that pandas supports a wider range of timedelta values than python, so we don't want to break existing code that leverages that feature of pandas in an attempt to make these python C-APIs work.

Perhaps we raise a warning rather than an error? Quietly sending a zero was part of the problem with the previous implementation, so perhaps we should at least raise a warning showing that the zero behavior is intentional.

I coded up a few possibilities with a warning, then ended up scratching them all due to the potential confusion to users. So how about this simple implementation? Or if this sparks any different ideas for you, I'm all ears!

Hmm I still don't really understand the point of setting to 0 - wouldn't we want to just let these raise? If you do so you can really simplify your test cases as well; just set up scenarios that we know should and shouldn't exceed the bounds and test the constructor against those

@jorisvandenbossche for thoughts in case I am overlooking something

I'm hesitant to let them raise given the original comment in the code, which reads:

# For millisecond and second resos, we cannot actually pass int(value) because
# many cases would fall outside of the pytimedelta implementation bounds.

@jorisvandenbossche also alluded to cases in which pd.Timedelta allows for values outside of the range allowed by pytimedelta, but I've been unable to verify if any of these actually exist.

Agree that we need further clarity -- if there's no supported case in pandas in which a pytimedelta error would be raised, then we should just let the pytimedelta errors raise. But the comment in the code and Joris' original description make me concerned that there might be such cases, in which case we want to allow Pandas to proceed on without pytimedelta raising an error.

I think (though I may be wrong, please correct me) that pandas supports a wider range of timedelta values than python

Correct. Non-nanosecond support for timestamps/timedeltas means we can create Timestamps/Timedeltas that the stdlib datetime/timedelta cannot handle. Passing zero here was the best idea I had, but I'm open to other options. I don't think "just raise" is compatible with current non-nano support.

Our timedelta still inherits from the CPython timedelta right? If so, I think we need to first figure out a way to decouple that if we plan on supporting things outside of the range as a precursor. My main concern is that we'd be fighting the inheritance tree otherwise

Timedelta subclasses pytimedelta yes. But the C-API does not respect subclasses, so we aren't doing anything wrong inheritance tree-wise
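
To illustrate the point about subclasses, here is a pure-Python sketch. `WideDelta` and `base_days` are hypothetical names invented for this example, not pandas code, and reading the base-class descriptor is only an approximation of what a C-API macro such as `PyDateTime_DELTA_GET_DAYS` does when it reads the stored fields directly:

```python
from datetime import timedelta


class WideDelta(timedelta):
    """Hypothetical subclass that tracks a wider range than its base type."""

    def __new__(cls, seconds):
        try:
            self = super().__new__(cls, seconds=seconds)
        except OverflowError:
            # Out of range for the base type: store a zero placeholder there.
            self = super().__new__(cls, 0)
        self._total_seconds = seconds
        return self

    @property
    def days(self):
        # Python-level override that reports the wide value.
        return self._total_seconds // 86_400


def base_days(td):
    # Read the field stored by the base type, bypassing the override.
    return timedelta.days.__get__(td)


huge = WideDelta(seconds=10**15)
python_view = huge.days        # what Python-level callers see
c_like_view = base_days(huge)  # what the base type actually stores
```

For an out-of-range value the two views disagree: the Python attribute reports the wide value while the stored base field is the zero placeholder. For in-range values they agree.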

WillAyd · 2023-09-21T14:36:49Z

pandas/_libs/tslibs/timedeltas.pyx

+
+# the relevant constant is defined in CPython _datetimemodule.c,
+# rather than datetime.h, so we have to define it again here
+cdef int64_t MAX_DELTA_DAYS = 999999999


OK, so I see these are located in "_datetimemodule.c", so it's not part of the CPython API. I don't think we should be doing this at all then, as we are mixing into CPython internals.

It looks like CPython anyways already does bounds checking when creating a timedelta, so we should just rely on that.

https://github.com/python/cpython/blob/d4cea794a7b9b745817d2bd982d35412aef04710/Modules/_datetimemodule.c#L1101

Makes sense. The key here is to avoid breaking cases in which pd.Timedelta allows for values outside the bounds of pytimedelta, as referenced by the original code comment. I’m thinking we try to pass the value, and pass zero upon an OverflowError from CPython (possibly raising a warning). Is that along the lines of what you're thinking?

Loosely. I'm still probably missing the bigger picture, but I think at least catching the OverflowError is headed in the right direction. So worth trying that instead and see where that takes us
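
A rough Python-level sketch of the "try, then fall back on OverflowError" idea discussed above. This is not the Cython code from the PR; `base_fields_for_ms` is a hypothetical helper that relies only on CPython's own bounds check in the `timedelta` constructor:

```python
from datetime import timedelta


def base_fields_for_ms(value):
    """Return (pytimedelta, fits) for an integer millisecond count.

    `fits` is False when CPython rejects the value and the zero
    placeholder is used instead.
    """
    try:
        return timedelta(milliseconds=value), True
    except OverflowError:
        return timedelta(0), False


in_range = base_fields_for_ms(1_500)
out_of_range = base_fields_for_ms(86_400_000 * 1_000_000_000)
```

Returning the `fits` flag alongside the placeholder is one way to keep the zero from being silent; whether to warn, raise, or ignore it is still the open question in this thread.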

jbrockmendel · 2023-09-23T20:10:50Z

Catching up a bit, I'm skeptical that it is possible to make the Python C API work with non-nanosecond Timedeltas. (The same issues will apply to Timestamps.)

BenjaminHelyer · 2023-09-24T00:35:29Z

Catching up a bit, I'm skeptical that it is possible to make the Python C API work with non-nanosecond Timedeltas. (The same issues will apply to Timestamps.)

Are you saying that we can make the C-API work with non-nanosecond Timedeltas, or that we can't make it work?

If it's the former, I'm all ears for a broader way to do this. If it's the latter, I may be misunderstanding the issue. Some of my test cases (specifically the ones in test_timedelta_as_unit_conversion) do successfully transform a non-nanosecond unit into one that is within the bounds of the C-APIs. The C-APIs are then accessed to confirm that the Timedelta was successfully transformed (and not just zero). This was to address the concern of the original bug; I also tested it with pyarrow to the same effect. Is this different from the behavior that we're concerned about?

jbrockmendel · 2023-09-24T14:53:08Z

Are you saying that we can make the C-API work with non-nanosecond Timedeltas, or that we can't make it work?

I believe we cannot make the C-API work in the general case. Downstream libraries will need to use the Python API.

mroeschke · 2023-09-25T17:46:14Z

I think this PR should handle the "best effort" compatibility with the C-API for now.

BenjaminHelyer · 2023-09-25T23:51:26Z

Forgive me for asking a potentially dumb question here, but what if we only support large, low-resolution values within the bounds of pytimedeltas, while supporting small, higher-resolution values outside the bounds of pytimedeltas (i.e., nanoseconds and smaller)? Just because we can create larger low-resolution values doesn't necessarily mean we should support them.

I ask this mostly because I foresee many use cases for nanosecond and sub-nanosecond timedeltas (I myself have run into them "in the wild"), but I don't foresee very many use cases for large, low-resolution timedeltas greater than the order of 2 million years. Perhaps in fields like astrophysics and geology this would be important, but at that point we might as well consider whether it's worth it to support even lower resolution timedeltas which are as long as the age of the universe.
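
For a sense of scale, here is some back-of-the-envelope arithmetic, assuming a signed 64-bit count at each resolution and a 365.25-day year (both are assumptions made for this estimate; the names are illustrative):

```python
from datetime import timedelta

DAYS_PER_YEAR = 365.25  # Julian year, used only for a rough estimate
SECONDS_PER_DAY = 86_400
INT64_MAX = 2**63 - 1

# Largest span a pytimedelta can hold, in years (roughly 2.7 million).
pytimedelta_years = timedelta.max.days / DAYS_PER_YEAR


def int64_span_years(units_per_second):
    """Approximate years covered by INT64_MAX counts at a given resolution."""
    seconds = INT64_MAX / units_per_second
    return seconds / SECONDS_PER_DAY / DAYS_PER_YEAR


span_years = {
    "ns": int64_span_years(10**9),
    "us": int64_span_years(10**6),
    "ms": int64_span_years(10**3),
    "s": int64_span_years(1),
}

# Resolutions whose full int64 range does not fit inside a pytimedelta.
exceeds_pytimedelta = [u for u, y in span_years.items() if y > pytimedelta_years]
```

Under these assumptions the nanosecond range is a few hundred years and the microsecond range a few hundred thousand, both inside the pytimedelta span, while millisecond and second resolutions go far beyond it. Those two are the cases the zero placeholder is about.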

github-actions · 2023-10-27T00:05:22Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

BenjaminHelyer · 2023-11-02T20:54:46Z

Bumping this issue awake again: do we want to continue pressing forward on a solution to #54682 with this PR? Or do we want the conclusion to be that, since downstream libraries need to use the python API, we won't support the C-APIs for timedeltas?

Would appreciate any thoughts from @jbrockmendel and @mroeschke . If we don't want to move forward with a fix in Pandas, we should probably close #54682 and move further discussion over to the PyArrow side.

jbrockmendel · 2023-11-02T21:06:25Z

AFAICT what is being asked is impossible and this should be closed.

BenjaminHelyer · 2023-11-02T21:12:00Z

AFAICT what is being asked is impossible and this should be closed.

That aligns with my interpretation + learnings from the code as well. I think we can fix this specific issue with PyArrow, but the general problem would still remain.

@mroeschke unless you have significant objections, I'll close this PR and then the issue on the Pandas side can be closed. The fix can then be continued on the PyArrow side.

mroeschke · 2023-11-02T22:18:03Z

I have no objection to deeming this out of scope for pandas. Even though Timestamp/Timedelta subclass the associated Python objects, we have no compatibility testing with the C-APIs. IMO it would still be nice to have compatibility, but I will defer to @jbrockmendel in terms of feasibility.

BenjaminHelyer added 7 commits September 18, 2023 22:12

Exposed C-APIs for testing. Added minimal test, initially failing, wh…

ab7ee77

…ich encapsulates the bug. Outlined a couple more tests.

Exposed _get_unit_from_dtype in timdeltas.pyx for testing purposes. I…

45d91c3

…mplemented more test cases.

Added minimum working code to pass test cases to timedeltas.pyx. Adju…

9e251cc

…sted two test cases for ms and added several test cases for us and ns.

Adjusted timedeltas.pyx and added tests for inclusive nature of lower…

788c2da

… bound of pytimedeltas. Moved constants in timedeltas.pyx to non-magic number constants at top of file.

Revised comments to reflect bugfix.

89e875b

Merge branch 'main' of https://github.com/pandas-dev/pandas into bugf…

27c6dac

…ix-non-nano-timedelta-c-api

Updated docs.

7761a08

BenjaminHelyer requested a review from MarcoGorelli as a code owner September 20, 2023 04:18

BenjaminHelyer commented Sep 20, 2023

mroeschke requested a review from jbrockmendel September 20, 2023 16:58

WillAyd reviewed Sep 20, 2023

BenjaminHelyer added 3 commits September 20, 2023 20:49

Changed bound constants to be based on python.h constants.

ca933ac

Removed import for MAX_DELTA_DAYS as it appears its already defined i…

d6533df

…nternally in cython.

Defined MAX_DELTA_DAYS explicitly.

b285d35

WillAyd reviewed Sep 21, 2023

Added try/except block for exceeding pytimedelta bounds. Removed hard…

303747c

…-coded check for pytimedelta bounds.

Changed tests to check pyarrow functionality (original source of bug)…

28947b9

… as well as using a Timedelta to assert on the last two tests.

github-actions bot added the Stale label Oct 27, 2023

BenjaminHelyer closed this Nov 3, 2023

BenjaminHelyer mentioned this pull request Nov 3, 2023

BUG: support non-nanos Timedelta objects in Python C API #54682

Open
