[READY] perf improvements for strftime #51298
base: main
Conversation
…string formatting
…nvert_strftime_format` (raises `UnsupportedStrFmtDirective`). This function converts a `strftime` date format string into a native python formatting string
… use faster datetime formatting.
…faster datetime formatting. `_format_native_types` modified with this new argument too. Subclasses modified to support it (`DatetimeArray`, `PeriodArray`, `TimedeltaArray`, `DatetimeIndex`)
… argument `fast_strftime` to use faster datetime formatting.
…nit__`: new boolean argument `fast_strftime` to use faster datetime formatting.
…ure/44764_perf_issue_new # Conflicts: # pandas/_libs/tslibs/period.pyx # pandas/io/formats/format.py # pandas/tests/scalar/test_nat.py
…ure/44764_perf_issue_new
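The commit messages above introduce `convert_strftime_format`, which translates a `strftime` pattern into a native Python formatting string and raises `UnsupportedStrFmtDirective` when it cannot. As a rough, purely illustrative sketch of that idea (the directive table and function below are assumptions, not the PR's actual code):

```python
# Illustrative sketch only: translate a strftime pattern into a str.format
# template so each element can be rendered with fast string interpolation
# instead of a strftime call. The directive subset here is hypothetical.
class UnsupportedStrFmtDirective(ValueError):
    """Raised when a directive has no str.format equivalent."""

_DIRECTIVE_MAP = {
    "%Y": "{year:04d}",
    "%m": "{month:02d}",
    "%d": "{day:02d}",
    "%H": "{hour:02d}",
    "%M": "{min:02d}",
    "%S": "{sec:02d}",
}

def convert_strftime_format_sketch(strftime_fmt: str) -> str:
    out = strftime_fmt
    for directive, replacement in _DIRECTIVE_MAP.items():
        out = out.replace(directive, replacement)
    if "%" in out:
        # A directive we do not know how to translate: caller must fall back.
        raise UnsupportedStrFmtDirective(strftime_fmt)
    return out

# convert_strftime_format_sketch("%Y-%m-%d %H:%M:%S")
# -> "{year:04d}-{month:02d}-{day:02d} {hour:02d}:{min:02d}:{sec:02d}"
```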
This PR is still pretty big. Any reason why you are introducing a new
…ired by `Period.fast_strftime` and `Timestamp.fast_strftime`
…ure/44764_perf_issue_new # Conflicts: # pandas/tests/frame/methods/test_to_csv.py
Sorry for the delay in answering those last comments @MarcoGorelli and @WillAyd.
Hopefully this will now be ok for both of you.
thanks for updating!
agree, but it's not part of the public-facing API - @WillAyd are you OK with having an argument to trigger the fast path just in internal functions?
Haven't done a full deep dive on the latest changes yet but at a high level seems reasonable, though maybe would opt for just

Would that alleviate your concern @MarcoGorelli?
sure, but there are cases when the

so, then, pandas could either:
I thought this was just managed internally - wouldn't we be able to just try one engine and, if it raises a NotImplementedError, fall back to the other? Hoping ideally we don't make it a user-configurable option to choose the engine but just internally decide on what's best
@WillAyd the current Pull Request is precisely doing what you suggest. Users do not see an additional parameter in the API. We try the fast Python engine and, if the strftime pattern is not supported, we switch to the C engine. I believe that it can be merged "as is" and provides massive speed improvements.

Now, in the future, there are inconsistencies to fix between the two engines. These inconsistencies already exist (even if we do not merge the current PR): indeed, in current pandas the Python string engine is already used for the default formats. This is why I propose to introduce a public API parameter named

@WillAyd I believe that we can merge the current PR first, and then address the other one with the engine parameter. Would you be OK with this? Please let's try to use the upcoming 2 weeks (maximum) to finalize this work started almost 3 years ago (#44764)
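To make the fallback described above concrete, a minimal sketch of the internal dispatch could look like the following (the helper `split_datetime_fields` and the exact control flow are assumptions, not the PR's code):

```python
# Sketch of the internal engine fallback described above (not the PR's code).
def format_datetimes(values, date_format):
    try:
        # Fast path: parse the strftime pattern once into a str.format template.
        template = convert_strftime_format(date_format)
    except UnsupportedStrFmtDirective:
        # Unsupported directive: fall back to the C/OS strftime for each element.
        return [val.strftime(date_format) for val in values]
    # split_datetime_fields is a hypothetical helper exposing year/month/... fields.
    return [template.format(**split_datetime_fields(val)) for val in values]
```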
hi @smarie - I think Will might be asking why
@WillAyd do you confirm that removing the

If you would like more changes to be done in this PR, do not hesitate to let me know now, so that I can tackle all changes in the same update. Indeed, multiple very small changes induce a large overhead on my side unfortunately (dev env, pre-commit checks, ci/cd, tidying everything, etc.). Thanks in advance for your understanding and help to get this feature out!
@WillAyd @MarcoGorelli, is there any way we could wrap this up? We are very, very close to success here - let me know how you want to proceed.
I admittedly have not had a ton of time to look at this lately, but my major hang-up is still the end API that we are marching towards and how that impacts the current implementation. The fact that the "python" implementation is faster than the "c" implementation is backwards from the rest of our API, and I'm still not sure we really even need to offer those options and maintain all of this.

The other consideration point is pyarrow - is this faster than that? From the descriptions provided, it seems like we are still heavily using Python strings, which I would think is a bottleneck. I expect it would be much simpler just to dispatch to pyarrow.compute and circumvent this altogether?
Thanks @WillAyd for this feedback!
Indeed, engine name proposals:
Do you have example PRs merged in the past where the pandas team managed to replace an existing table computation function with a pyarrow compute function? This could help me understand how you would like this to be done. My recommendation, still, would be to be very pragmatic here; we cannot do everything at once: a first step is to have an API to select engines, introducing a faster default engine with str.format. A second step will be to evaluate pyarrow strftime as an engine. I agree with you that this has the potential to replace the OS strftime, but I'm afraid of the time and effort that it could take, so I am reluctant to open Pandora's box again and enter an 8-month-long contribution and review period without guarantee of success...
To try to answer your question about expected performance with pyarrow: I started to look at Apache Arrow internals and I could not yet find evidence that the Arrow engine parses the string format specifier once upfront, into an efficient representation, to avoid re-parsing it for every cell of the TimestampArray. If you are more at ease with the Arrow codebase, please do not hesitate to guide me here. I found this: https://github.com/search?q=repo%3Aapache%2Farrow%20TimestampFormatter&type=code
@smarie if you are asking about where pyarrow implements this feature, it's in the C++ core: the formatter is created once up front on L1199 before being applied element-wise (see L1211-1216). In between there is also some heuristic for pre-allocating a string buffer (L1204-L1208), so my expectation is that its performance will be hard to beat
pandas/core/arrays/string_arrow.py uses pyarrow.compute rather heavily, if pyarrow is installed. I think doing something similar here would be preferable
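For reference, dispatching to Arrow's kernel could be as small as the sketch below. This is not part of this PR; it simply assumes pyarrow is installed and uses `pyarrow.compute.strftime` directly:

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.compute as pc

# Format a timestamp array with Arrow's C++ strftime kernel instead of
# calling strftime element by element in Python.
arr = pa.array(
    [datetime(2023, 1, 31, 12, 0, 0), datetime(2023, 2, 28, 8, 30, 0)],
    type=pa.timestamp("us"),
)
formatted = pc.strftime(arr, format="%Y-%m-%d %H:%M:%S")
print(formatted.to_pylist())  # ['2023-01-31 12:00:00', '2023-02-28 08:30:00']
```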
I understand and am really sympathetic to this. I don't think I've explicitly said this but THANK YOU for your time and effort on this. The problem, though, is that this works both ways. Larger PRs take a lot of time to write (as you've noticed) but also take a very long time to review (as you've seen from us), while posing a larger maintenance risk. Generally, I'd suggest getting alignment on the discussion and planning portion of a large, multi-file, API-changing initiative before diving in, as that makes the process more tenable for all parties involved
Thanks @WillAyd.
Agreed. Well, at first (as always? :) ) it seemed like a straightforward, simple change to make. But as I discovered in the past 2 years implementing this (PR reviews started with @jbrockmendel and @mroeschke in 2022), many intermediate issues were hiding, and smaller steps could be taken. I therefore first solved #46361, #46759, #47570, #46405, #53003, #51459.

Unfortunately, even now this PR is still large and impacts a lot of files. Surprisingly, this is not because the proposed engine is custom - indeed the proposed engine is just a single small file (pandas/_libs/tslibs/strftime.py). It is because pandas' implementation of datetime and period formatting is still quite layered, with a few design inconsistencies. It is still much better than a couple of months ago, as the team (@mroeschke I believe) simplified many useless alternate formatting functions.

Here is a global picture of files impacted by this PR. In addition, 8 test files, 3 asv benchmark files, and a couple of init/api/meson files are updated. As you can hopefully see if you use this picture to navigate the PR contents, adding the pyarrow engine would not change much of the complexity here - hence the PR would still be hard to review.

Based on this map, could we define a target implementation strategy together? I am ready to break this PR into bits if you believe that this would make it easier to review.

Note that the "soon" word is important in the above sentence. Indeed, I like pragmatism: if we can bring significant performance improvements to users now, it is probably better than waiting years for a future "big refactoring" (replacement of all pandas internals with pyarrow's).

Thanks again a lot for the time you dedicate to this.
Thanks @smarie - that diagram and thought process are great. I'm less familiar with this part of the code base than the people you have already pinged, but it seems like this issue is the main hindrance. Maybe answering that is the right approach forward - is there a particular design pattern we can apply that helps untangle that web?
To answer this question, let's first ask ourselves "which API is mandatory":
Now I see two things that we should improve to ease maintenance and future evolution:
So for (2) I would recommend moving all of
@WillAyd any thoughts or feedback yet about the above plan?
Those seem like good suggestions
This entire part seems to be in a tangle. So we have a mix of class methods and free-standing functions being responsible for the formatting? I'm not sure that going back to the class-method approach is the best; maybe there should just be a single class responsible for strftime formatting the various objects, to help consolidate the logic and follow a better separation of concerns. There's probably some history to the mixed design approaches that @MarcoGorelli and @jbrockmendel can offer guidance on
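To make the "single class responsible for strftime formatting" idea more concrete, here is one possible shape, purely as a sketch; the class and method names are hypothetical and not from the PR:

```python
# Hypothetical sketch of a single class consolidating strftime formatting.
class StrftimeFormatter:
    def __init__(self, date_format: str) -> None:
        self.date_format = date_format
        try:
            # Parse the pattern once, up front.
            self._template = convert_strftime_format(date_format)
        except UnsupportedStrFmtDirective:
            # Pattern not supported by the fast engine: use strftime per element.
            self._template = None

    def format(self, ts) -> str:
        if self._template is None:
            return ts.strftime(self.date_format)
        return self._template.format(
            year=ts.year, month=ts.month, day=ts.day,
            hour=ts.hour, min=ts.minute, sec=ts.second,
        )
```

A Timestamp-, Period-, or Timedelta-specific subclass (or a dispatch on the value type) could then supply the field extraction, keeping the engine-selection logic in one place.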
Thanks @WillAyd. Could someone (@MarcoGorelli, @jbrockmendel) confirm that I can open a (distinct) PR for 1.? Also, could any of you refine or confirm item 2.? I can definitely code any of the above proposals, but as advised by @WillAyd I would rather now make sure "in advance" that this has a chance to match your expectations so as to be merged smoothly. Thanks again!
sure, no objections from my side |
The formatter seems to be a wrapper around So I guess it's still called on each element.
```python
if tz_aware:
    self.data["dt"] = self.data["dt"].dt.tz_localize("UTC")
    self.data["d"] = self.data["d"].dt.tz_localize("UTC")

self.data["i"] = self.data["dt"]
self.data.set_index("i", inplace=True)
```
The parameter tz_aware is being used to toggle between timezone-aware and naive timestamps. However, the logic for handling timezone-aware data in the setup method is somewhat repetitive and could benefit from encapsulation in a utility function.
Extract the logic for timezone localization into a helper function to improve readability and maintainability.
For example:
```python
def localize_if_required(dataframe, tz_aware):
    if tz_aware:
        dataframe["dt"] = dataframe["dt"].dt.tz_localize("UTC")
        dataframe["d"] = dataframe["d"].dt.tz_localize("UTC")
```
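A call site inside the benchmark's setup method could then be reduced to a single line (hypothetical usage of the helper above):

```python
# Hypothetical call site in setup(); replaces the repeated tz_localize branches.
localize_if_required(self.data, tz_aware)
```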
Update the documentation to include an example of when the fallback to OS-level strftime occurs and its performance implications.
Thanks @auderson and @Parvezkhan0, this will provide some motivation fuel for taking a new stab at this :) most probably in the quieter period after the end of the year.
This PR is a new clean version of #46116
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.