[READY] perf improvements for strftime #51298

Open
wants to merge 154 commits into
base: main

smarie
Contributor

@smarie smarie commented Feb 10, 2023

This PR is a new clean version of #46116

Sylvain MARIE added 26 commits February 6, 2023 15:44
…nvert_strftime_format` (raises `UnsupportedStrFmtDirective`). This function converts a `strftime` date format string into a native python formatting string
…faster datetime formatting. `_format_native_types` modified with this new argument too. Subclasses modified to support it (`DatetimeArray`, `PeriodArray`, `TimedeltaArray`, `DatetimeIndex`)
… argument `fast_strftime` to use faster datetime formatting.
…nit__`: new boolean argument `fast_strftime` to use faster datetime formatting.
…ure/44764_perf_issue_new

Conflicts:
  pandas/_libs/tslibs/period.pyx
  pandas/io/formats/format.py
  pandas/tests/scalar/test_nat.py
@WillAyd
Member

WillAyd commented Feb 10, 2023

This PR is still pretty big. Any reason why you are introducing a new fast_strftime keyword instead of just trying to improve performance inplace? I think that would help to reduce the size, though still probably need to break up in smaller subsets. The bigger a PR is, the harder it is to review so ends up in a long review cycle

Sylvain MARIE added 3 commits February 14, 2023 09:50
…ired by `Period.fast_strftime` and `Timestamp.fast_strftime`
…ure/44764_perf_issue_new

Conflicts:
  pandas/tests/frame/methods/test_to_csv.py
@smarie
Contributor Author

smarie commented Jun 30, 2024

Sorry for the delay in answering those last comments @MarcoGorelli and @WillAyd.

  • I simplified the character-escaping procedures @MarcoGorelli ,
  • I removed the name "fast_strftime" everywhere, replacing it with "strftime_pystr" or "use_pystr_engine" depending on the location @WillAyd .

Hopefully this will now be ok for both of you.

@smarie smarie requested a review from MarcoGorelli June 30, 2024 21:10
@MarcoGorelli
Member

MarcoGorelli commented Jul 10, 2024

thanks for updating!

from an API perspective I really dislike fast_strftime

agree, but it's not part of the public-facing API - @WillAyd are you OK with having an argument to trigger the fast path just in internal functions?

@WillAyd
Member

WillAyd commented Jul 10, 2024

Haven't done a full deep dive on the latest changes yet but at a high level seems reasonable, though maybe would opt for just engine="python" | "c" since that is a precedent we have in other methods (granted, C is normally the faster option)

Would that alleviate your concern @MarcoGorelli

@MarcoGorelli
Member

sure, but there are cases when the 'c' engine isn't supported

so, then, pandas could either:

  • fallback to engine='python'
  • raise, telling the user to pass engine='python'
  • use engine='python' as default

@WillAyd
Member

WillAyd commented Jul 10, 2024

I thought this was just managed internally - wouldn't we be able to just try one engine and if it raises a NotImplementedError fallback to the other?

Hoping ideally we don't make it a user-configurable option to choose the engine but just internally decide on what's best

@smarie
Contributor Author

smarie commented Jul 10, 2024

I thought this was just managed internally - wouldn't we be able to just try one engine and if it raises a NotImplementedError fallback to the other?

@WillAyd the current Pull Request is precisely doing what you suggest. Users do not see an additional parameter in the API. We try the fast python engine and if the strftime pattern is not supported, we switch to the C engine. I believe that it can be merged "as is" and provides massive speed improvements.
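
The "try the fast python engine, then switch" behaviour described above can be pictured with a minimal sketch. It is not the PR's code: only the names `convert_strftime_format` and `UnsupportedStrFmtDirective` come from the commit messages, while the directive table, the `format_many` helper and the use of `datetime.strftime` as the fallback are assumptions made for illustration.

```python
from datetime import datetime


class UnsupportedStrFmtDirective(ValueError):
    """Raised when a directive has no str.format equivalent in this sketch."""


# Assumed subset of directives; the real table in the PR is not reproduced here.
_DIRECTIVES = {
    "Y": "{year}",
    "m": "{month:02d}",
    "d": "{day:02d}",
    "H": "{hour:02d}",
    "M": "{minute:02d}",
    "S": "{second:02d}",
}


def convert_strftime_format(fmt: str) -> str:
    """Translate a strftime pattern into a str.format template."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%":
            if i + 1 >= len(fmt):
                raise UnsupportedStrFmtDirective("dangling '%'")
            code = fmt[i + 1]
            if code == "%":
                out.append("%")  # '%%' is a literal percent sign
            elif code in _DIRECTIVES:
                out.append(_DIRECTIVES[code])
            else:
                raise UnsupportedStrFmtDirective(f"%{code}")
            i += 2
        else:
            # Double the braces so that str.format treats them literally.
            out.append({"{": "{{", "}": "}}"}.get(ch, ch))
            i += 1
    return "".join(out)


def format_many(values, fmt):
    """Hypothetical helper: fast path if the pattern converts, else fallback."""
    try:
        template = convert_strftime_format(fmt)  # parsed once, up front
    except UnsupportedStrFmtDirective:
        # Fallback: let the platform-backed strftime handle the pattern.
        return [v.strftime(fmt) for v in values]
    return [
        template.format(
            year=v.year, month=v.month, day=v.day,
            hour=v.hour, minute=v.minute, second=v.second,
        )
        for v in values
    ]
```

In this sketch a pattern such as `"%Y-%m-%d"` goes through the template, whereas `"%j"` is not in the assumed table and is therefore handled by `datetime.strftime`.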

Now in the future, there are inconsistencies to fix between the two engines. These inconsistencies already exist (even if we do not merge the current PR). Indeed in current pandas the python string engine is already used for the default formats. This is why I propose to introduce a public API parameter named 'engine'. Please see my proposal in #58179 under "feature description". I mention 3 parameter values: 'pystr' (equivalent of 'python'), 'os' (equivalent of 'c'), and 'auto' (default). The latter represents the fallback behaviour described above (where pystr is tried first, and os is used as a fallback)
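
To make the three proposed values concrete, here is a rough sketch of how an `engine` argument could dispatch. The function name `format_datetime`, the tiny template lookup and the error types are hypothetical; only the values 'pystr', 'os' and 'auto' come from the proposal in #58179.

```python
from datetime import datetime

# Hypothetical lookup of patterns the string-template engine knows about.
_PYSTR_TEMPLATES = {
    "%Y-%m-%d": "{0.year:04d}-{0.month:02d}-{0.day:02d}",
}


def _pystr_engine(value, fmt):
    """String-template engine: refuses any pattern it does not know."""
    try:
        template = _PYSTR_TEMPLATES[fmt]
    except KeyError:
        raise NotImplementedError(fmt) from None
    return template.format(value)


def _os_engine(value, fmt):
    """Platform-backed engine, here simply datetime.strftime."""
    return value.strftime(fmt)


def format_datetime(value, fmt, engine="auto"):
    if engine == "pystr":
        return _pystr_engine(value, fmt)
    if engine == "os":
        return _os_engine(value, fmt)
    if engine == "auto":
        # Try the template engine first, fall back to the OS engine.
        try:
            return _pystr_engine(value, fmt)
        except NotImplementedError:
            return _os_engine(value, fmt)
    raise ValueError(f"unknown engine: {engine!r}")
```

With `engine="pystr"` an unsupported pattern surfaces as an error instead of silently switching engines, which is the debugging use case an explicit parameter would enable.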

@WillAyd I believe that we can merge the current PR first, and then address the other one with the engine parameter. Would you be ok with this? Please let's try to use the upcoming 2 weeks (maximum) to finalize this work started almost 3 years ago (#44764)

@MarcoGorelli
Member

MarcoGorelli commented Jul 16, 2024

hi @smarie - I think Will might be asking why _use_pystr_engine is even an argument of format_array_from_datetime, if it's always set to True when calling it. I think it could just be removed in this PR, and format_array_from_datetime determines whether to use the fastpath or not?

@smarie
Contributor Author

smarie commented Jul 16, 2024

@WillAyd do you confirm that removing the _use_pystr_engine parameter from format_array_from_datetime would do the trick for you as mentioned by @MarcoGorelli ? I would prefer to be sure that this is the last and only change remaining for this PR. Once merged, I will switch to implementing the "engine" parameter mentioned in #58179

If you would like more changes to be done in this PR, do not hesitate to let me know now, so that I can tackle all changes in the same update. Indeed multiple very small changes induce a large overhead on my side unfortunately (dev env, pre-commit checks, ci/cd, tidying everything, etc.).

Thanks in advance for your understanding and help to get this feature out !

@smarie
Contributor Author

smarie commented Aug 29, 2024

@WillAyd @MarcoGorelli , is there any way we could wrap this up? We are very, very close to a success here - let me know how you want to proceed

@WillAyd
Member

WillAyd commented Sep 3, 2024

I admittedly have not had a ton of time to look at this lately, but my major hang-up is still the end API that we are marching towards and how that impacts the current implementation. The fact that the "python" implementation is faster than the "c" implementation is backwards from the rest of our API, and I'm still not sure we really even need to offer those options and maintain all of this.

The other consideration point is pyarrow - is this faster than that? From the descriptions provided, it seems like we are still heavily using Python strings, which I would think is a bottleneck. I expect it would be much simpler just to dispatch to pyarrow.compute and circumvent this altogether?

@smarie
Contributor Author

smarie commented Sep 3, 2024

Thanks @WillAyd for this feedback !

The fact that the "python" implementation is faster than the "c" implementation is backwards from the rest of our API

Indeed, the proposed engine names py_str_template (faster) vs. c_strftime (slower) are counterintuitive. I think that this API inconsistency can be solved simply by using better names:

  • for the faster version leveraging stdlib string templating: cpy_str_template (referring to the fact that the Python stdlib is actually implemented in C and optimized for speed), stdlib_str_template, stdlib_str_format, str_format or str.format. Note that any remaining ambiguity can easily be resolved by a clear docstring.
  • for the slower current version leveraging the OS-specific implementation of strftime, using os_strftime instead of c_strftime may make it easier to avoid implying any speed expectation. Here again a clear docstring will resolve any remaining ambiguity about why this is not the default engine.

The other consideration point is pyarrow - is this faster than that? From the descriptions provided, it seems like we are still heavily using Python strings, which I would think is a bottleneck. I expect it would be much simpler just to dispatch to pyarrow.compute and circumvent this altogether?

pyarrow is definitely an engine worth looking at, that can be added to the list of engines available in the future, and can even become a default one if we can validate that the implementation suits all of pandas community expectations. I see that this function is available : https://arrow.apache.org/docs/python/generated/pyarrow.compute.strftime.html . Still I don't expect a miracle here - strftime is an extremely complex function, it is OS-dependent AND locale-dependent, and we will surely find corner cases similar to the ones already found in pandas.
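
For reference, dispatching to the function linked above could look roughly like the sketch below. It assumes that pyarrow is an optional dependency and falls back to `datetime.strftime` when it is missing; the helper name `strftime_values` is hypothetical and this is not how pandas wires pyarrow in.

```python
from datetime import datetime

try:
    import pyarrow as pa
    import pyarrow.compute as pc
except ImportError:  # pyarrow is treated as optional in this sketch
    pa = pc = None


def strftime_values(values, fmt):
    """Format a list of naive datetimes, using pyarrow.compute when available."""
    if pc is not None:
        # pa.array infers a timestamp type from Python datetime objects.
        return pc.strftime(pa.array(values), format=fmt).to_pylist()
    # Fallback when pyarrow is not installed.
    return [v.strftime(fmt) for v in values]
```

Whether both paths agree for every directive is exactly the kind of corner case that would need checking before making such an engine the default.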

Do you have example PRs merged in the past where the pandas team managed to replace an existing table computation function with a pyarrow compute function? This could help me understand how you would like this to be done.

My recommendation would still be to be very pragmatic here, as we cannot do everything at once: a first step is to have an API to select engines, introducing a faster default engine with str.format. A second step will be to evaluate pyarrow strftime as an engine. I agree with you that this has the potential to replace the OS strftime, but I'm afraid of the time and effort that it could take, so I am reluctant to open Pandora's box again and enter an 8-month-long contribution and review period without guarantee of success...

@smarie
Contributor Author

smarie commented Sep 3, 2024

To try to answer your question about expected performance with pyarrow: I started to look at the Apache Arrow internals and I could not yet find evidence that the arrow engine parses the string format specifier once upfront, into an efficient representation, to avoid re-parsing it for every cell of the TimestampArray. If you are more at ease with the arrow codebase, please do not hesitate to guide me here. I found this: https://github.com/search?q=repo%3Aapache%2Farrow%20TimestampFormatter&type=code
but this does not lead me anywhere yet

@WillAyd
Member

WillAyd commented Sep 3, 2024

@smarie if you are asking about where pyarrow implements this feature, it's in the C++ core:

https://github.com/apache/arrow/blob/c455d6b8c4ae2cb22baceb4c27e1325b973d39e1/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L1196

The formatter is created once up front on L1199 before being applied element wise (see L1211-1216). In between there is also some heuristic for pre-allocating a string buffer (L1204-L1208), so my expectation is that its performance will be hard to beat

Do you have example PRs merged in the past where the pandas team managed to replace an existing table computation function with a pyarrow compute function

pandas/core/arrays/string_arrow.py uses pyarrow.compute rather heavily, if pyarrow is installed. I think doing something similar here would be preferable

I'm afraid of the time and effort that it could take, so I am reluctant to open Pandora's box again and enter an 8-month-long contribution and review period without guarantee of success...

I understand and am really sympathetic to this. I don't think I've explicitly said this but THANK YOU for your time and effort on this.

The problem is though, that this works both ways. Larger PRs take a lot of time to write (as you've noticed) but also take a very long time to review (as you've seen from us), while posing a larger maintenance risk. Generally, I'd suggest getting alignment on the discussion and planning portion of a large, multi-file, API-changing initiative before diving in, as that makes the process more tenable for all parties involved

@smarie
Contributor Author

smarie commented Sep 4, 2024

Thanks @WillAyd .

I'd suggest getting alignment on the discussion and planning portion of a large, multi-file, API-changing initiative before diving in, as that makes the process more tenable for all parties involved

Agreed. Well, at first (as always ? :) ) it seemed like a straightforward, simple change to do. But as I discovered in the past 2 years implementing this (PR reviews started with @jbrockmendel and @mroeschke in 2022), many intermediate issues were hiding / smaller steps could be made. I therefore solved first #46361, #46759, #47570, #46405, #53003, #51459 .
In the meantime arrow became more popular and more integrated in pandas

Unfortunately even now, this PR is still large and impacts a lot of files. Surprisingly, this is not due to the fact that the proposed engine is custom - indeed the proposed engine is just a single small file (pandas/_libs/tslibs/strftime.py). It is due to the fact that pandas implementation of datetime and period formatting is still quite layered and with a few design inconsistencies. It is still much better than a couple months ago, as the team (@mroeschke I believe) simplified many useless alternate formatting functions.

Here is a global picture of files impacted by this PR

image

In addition, 8 test files, 3 asv benchmark files, and a couple init/api/meson files are updated.

As you can hopefully see if you use this picture to navigate the PR contents, adding the pyarrow engine would not change much of the complexity here - hence the PR would still be hard to review.

Based on this map, could we define together a target implementation strategy ? I am ready to break this PR into bits if you believe that this would make it easier to review.
Alternately if you think that pandas' timestamp and period objects will soon be replaced with their arrow equivalent (and therefore all the above structures will be removed from pandas), then indeed maybe there is no need to merge anything apart from the ASV benchmarks, some of the tests, and maybe the CSVFormatter fix :)

Note that the "soon" word is important in the above sentence. Indeed I like pragmatism: if we can bring significant performance improvements to users now, it is probably better than waiting years expecting a future "big refactoring" (replacement of all pandas internals with pyarrow's).

Thanks again a lot for the time you dedicate to this

@WillAyd
Member

WillAyd commented Sep 4, 2024

It is due to the fact that pandas implementation of datetime and period formatting is still quite layered and with a few design inconsistencies.

Thanks @smarie - that diagram and thought process is great. I'm less familiar with this part of the code base than the people you have already pinged, but it seems like this issue is the main hindrance

Maybe answering that is the right approach forward - is there a particular design pattern we can apply that helps untangle that web?

@smarie
Contributor Author

smarie commented Sep 5, 2024

Maybe answering that is the right approach forward - is there a particular design pattern we can apply that helps untangle that web?

To answer this question let's first ask ourselves "which API is mandatory":

  • We need scalar operations Timestamp.strftime and Period.strftime and they are there already - nothing to change there except introducing an engine parameter when people will want to force use a specific engine for a scalar (for debugging purposes or to get some engine-specific implementation of some strftime directive, as they are not all standard as you know)
  • We need array operations on classes DatetimeArray, PeriodArray, DatetimeIndex, PeriodIndex. strftime is everywhere provided by DatetimeLikeArrayMixin.strftime, which delegates to cls._format_native_types. Two particular entry points exist : DatetimeArray._format_native_types and PeriodArray._format_native_types. Here again everything seems fine to me, apart from adding the engine parameter

Now I see two things that we should improve to ease maintenance and evolution:

  1. can we get rid of get_format_datetime64, _format_datetime64, and _format_datetime64_dateonly ? That would remove the bottom part of the picture in the previous post. They seem to be used only in DatetimeIndex._formatter_func and in _Datetime64TZFormatter._format_strings. We could probably have DatetimeIndex._formatter_func leverage DatetimeIndex.strftime instead, and I'm pretty sure that _Datetime64TZFormatter does not need to override the _format_strings method inherited from _Datetime64Formatter anymore so we could just drop this override.

  2. then, we could try to harmonize period_array_strftime and format_array_from_datetime. It seems that the design for periods is slightly better today than the one for timestamps:

    • for periods, everything is in the same file _libs/tslibs/period.pyx, and the core formatter used today for both arrays and scalars is a function cdef period_format that leverages libc.time.strftime (OS-dependent). Neither the Period class nor the period_array_strftime function contains any "shortcut" to perform formatting itself; they both nicely delegate to cdef period_format.
    • while for timestamps there is one entry point for scalars in _libs/tslibs/timestamps.pyx, Timestamp.strftime, which delegates to datetime.strftime (stdlib). The latter is, at least in the CPython implementation, using libc - but it is distribution-dependent. In parallel, as opposed to what is done for periods, arrays are handled in _libs/tslib.pyx by format_array_from_datetime, which performs some formatting on its own in some cases, and in other cases delegates to Timestamp.strftime.

So for (2) I would recommend to move all of _libs/tslib.pyx's content to _libs/tslibs/timestamps.pyx or to a dedicated _libs/tslibs/datetimes.pyx, and have the array (format_array_from_datetime) and scalar (Timestamp.strftime) methods both leverage the same (new) function cdef timestamp_format in there. Also, renaming format_array_from_datetime into datetime_array_strftime for consistency
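
As a rough pure-Python picture of this layout (the real functions would live in Cython), both entry points below delegate to one core function. `timestamp_format` and `datetime_array_strftime` are the names proposed above; `scalar_strftime`, the `na_rep` argument and the use of `None` for missing values are assumptions made for this sketch.

```python
from datetime import datetime


def timestamp_format(value, fmt):
    """Single place where one value is formatted.

    Stand-in for the proposed `cdef timestamp_format`; an engine choice
    would be made here and nowhere else.
    """
    return value.strftime(fmt)


def scalar_strftime(value, fmt):
    """Scalar entry point (the role played by Timestamp.strftime)."""
    return timestamp_format(value, fmt)


def datetime_array_strftime(values, fmt, na_rep="NaT"):
    """Array entry point: no formatting shortcut of its own, only delegation."""
    return [na_rep if v is None else timestamp_format(v, fmt) for v in values]
```

Because neither wrapper formats anything itself, the scalar and array results cannot drift apart.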

@smarie
Contributor Author

smarie commented Sep 23, 2024

@WillAyd any thoughts or feedback yet about the above plan ?

@WillAyd
Member

WillAyd commented Sep 23, 2024

  1. can we get rid of get_format_datetime64, _format_datetime64, and _format_datetime64_dateonly ? That would remove the bottom part of the picture in the previous post. They seem to be used only in DatetimeIndex._formatter_func and in _Datetime64TZFormatter._format_strings. We could probably have DatetimeIndex._formatter_func leverage DatetimeIndex.strftime instead, and I'm pretty sure that _Datetime64TZFormatter does not need to override the _format_strings method inherited from _Datetime64Formatter anymore so we could just drop this override.

Those seem like good suggestions

2. then, we could try to harmonize period_array_strftime and format_array_from_datetime. It seems that the design for periods is slightly better today than the one for timestamps:

This entire part seems to be in a tangle. So we have a mix of class methods and free-standing functions being responsible for the formatting? I'm not sure that going back to the class-method approach is the best; maybe there should just be a single class responsible for Strftime formatting the various objects to help consolidate the logic and follow a better separation of concerns

There's probably some history to the mixed design approaches that @MarcoGorelli and @jbrockmendel can offer guidance on

@smarie
Contributor Author

smarie commented Sep 24, 2024

Thanks @WillAyd . Could someone (@MarcoGorelli, @jbrockmendel ) confirm that I can open a (distinct) PR for 1. ?

Also could any of you refine or confirm item 2. ? I can definitely code any of the above proposals, but as advised by @WillAyd I would rather now make sure "in advance" that this has chances to match your expectations so as to be merged smoothly

thanks again !

@MarcoGorelli
Member

Could someone (@MarcoGorelli, @jbrockmendel ) confirm that I can open a (distinct) PR for 1. ?

sure, no objections from my side

@auderson
Contributor

auderson commented Nov 6, 2024

https://github.com/apache/arrow/blob/c455d6b8c4ae2cb22baceb4c27e1325b973d39e1/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L1196

The formatter is created once up front on L1199 before being applied element wise (see L1211-1216). In between there is also some heuristic for pre-allocating a string buffer (L1204-L1208), so my expectation is that its performance will be hard to beat

The formatter seems to be a wrapper around arrow_vendored::date::to_stream:
https://github.com/apache/arrow/blob/c3601a97a0718ae47726e6c134cbed4b98bd1a36/cpp/src/arrow/compute/kernels/temporal_internal.h#L146C7-L146C38

So I guess it's still called on each element.
Here's a comparison, done the same way as in #44764

image

Comment on lines +22 to +27
if tz_aware:
    self.data["dt"] = self.data["dt"].dt.tz_localize("UTC")
    self.data["d"] = self.data["d"].dt.tz_localize("UTC")

self.data["i"] = self.data["dt"]
self.data.set_index("i", inplace=True)


The parameter tz_aware is being used to toggle between timezone-aware and naive timestamps. However, the logic for handling timezone-aware data in the setup method is somewhat repetitive and could benefit from encapsulation in a utility function.

Extract the logic for timezone localization into a helper function to improve readability and maintainability.
For example:
```python
def localize_if_required(dataframe, tz_aware):
    # Localize both datetime columns to UTC only when tz-aware data is requested.
    if tz_aware:
        dataframe["dt"] = dataframe["dt"].dt.tz_localize("UTC")
        dataframe["d"] = dataframe["d"].dt.tz_localize("UTC")
```


@Parvezkhan0 Parvezkhan0 left a comment


Update the documentation to include an example of when the fallback to OS-level strftime occurs and its performance implications.

@smarie
Contributor Author

smarie commented Nov 23, 2024

Thanks @auderson and @Parvezkhan0 , this will provide some motivation fuel for taking a new stab at this :) most probably in the quieter period after end of year

Labels
Datetime (Datetime data dtype), Performance (Memory or execution speed performance), Stale
Development

Successfully merging this pull request may close these issues.

PERF: strftime is slow
7 participants