
ENH: Make performance warnings opt-in and more noisy #55385

Open
rhshadrach opened this issue Oct 3, 2023 · 16 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance Warnings Warnings that appear or should be added to pandas

Comments

@rhshadrach
Member

rhshadrach commented Oct 3, 2023

At times, I think our performance warnings can be too noisy. For example, if I'm doing an ad hoc analysis on a small data set, I don't care about performance. Other times, I purposefully have a MultiIndex with a particular order because it's important to the analysis I'm doing, and working with the data produces lexsort warnings.

One idea is to make performance warnings opt-in (via an option) and more noisy. The idea here would be that a user has written a piece of code they're happy with, and then they can enable performance warnings to see if pandas thinks they're being ill-performant. Doing this, we can start emitting more warnings to try to nudge users to more performant options.

Some cases where we could warn:

cc @Dr-Irv @phofl @jbrockmendel @jorisvandenbossche

@rhshadrach rhshadrach added Enhancement Performance Memory or execution speed performance Needs Discussion Requires discussion from core team before further action Warnings Warnings that appear or should be added to pandas labels Oct 3, 2023
@rhshadrach rhshadrach changed the title API: Make performance warnings opt-in and more noisy ENH: Make performance warnings opt-in and more noisy Oct 3, 2023
@Dr-Irv
Contributor

Dr-Irv commented Oct 4, 2023

Nice idea. Would we have separate options for each type of warning?

@rhshadrach
Member Author

Would we have separate options for each type of warning?

What do you mean by each type here? I'm proposing to do this only for PerformanceWarning.

@attack68
Contributor

attack68 commented Oct 8, 2023

I think this is applicable to other sorts of warnings too, to be honest.

In my own application I have tests that fail on warnings and do not want to pass warnings through to users for backend code. So I end up implementing it both ways, for example:

from packaging import version  # third-party; assumed imported at module top
import pandas as pd

if version.parse(pd.__version__) >= version.parse("2.1.0"):
    # applymap is deprecated (and warns) from 2.1.0 onward; use map instead
    # TODO (low): clean this up when setting a minimum pandas version at 2.1.0
    df[df["dtype"] == "bool"] = df[df["dtype"] == "bool"].map(_map_true_false)
else:
    # DataFrame.map does not exist before 2.1.0
    df[df["dtype"] == "bool"] = df[df["dtype"] == "bool"].applymap(_map_true_false)
return df

Probably there are better ways to handle this, but sometimes the most obvious patches are the most efficient to implement (time-wise).
I would prefer to be made aware of the warnings in my backend code but to turn them off in production, without having to implement some kind of intermediate warnings catcher.
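A lighter-weight alternative to version-string comparison (just a sketch, not an endorsed pattern) is feature detection, since DataFrame.map only exists in the releases where applymap is deprecated:

```python
import pandas as pd

# DataFrame.map was added in pandas 2.1.0, the same release that deprecated
# applymap, so probing for the attribute picks the right spelling without
# parsing version strings.
_elementwise = getattr(pd.DataFrame, "map", pd.DataFrame.applymap)

df = pd.DataFrame({"flag": [True, False]})
out = _elementwise(df, lambda v: "yes" if v else "no")
```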

@rhshadrach
Member Author

rhshadrach commented Oct 8, 2023

I think this applicable to other sorts of warning too, to be honest.

I think I'd be against enabling users to disable FutureWarnings via an option. That sounds like a footgun.

@Dr-Irv
Contributor

Dr-Irv commented Oct 9, 2023

Would we have separate options for each type of warning?

What do you mean by each type here? I'm proposing to do this only for PerformanceWarning.

Consider the 3 examples that you list above. You could have a separate option for turning on/off the PerformanceWarning for each of those examples.

@rhshadrach
Member Author

rhshadrach commented Oct 9, 2023

It sounds like you have a use case in mind that is at odds with what I'm proposing here, but I'm not sure what that use case is. To be sure, what I'm proposing is that users don't typically run with PerformanceWarning enabled, both during development of new code and in repeated runs of existing code. Rather, a user has a piece of code they want to make more performant; they enable PerformanceWarnings to see where pandas makes suggestions, modify the code, and then disable PerformanceWarnings when running it in the future.

I'm curious to learn more about the use case you have in mind where (I think) every different warning has an option to control it.

@Dr-Irv
Contributor

Dr-Irv commented Oct 9, 2023

I'm curious to learn more about the use case you have in mind where (I think) every different warning has an option to control it.

Let's say I only want to find where I can remove .copy(). So I don't care about other performance warnings. I just want to know where I can get rid of those extra defensive .copy() calls.

@rhshadrach
Member Author

I think users could accomplish this by grepping the output. But for both maintainers and users in the scenario I proposed, it seems like a headache to have a plethora of different options.

@Dr-Irv
Contributor

Dr-Irv commented Oct 9, 2023

I think users could accomplish this by grepping the output. But for both maintainers and users in the scenario I proposed, it seems like a headache to have a plethora of different options.

That makes sense.

@jorisvandenbossche
Member

To get an idea of the current warnings we might be talking about, here is a list of PerformanceWarnings I found with a quick search:

  • "Adding/subtracting object-dtype array to [DatetimeLikeArray] not vectorized." (note: I think this is typically for DateOffsets)
  • "Non-vectorized DateOffset being applied to Series or DatetimeIndex." (similar as above)
  • "Falling back on a non-pyarrow code path which may decrease performance." (for some corner case features in our string methods)
  • For SparseArray:
    • "searchsorted requires high memory usage."
    • "Concatenating sparse arrays with multiple fill values"
  • MultiIndex:
    • "dropping on a non-lexsorted multi-index without a level parameter may impact performance."
    • "indexing past lexsort depth may impact performance."
  • "DataFrame is highly fragmented. ..."
  • unstack: "The following operation may generate {num_cells} cells in the resulting pandas object."
  • align: "Alignment difference on axis {axis} is larger than an order of magnitude on term {repr(terms[i].name)}, by more than {ordm:.4g}; performance may suffer." (never seen this one ;))

I think I am +1 on the proposal. In any case it would be nice to have an easier way to turn them off.
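As a concrete example, the fragmentation warning from this list can be reproduced by growing a frame column by column (a sketch; the internal block-count threshold that triggers the warning is an implementation detail and may change between versions):

```python
import warnings

import numpy as np
import pandas as pd

# Inserting columns one at a time creates one internal block per column;
# past an internal threshold, recent pandas versions emit the
# "DataFrame is highly fragmented" PerformanceWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", pd.errors.PerformanceWarning)
    df = pd.DataFrame(index=range(8))
    for i in range(150):
        df[f"c{i}"] = np.arange(8)

messages = [str(w.message) for w in caught]
```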

@arnaudlegout
Contributor

arnaudlegout commented Nov 14, 2023

I would also add all pandas methods that have a pure Python implementation that is known to be slow. Even if there is no alternative in pandas, there might be an optimized implementation in another library. Getting this warning would at least give a hint as to where my code could be slow.

Here are other suggestions. I do not claim they must lead to a PerformanceWarning, but they are practical performance issues in my code that I spent time finding and fixing. Having a suggestion (via a PerformanceWarning) would have saved me a lot of time.

  • np.isin is faster than Series.isin
  • df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()
  • array[row, col] (array being a NumPy array) is much faster (~250x) than df.at[row, col] (assuming you can convert your df to a NumPy array, i.e. all columns share the same NumPy dtype)
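The last point can be sanity-checked without timing; a homogeneous-dtype frame exposes its data as a single NumPy array, and the two lookups agree (a sketch, not a benchmark):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

# With one shared dtype, to_numpy() can avoid a copy, and plain NumPy
# indexing bypasses pandas' label-based lookup machinery.
arr = df.to_numpy()
same_value = arr[2, 1] == df.at[2, "b"]
```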

@phofl
Member

phofl commented Nov 14, 2023

@arnaudlegout These examples aren't black and white; the array calls will be slower depending on what dtypes you have. For example, the last one would trigger a copy if your DataFrame has multiple dtypes and you convert to a NumPy array.
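That caveat is easy to demonstrate: with mixed dtypes, to_numpy() must upcast everything to a single object dtype, which means materializing a new array rather than reusing the existing block values (a sketch):

```python
import pandas as pd

homogeneous = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Mixed columns are upcast to one object-dtype array, so the "convert once,
# index with NumPy" trick loses its zero-copy appeal on heterogeneous frames.
homo_dtype = homogeneous.to_numpy().dtype   # integer dtype, no upcast
mixed_dtype = mixed.to_numpy().dtype        # object dtype, upcast + copy
```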

@arnaudlegout
Contributor

@phofl right. I did not intend to say they are clear perf-warning candidates. When you implement a method, you know certain code paths are suboptimal; in those cases it could be nice to add a PerformanceWarning to make clear to the user that they are on a slow code path.

As another perf warning candidate, there is a groupby on a non-sorted index vs. a sorted index.

@jorisvandenbossche
Member

  • df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()

@arnaudlegout if you have a specific example that shows this, I think it would be worth opening an issue for it. To me, the purpose of nunique() is to be more efficient (in addition to more convenient to use). If that is not the case, IMO we shouldn't warn about it but rather fix it ;) (for example by simply using .unique().apply(len) under the hood)

@arnaudlegout
Contributor

  • df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()

@arnaudlegout if you have a specific example that shows this, I think it would be worth opening an issue for it. To me, the purpose of nunique() is to be more efficient (in addition to more convenient to use). If that is not the case, IMO we shouldn't warn about it but rather fix it ;) (for example by simply using .unique().apply(len) under the hood)

I opened issue #55972
In my reproducible example, unique().apply(len) is 3x faster than nunique() on a Series.groupby.
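Whatever the speed difference, the two spellings should agree; a small equivalence check on random data (sizes here are arbitrary, not taken from the linked issue):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.integers(0, 50, size=10_000))
keys = rng.integers(0, 100, size=10_000)

# Both count distinct values per group; only the implementation differs.
via_nunique = values.groupby(keys).nunique()
via_unique = values.groupby(keys).unique().apply(len)
```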

@bionicles

bionicles commented Jan 15, 2024

Just throwing it out there: detecting when code is running under tests, e.g.:

if "PYTEST_CURRENT_TEST" in os.environ:

Would be a wonderful time to turn on these warnings.
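A sketch of that toggle (PYTEST_CURRENT_TEST is set by pytest for the duration of each test; the escalate-to-error policy here is a hypothetical choice, not anything pandas provides):

```python
import os
import warnings

import pandas as pd

def enable_strict_perf_warnings() -> bool:
    """Escalate PerformanceWarning to an error when running under pytest.

    pytest sets PYTEST_CURRENT_TEST for each test; in production the
    variable is absent and warning filters stay at their defaults.
    """
    if "PYTEST_CURRENT_TEST" in os.environ:
        warnings.simplefilter("error", pd.errors.PerformanceWarning)
        return True
    return False
```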

Second, the regex capturing-groups warning is itself costly: pandas compiles the pattern on every call to .str.contains just to warn about unused capturing groups. I use named groups to organize complicated regex spaghetti, so this warning just gets ignored. Compiling a pattern every time only to throw it away is a waste of time.

IMHO it could be completely removed. The issue is here: #56798, and my PR to delete it didn't get merged, but it's an easy fix.
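The warning in question fires when a str.contains pattern has capturing groups; rewriting them as non-capturing groups gives the same matches without the warning (sketch):

```python
import warnings

import pandas as pd

s = pd.Series(["cat", "dog", "bird"])

# Capturing groups in str.contains trigger a UserWarning about unused
# match groups; (?:...) is the warning-free equivalent.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    with_groups = s.str.contains(r"(cat|dog)")
no_groups = s.str.contains(r"(?:cat|dog)")
```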
