
ENH: Make performance warnings opt-in and more noisy #55385

Open
rhshadrach opened this issue Oct 3, 2023 · 16 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance Warnings Warnings that appear or should be added to pandas

Comments

@rhshadrach
Member

rhshadrach commented Oct 3, 2023

At times, I think our performance warnings can be too noisy. For example, if I'm doing an ad hoc analysis on a small data set, I don't care about performance. Other times, I purposefully have a MultiIndex with a particular order because it's important to the analysis I'm doing, and working with the data produces lexsort warnings.

One idea is to make performance warnings opt-in (via an option) and more noisy. The idea here would be that a user has written a piece of code they're happy with, and then they can enable performance warnings to see if pandas thinks they're being ill-performant. Doing this, we can start emitting more warnings to try to nudge users to more performant options.

Some cases where we could warn:

cc @Dr-Irv @phofl @jbrockmendel @jorisvandenbossche

@rhshadrach rhshadrach added Enhancement Performance Memory or execution speed performance Needs Discussion Requires discussion from core team before further action Warnings Warnings that appear or should be added to pandas labels Oct 3, 2023
@rhshadrach rhshadrach changed the title API: Make performance warnings opt-in and more noisy ENH: Make performance warnings opt-in and more noisy Oct 3, 2023
@Dr-Irv
Contributor

Dr-Irv commented Oct 4, 2023

Nice idea. Would we have separate options for each type of warning?

@rhshadrach
Member Author

Would we have separate options for each type of warning?

What do you mean by each type here? I'm proposing to do this only for PerformanceWarning.

@attack68
Contributor

attack68 commented Oct 8, 2023

I think this is applicable to other sorts of warnings too, to be honest.

In my own application I have tests that fail on warnings and do not want to pass warnings through to users for backend code. So I end up implementing it both ways, for example:

from packaging import version  # third-party; assumed imported at module top
import pandas as pd

if version.parse(pd.__version__) >= version.parse("2.1.0"):
    # applymap is deprecated (and warns) from 2.1.0 onward; use map instead
    # TODO (low): clean this up when setting a minimum pandas version at 2.1.0
    df[df["dtype"] == "bool"] = df[df["dtype"] == "bool"].map(_map_true_false)
else:
    # DataFrame.map does not exist before 2.1.0
    df[df["dtype"] == "bool"] = df[df["dtype"] == "bool"].applymap(_map_true_false)
return df

Probably there are better ways to handle this, but sometimes the most obvious patches are the most efficient to implement (time-wise).
I would prefer to be made aware of the warnings in my backend code but to turn them off in production, without having to implement some kind of intermediate warnings catcher.
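A lighter-weight alternative to version-string comparison (just a sketch, not an endorsed pattern) is feature detection, since DataFrame.map only exists in the releases where applymap is deprecated:

```python
import pandas as pd

# DataFrame.map was added in pandas 2.1.0, the same release that deprecated
# applymap, so probing for the attribute picks the right spelling without
# parsing version strings.
_elementwise = getattr(pd.DataFrame, "map", pd.DataFrame.applymap)

df = pd.DataFrame({"flag": [True, False]})
out = _elementwise(df, lambda v: "yes" if v else "no")
```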

@rhshadrach
Member Author

rhshadrach commented Oct 8, 2023

I think this applicable to other sorts of warning too, to be honest.

I think I'd be against enabling users to disable FutureWarnings via an option. That sounds like a footgun.

@Dr-Irv
Contributor

Dr-Irv commented Oct 9, 2023

Would we have separate options for each type of warning?

What do you mean by each type here? I'm proposing to do this only for PerformanceWarning.

Consider the 3 examples that you list above. You could have a separate option for turning on/off the PerformanceWarning for each of those examples.

@rhshadrach
Member Author

rhshadrach commented Oct 9, 2023

It sounds like you have a use case in mind that is at odds with what I'm proposing here, but I'm not sure what that use case is. To be sure, what I'm proposing is that users don't typically run with PerformanceWarning enabled, both during development of new code and in repeated runs of existing code. Rather, a user has a piece of code they want to make more performant; they enable PerformanceWarnings to see where pandas makes suggestions, modify the code, and then disable PerformanceWarnings when running it in the future.

I'm curious to learn more about the use case you have in mind where (I think) every different warning has an option to control it.

@Dr-Irv
Contributor

Dr-Irv commented Oct 9, 2023

I'm curious to learn more about the use case you have in mind where (I think) every different warning has an option to control it.

Let's say I only want to find where I can remove .copy(). So I don't care about other performance warnings. I just want to know where I can get rid of those extra defensive .copy() calls.

@rhshadrach
Member Author

I think users could accomplish this by grepping the output. But for both maintainers and users in the scenario I proposed, it seems like a headache to have a plethora of different options.

@Dr-Irv
Contributor

Dr-Irv commented Oct 9, 2023

I think users could accomplish this by grepping the output. But for both maintainers and users in the scenario I proposed, it seems like a headache to have a plethora of different options.

That makes sense.

@jorisvandenbossche
Member

To get an idea of the current warnings we might be talking about, here is a list of PerformanceWarnings I found with a quick search:

  • "Adding/subtracting object-dtype array to [DatetimeLikeArray] not vectorized." (note: I think this is typically for DateOffsets)
  • "Non-vectorized DateOffset being applied to Series or DatetimeIndex." (similar as above)
  • "Falling back on a non-pyarrow code path which may decrease performance." (for some corner case features in our string methods)
  • For SparseArray:
    • "searchsorted requires high memory usage."
    • "Concatenating sparse arrays with multiple fill values"
  • MultiIndex:
    • "dropping on a non-lexsorted multi-index without a level parameter may impact performance."
    • "indexing past lexsort depth may impact performance."
  • "DataFrame is highly fragmented. ..."
  • unstack: "The following operation may generate {num_cells} cells in the resulting pandas object."
  • align: "Alignment difference on axis {axis} is larger than an order of magnitude on term {repr(terms[i].name)}, by more than {ordm:.4g}; performance may suffer." (never seen this one ;))

I think I am +1 on the proposal. In any case it would be nice to have an easier way to turn them off.
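As a concrete example, the fragmentation warning from this list can be reproduced by growing a frame column by column (a sketch; the internal block-count threshold that triggers the warning is an implementation detail and may change between versions):

```python
import warnings

import numpy as np
import pandas as pd

# Inserting columns one at a time creates one internal block per column;
# past an internal threshold, recent pandas versions emit the
# "DataFrame is highly fragmented" PerformanceWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", pd.errors.PerformanceWarning)
    df = pd.DataFrame(index=range(8))
    for i in range(150):
        df[f"c{i}"] = np.arange(8)

messages = [str(w.message) for w in caught]
```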

@arnaudlegout
Contributor

arnaudlegout commented Nov 14, 2023

I would also add all pandas methods that have a pure Python implementation that is known to be slow. Even if there is no alternative in pandas, there might be an optimized implementation in another library. Getting this warning would at least give a hint as to where my code could be slow.

Here are other suggestions. I do not claim they must lead to a PerformanceWarning, but they are practical performance issues in my code that I spent time finding and fixing. Having a suggestion (via a PerformanceWarning) would have saved me a lot of time.

  • np.isin is faster than Series.isin
  • df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()
  • array[row, col] (array being a NumPy array) is much faster (~250x) than df.at[row, col] (assuming you can convert your df to a NumPy array, i.e. all columns share the same NumPy dtype)
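The last point can be sanity-checked without timing; a homogeneous-dtype frame exposes its data as a single NumPy array, and the two lookups agree (a sketch, not a benchmark):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

# With one shared dtype, to_numpy() can avoid a copy, and plain NumPy
# indexing bypasses pandas' label-based lookup machinery.
arr = df.to_numpy()
same_value = arr[2, 1] == df.at[2, "b"]
```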

@phofl
Member

phofl commented Nov 14, 2023

@arnaudlegout These examples aren't black and white; the array calls will be slower depending on what dtypes you have. For example, the last one would trigger a copy if your DataFrame has multiple dtypes and you convert to a NumPy array.
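That caveat is easy to demonstrate: with mixed dtypes, to_numpy() must upcast everything to a single object dtype, which means materializing a new array rather than reusing the existing block values (a sketch):

```python
import pandas as pd

homogeneous = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Mixed columns are upcast to one object-dtype array, so the "convert once,
# index with NumPy" trick loses its zero-copy appeal on heterogeneous frames.
homo_dtype = homogeneous.to_numpy().dtype   # integer dtype, no upcast
mixed_dtype = mixed.to_numpy().dtype        # object dtype, upcast + copy
```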

@arnaudlegout
Contributor

@phofl right. I did not intend to say they are clear perf-warning candidates. When you implement a method, you know certain code paths are suboptimal; in those cases it could be nice to add a PerformanceWarning to make clear to the user that they are on a slow code path.

As another perf warning candidate, there is a groupby on a non-sorted index vs. a sorted index.

@jorisvandenbossche
Member

  • df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()

@arnaudlegout if you have a specific example that shows this, I think it would be worth opening an issue for it. To me, the purpose of nunique() is to be more efficient (in addition to more convenient to use). If that is not the case, IMO we shouldn't warn about it but rather fix it ;) (for example by simply using .unique().apply(len) under the hood)

@arnaudlegout
Contributor

  • df.groupby(group_col)[col].unique().apply(len) is faster than df.groupby(group_col)[col].nunique()

@arnaudlegout if you have a specific example that shows this, I think it would be worth opening an issue for it. To me, the purpose of nunique() is to be more efficient (in addition to more convenient to use). If that is not the case, IMO we shouldn't warn about it but rather fix it ;) (for example by simply using .unique().apply(len) under the hood)

I opened issue #55972
In my reproducible example, unique().apply(len) is 3x faster than nunique() on a Series.groupby.
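Whatever the speed difference, the two spellings should agree; a small equivalence check on random data (sizes here are arbitrary, not taken from the linked issue):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.integers(0, 50, size=10_000))
keys = rng.integers(0, 100, size=10_000)

# Both count distinct values per group; only the implementation differs.
via_nunique = values.groupby(keys).nunique()
via_unique = values.groupby(keys).unique().apply(len)
```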

@bionicles

bionicles commented Jan 15, 2024

Just throwing it out there: detecting when code is running under tests, e.g.:

if "PYTEST_CURRENT_TEST" in os.environ:

Would be a wonderful time to turn on these warnings.
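A sketch of that toggle (PYTEST_CURRENT_TEST is set by pytest for the duration of each test; the escalate-to-error policy here is a hypothetical choice, not anything pandas provides):

```python
import os
import warnings

import pandas as pd

def enable_strict_perf_warnings() -> bool:
    """Escalate PerformanceWarning to an error when running under pytest.

    pytest sets PYTEST_CURRENT_TEST for each test; in production the
    variable is absent and warning filters stay at their defaults.
    """
    if "PYTEST_CURRENT_TEST" in os.environ:
        warnings.simplefilter("error", pd.errors.PerformanceWarning)
        return True
    return False
```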

Second, the regex capturing-groups warning is itself costly: pandas compiles the pattern on every call to .str.contains just to warn about unused capturing groups. I use named groups to organize complicated regex spaghetti, so this warning just gets ignored. Compiling a pattern every time only to throw it away is a waste of time.

IMHO it could be completely removed. The issue is here: #56798, and my PR to delete it didn't get merged, but it's an easy fix.
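The warning in question fires when a str.contains pattern has capturing groups; rewriting them as non-capturing groups gives the same matches without the warning (sketch):

```python
import warnings

import pandas as pd

s = pd.Series(["cat", "dog", "bird"])

# Capturing groups in str.contains trigger a UserWarning about unused
# match groups; (?:...) is the warning-free equivalent.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    with_groups = s.str.contains(r"(cat|dog)")
no_groups = s.str.contains(r"(?:cat|dog)")
```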
