
PERF: Implement groupby idxmax/idxmin in Cython #54234

Merged: 58 commits into pandas-dev:main from rhshadrach:gb_idxmax_unobserved_cat on Oct 27, 2023

Conversation

rhshadrach
Member

@rhshadrach rhshadrach commented Jul 23, 2023

Adds Cython code for idxmin/idxmax in groupby.

Code

import numpy as np
import pandas as pd

for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    df = pd.DataFrame(
        {
            "a": np.random.randint(0, 100, size),
            "b": np.random.random(size),
        }
    )
    gb = df.groupby("a")
    print(f"size: {size}")
    %timeit gb.idxmin()
    print()
size: 100
7.91 ms ± 29.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
53.2 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- PR

size: 1000
12.6 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
52.4 µs ± 513 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- PR

size: 10000
12.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
67.9 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- PR

size: 100000
13 ms ± 9.81 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
207 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR

size: 1000000
18.3 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
1.56 ms ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR

size: 10000000
188 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)   <-- main
15.8 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- PR

Added ASVs (the min line is just noise I think)

       before           after         ratio
     [3c01ce2a]       [73927a73]
     <gb_mi_cat_result_index>       <gb_idxmax_unobserved_cat>
-      15.5±0.2ms       13.8±0.2ms     0.89  groupby.GroupByCythonAgg.time_frame_agg('float64', 'min')
-      78.4±0.6ms       13.9±0.3ms     0.18  groupby.GroupByCythonAgg.time_frame_agg('float64', 'idxmin')
-      79.5±0.7ms       13.1±0.4ms     0.16  groupby.GroupByCythonAgg.time_frame_agg('float64', 'idxmax')

@rhshadrach rhshadrach added Groupby Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc. labels Jul 23, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jul 23, 2023
@rhshadrach rhshadrach requested a review from WillAyd as a code owner July 23, 2023 16:08
@jbrockmendel
Member

Looks really similar to #52339 but you got it working!

@rhshadrach
Member Author

Ahhh, shoot! I forgot you had been working on this. Couple of minor differences and things I'll steal from yours if we're going forward with this (e.g. argmin/argmax instead of idxmin/idxmax; initializing out to -1 in Cython rather than where I have it here).

But I think the main difference is the use of post_processing here vs calling _cython_agg_general in yours. What do you think about the additional argument?

rhshadrach referenced this pull request Jul 23, 2023
@jbrockmendel
Member

> But I think the main difference is the use of post_processing here vs calling _cython_agg_general in yours. What do you think about the additional argument?

bc it is only passed from _idxmin_idxmax, could we avoid having the argument at all and just do the post-processing there?

@rhshadrach
Member Author

> But I think the main difference is the use of post_processing here vs calling _cython_agg_general in yours. What do you think about the additional argument?

> bc it is only passed from _idxmin_idxmax, could we avoid having the argument at all and just do the post-processing there?

That was my first attempt. The trouble is we do a bunch of post-processing already in _cython_agg_general (handling observed, as_index in particular) that gets in the way, and then we need to repeat this in idxmin/idxmax. Using the post_processing seems cleaner, at least with the current state.
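As a rough illustration of the pattern being discussed (all names here are hypothetical sketches, not the actual pandas internals): the shared aggregation path accepts an optional post_processing callback, so idxmin/idxmax can translate the row numbers produced by the kernel into index labels after the common handling has already run.

```python
import numpy as np

def agg_general(values, labels, ngroups, agg, post_processing=None):
    # Shared path (hypothetical): run the group-wise reduction, then an
    # optional hook for op-specific fix-ups such as idxmin/idxmax.
    out = np.empty(ngroups, dtype=np.intp)
    for g in range(ngroups):
        rows = np.flatnonzero(labels == g)
        out[g] = agg(values, rows)
    if post_processing is not None:
        return post_processing(out)
    return out

def argmin_rows(values, rows):
    # Global row number of the group's minimum, like the Cython kernel
    # in this PR, which only produces row numbers.
    return rows[np.argmin(values[rows])]

values = np.array([3.0, 1.0, 2.0, 0.5])
labels = np.array([0, 0, 1, 1])
index = np.array(["w", "x", "y", "z"])

# idxmin: row numbers from the kernel, index labels filled in by the hook.
result = agg_general(values, labels, 2, argmin_rows,
                     post_processing=lambda pos: index.take(pos))
print(result)  # ['x' 'z']
```

The point of the hook is that the label translation happens after the shared observed/as_index handling, instead of being repeated in each caller.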

index = self.obj.index
if isinstance(index, MultiIndex):
    index = index.to_flat_index()
result = index.take(x).values
Member

could be simpler to do index.array.take(x, allow_fill=True)?

Member Author

For most index types, this attempts to call _take_nd_ndarray which then calls _libs.algos.take_1d_int64_float64 and fails since x here is (often) 2d.

Member

I'm confused then. Index.take allows 2D indices?

There are other places where we need to do a take-with-fill on an Index, and we somewhat awkwardly use algos.take. Might be better long-term to support allow_fill in Index.take (to mirror EA.take), or to have a private method that allows it.
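For reference, the two take semantics being contrasted here can be seen from the public API (this only illustrates existing behavior, not anything added by this PR):

```python
import pandas as pd

# Index.take with the default fill_value=None treats -1 positionally:
# it wraps around to the last element.
idx = pd.Index([10, 20, 30])
print(idx.take([0, -1]).tolist())  # [10, 30]

# EA.take with allow_fill=True treats -1 as "missing" and fills with NA,
# which is the fill behavior the comment above suggests mirroring.
arr = pd.array([10, 20, 30], dtype="Int64")
print(arr.take([0, -1], allow_fill=True))
# <IntegerArray>
# [10, <NA>]
```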

@jbrockmendel
Member

> Using the post_processing seems cleaner, at least with the current state.

Seems reasonable, thanks for taking a look.

if isinstance(maybe_slice, slice):
    freq = self._data._get_getitem_freq(maybe_slice)
    result._data._freq = freq
if indices.ndim == 1:
Member

This seems very weird. If indices.ndim > 1 then result.ndim > 1, and that can't be valid for an Index?

@rhshadrach (Member Author) commented Jul 26, 2023

Indeed, I think I need to find another way to approach this. The following is for Int64; I figure something similar applies to datetimelike (but I haven't checked).

index = pd.Index([1, 2, 3], dtype="Int64")
indices = np.array([[0, 1], [-1, 2]])
result = index.take(indices)

print(result)
# Index([[1, 2], [3, 3]], dtype='Int64')

print(result.values)
# <IntegerArray>
# [
# [1, 2],
# [3, 3]
# ]
# Shape: (2, 2), dtype: Int64

elif how in ["idxmin", "idxmax"]:
    # The Cython implementation only produces the row number; we'll take
    # from the index using this in post processing
    out_dtype = "int64"
Member

I think np.intp?
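For context (an aside, not part of the diff): np.intp is NumPy's platform pointer-sized integer, the dtype NumPy itself uses for positional indexing, which makes it the natural dtype for row numbers that feed a later take:

```python
import numpy as np

# np.intp is int32 on 32-bit builds and int64 on 64-bit builds; it is the
# dtype NumPy uses internally for indexing operations.
assert np.dtype(np.intp).kind == "i"

# Row numbers stored as intp can be fed straight into take without a cast.
rows = np.array([0, 2, 1], dtype=np.intp)
print(np.arange(5).take(rows))  # [0 2 1]
```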

@jbrockmendel
Member

I'm not clear on how this fixes #10694. Isn't raising because of an empty sequence correct there?

@rhshadrach
Member Author

> I'm not clear on how this fixes #10694. Isn't raising because of an empty sequence correct there?

I keep going back and forth over whether no observations should produce NA or raise. While working on this, it seemed like making it raise effectively makes observed=False useless for idxmin/idxmax, and many (all?) other ops fill in NA when no other sentinel makes sense. But thinking about it again, it should agree with e.g. df.idxmin() where df has no rows, so I'm going to change this to raise instead.
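For comparison, the non-groupby behavior referenced above: idxmin with no rows has no index label to return, so pandas raises (a minimal illustration of existing behavior):

```python
import pandas as pd

# An empty Series has no label to return as the position of the minimum.
try:
    pd.Series(dtype="float64").idxmin()
except ValueError as err:
    print(f"raised ValueError: {err}")
```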

@rhshadrach rhshadrach marked this pull request as draft July 29, 2023 15:15
@@ -564,9 +564,10 @@ def test_categorical_reducers(reduction_func, observed, sort, as_index, index_ki
    values = expected["y"].values.tolist()
    if index_kind == "single":
        values = [np.nan if e == 4 else e for e in values]
    expected["y"] = pd.Categorical(values, categories=[1, 2, 3])
Member

do the test changes indicate bugfixes?

Member Author

Yea - I've added two whatsnew notes, one for this and the other for transform.

@@ -2063,7 +2063,7 @@ def get_categorical_invalid_expected():
    with pytest.raises(klass, match=msg):
        get_result()

-    if op in ["min", "max"] and isinstance(columns, list):
+    if op in ["min", "max", "idxmin", "idxmax"] and isinstance(columns, list):
Member Author

This change is covered by the whatsnew note added in #55268

@rhshadrach
Member Author

@jbrockmendel - friendly ping

@jbrockmendel
Member

will prioritize this Monday AM

@jbrockmendel
Member

Couple of questions about reachability, otherwise LGTM

@jbrockmendel
Member

LGTM

@mroeschke mroeschke merged commit ccca5df into pandas-dev:main Oct 27, 2023
32 of 33 checks passed
@mroeschke
Member

Nice! Thanks @rhshadrach

@rhshadrach rhshadrach deleted the gb_idxmax_unobserved_cat branch October 27, 2023 03:15
@jbrockmendel
Member

thanks for seeing this through @rhshadrach!

Successfully merging this pull request may close these issues:

BUG: GroupBy.idxmax breaks when grouping by Categorical with unused categories