
PERF: Implement groupby idxmax/idxmin in Cython #54234

Merged: 58 commits into pandas-dev:main from rhshadrach:gb_idxmax_unobserved_cat on Oct 27, 2023

Conversation

rhshadrach
Member

@rhshadrach rhshadrach commented Jul 23, 2023

Adds Cython code for idxmin/idxmax in groupby.

Code

import numpy as np
import pandas as pd

for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    df = pd.DataFrame(
        {
            "a": np.random.randint(0, 100, size),
            "b": np.random.random(size),
        }
    )
    gb = df.groupby("a")
    print(f"size: {size}")
    %timeit gb.idxmin()
    print()
size: 100
7.91 ms ± 29.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
53.2 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- PR

size: 1000
12.6 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
52.4 µs ± 513 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- PR

size: 10000
12.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
67.9 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <-- PR

size: 100000
13 ms ± 9.81 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
207 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR

size: 1000000
18.3 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- main
1.56 ms ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) <-- PR

size: 10000000
188 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)   <-- main
15.8 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- PR

Added ASVs (the min line is just noise I think)

       before           after         ratio
     [3c01ce2a]       [73927a73]
     <gb_mi_cat_result_index>       <gb_idxmax_unobserved_cat>
-      15.5±0.2ms       13.8±0.2ms     0.89  groupby.GroupByCythonAgg.time_frame_agg('float64', 'min')
-      78.4±0.6ms       13.9±0.3ms     0.18  groupby.GroupByCythonAgg.time_frame_agg('float64', 'idxmin')
-      79.5±0.7ms       13.1±0.4ms     0.16  groupby.GroupByCythonAgg.time_frame_agg('float64', 'idxmax')

@rhshadrach rhshadrach added Groupby Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc. labels Jul 23, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jul 23, 2023
@rhshadrach rhshadrach requested a review from WillAyd as a code owner July 23, 2023 16:08
@jbrockmendel
Member

Looks really similar to #52339 but you got it working!

@rhshadrach
Member Author

Ahhh, shoot! I forgot you had been working on this. Couple of minor differences and things I'll steal from yours if we're going forward with this (e.g. argmin/argmax instead of idxmin/idxmax; initializing out to -1 in Cython rather than where I have it here).

But I think the main difference is the use of post_processing here vs calling _cython_agg_general in yours. What do you think about the additional argument?

rhshadrach referenced this pull request Jul 23, 2023
@jbrockmendel
Member

> But I think the main difference is the use of post_processing here vs calling _cython_agg_general in yours. What do you think about the additional argument?

bc it is only passed from _idxmin_idxmax, could we avoid having the argument at all and just do the post-processing there?

@rhshadrach
Member Author

> But I think the main difference is the use of post_processing here vs calling _cython_agg_general in yours. What do you think about the additional argument?

> bc it is only passed from _idxmin_idxmax, could we avoid having the argument at all and just do the post-processing there?

That was my first attempt. The trouble is we do a bunch of post-processing already in _cython_agg_general (handling observed, as_index in particular) that gets in the way, and then we need to repeat this in idxmin/idxmax. Using the post_processing seems cleaner, at least with the current state.
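As a rough illustration of the pattern being discussed (all names here are hypothetical sketches, not the actual pandas internals): the shared aggregation path accepts an optional post_processing callback, so idxmin/idxmax can translate the row numbers produced by the kernel into index labels after the common handling has already run.

```python
import numpy as np

def agg_general(values, labels, ngroups, agg, post_processing=None):
    # Shared path (hypothetical): run the group-wise reduction, then an
    # optional hook for op-specific fix-ups such as idxmin/idxmax.
    out = np.empty(ngroups, dtype=np.intp)
    for g in range(ngroups):
        rows = np.flatnonzero(labels == g)
        out[g] = agg(values, rows)
    if post_processing is not None:
        return post_processing(out)
    return out

def argmin_rows(values, rows):
    # Global row number of the group's minimum, like the Cython kernel
    # in this PR, which only produces row numbers.
    return rows[np.argmin(values[rows])]

values = np.array([3.0, 1.0, 2.0, 0.5])
labels = np.array([0, 0, 1, 1])
index = np.array(["w", "x", "y", "z"])

# idxmin: row numbers from the kernel, index labels filled in by the hook.
result = agg_general(values, labels, 2, argmin_rows,
                     post_processing=lambda pos: index.take(pos))
print(result)  # ['x' 'z']
```

The point of the hook is that the label translation happens after the shared observed/as_index handling, instead of being repeated in each caller.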

index = self.obj.index
if isinstance(index, MultiIndex):
    index = index.to_flat_index()
result = index.take(x).values
Member

could be simpler to do index.array.take(x, allow_fill=True)?

Member Author

For most index types, this attempts to call _take_nd_ndarray which then calls _libs.algos.take_1d_int64_float64 and fails since x here is (often) 2d.

Member

I'm confused then. Index.take allows 2D indices?

There are other places where we need to do a take-with-fill on an Index, and we somewhat awkwardly use algos.take. Might be better long-term to support allow_fill in Index.take (to mirror EA.take), or to have a private method that allows it.
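For reference, the two take semantics being contrasted here can be seen from the public API (this only illustrates existing behavior, not anything added by this PR):

```python
import pandas as pd

# Index.take with the default fill_value=None treats -1 positionally:
# it wraps around to the last element.
idx = pd.Index([10, 20, 30])
print(idx.take([0, -1]).tolist())  # [10, 30]

# EA.take with allow_fill=True treats -1 as "missing" and fills with NA,
# which is the fill behavior the comment above suggests mirroring.
arr = pd.array([10, 20, 30], dtype="Int64")
print(arr.take([0, -1], allow_fill=True))
# <IntegerArray>
# [10, <NA>]
```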

@jbrockmendel
Member

> Using the post_processing seems cleaner, at least with the current state.

Seems reasonable, thanks for taking a look.

if isinstance(maybe_slice, slice):
    freq = self._data._get_getitem_freq(maybe_slice)
    result._data._freq = freq
if indices.ndim == 1:
Member

This seems very weird. If indices.ndim > 1 then result.ndim > 1, and that can't be valid for an Index?

@rhshadrach (Member Author) commented Jul 26, 2023

Indeed, I think I need to find another way to approach this. The following is for Int64; I figure something similar applies to datetimelike (but I haven't checked).

index = pd.Index([1, 2, 3], dtype="Int64")
indices = np.array([[0, 1], [-1, 2]])
result = index.take(indices)

print(result)
# Index([[1, 2], [3, 3]], dtype='Int64')

print(result.values)
# <IntegerArray>
# [
# [1, 2],
# [3, 3]
# ]
# Shape: (2, 2), dtype: Int64

elif how in ["idxmin", "idxmax"]:
    # The Cython implementation only produces the row number; we'll take
    # from the index using this in post processing
    out_dtype = "int64"
Member

I think np.intp?
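For context (an aside, not part of the diff): np.intp is NumPy's platform pointer-sized integer, the dtype NumPy itself uses for positional indexing, which makes it the natural dtype for row numbers that feed a later take:

```python
import numpy as np

# np.intp is int32 on 32-bit builds and int64 on 64-bit builds; it is the
# dtype NumPy uses internally for indexing operations.
assert np.dtype(np.intp).kind == "i"

# Row numbers stored as intp can be fed straight into take without a cast.
rows = np.array([0, 2, 1], dtype=np.intp)
print(np.arange(5).take(rows))  # [0 2 1]
```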

@jbrockmendel
Member

I'm not clear on how this fixes #10694. Isn't raising because of an empty sequence correct there?

@rhshadrach
Member Author

> I'm not clear on how this fixes #10694. Isn't raising because of an empty sequence correct there?

I keep going back and forth over whether no observations should produce NA or raise. While working on this, it seemed like making it raise effectively makes observed=False useless for idxmin/idxmax, and many (all?) other ops fill in NA when no other sentinel makes sense. But thinking about it again, it should agree with e.g. df.idxmin() where df has no rows, so I'm going to change this to raise instead.
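For comparison, the non-groupby behavior referenced above: idxmin with no rows has no index label to return, so pandas raises (a minimal illustration of existing behavior):

```python
import pandas as pd

# An empty Series has no label to return as the position of the minimum.
try:
    pd.Series(dtype="float64").idxmin()
except ValueError as err:
    print(f"raised ValueError: {err}")
```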

@rhshadrach rhshadrach marked this pull request as draft July 29, 2023 15:15
@@ -564,9 +564,10 @@ def test_categorical_reducers(reduction_func, observed, sort, as_index, index_ki
    values = expected["y"].values.tolist()
    if index_kind == "single":
        values = [np.nan if e == 4 else e for e in values]
    expected["y"] = pd.Categorical(values, categories=[1, 2, 3])
Member

do the test changes indicate bugfixes?

Member Author

Yea - I've added two whatsnew notes, one for this and the other for transform.

@@ -2063,7 +2063,7 @@ def get_categorical_invalid_expected():
    with pytest.raises(klass, match=msg):
        get_result()

-    if op in ["min", "max"] and isinstance(columns, list):
+    if op in ["min", "max", "idxmin", "idxmax"] and isinstance(columns, list):
Member Author

This change is covered by the whatsnew note added in #55268

@rhshadrach
Member Author

@jbrockmendel - friendly ping

@jbrockmendel
Member

will prioritize this Monday AM

@jbrockmendel
Member

Couple of questions about reachability, otherwise LGTM

@jbrockmendel
Member

LGTM

@mroeschke mroeschke merged commit ccca5df into pandas-dev:main Oct 27, 2023
32 of 33 checks passed
@mroeschke
Member

Nice! Thanks @rhshadrach

@rhshadrach rhshadrach deleted the gb_idxmax_unobserved_cat branch October 27, 2023 03:15
@jbrockmendel
Member

thanks for seeing this through @rhshadrach!

Successfully merging this pull request may close these issues:

BUG: GroupBy.idxmax breaks when grouping by Categorical with unused categories