PERF: Implement groupby idxmax/idxmin in Cython #54234
Looks really similar to #52339 but you got it working!
Ahhh, shoot! I forgot you had been working on this. Couple of minor differences and things I'll steal from yours if we're going forward with this (e.g. argmin/argmax instead of idxmin/idxmax; initializing But I think the main difference is the use of post_processing here vs calling |
* ENH: non float64 result support in numba groupby
* refactor & simplify
* fix CI
* maybe green?
* skip unsupported ops in other bench as well
* updates from code review
* remove commented code
* update whatsnew
* debug benchmarks
* Skip min/max benchmarks
Because it is only passed from _idxmin_idxmax, could we avoid having the argument at all and just do the post-processing there?
That was my first attempt. The trouble is we do a bunch of post-processing already in |
pandas/core/groupby/groupby.py
index = self.obj.index
if isinstance(index, MultiIndex):
    index = index.to_flat_index()
result = index.take(x).values
Could be simpler to do index.array.take(x, allow_fill=True)?
For most index types, this attempts to call _take_nd_ndarray, which then calls _libs.algos.take_1d_int64_float64 and fails since x here is (often) 2d.
I'm confused then. index.take allows 2D indices?
There are other places where we need to do a take-with-fill on an Index and we somewhat awkwardly use algos.take. Might be better long-term to support allow_fill in Index.take (to mirror EA.take) or have a private method that allows it.
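For readers following the take-with-fill discussion, here is a minimal sketch of how two public APIs treat -1, assuming a recent pandas 2.x. It uses `pandas.api.extensions.take` as a stand-in for the internal `algos.take` mentioned above; it is illustrative only and not the code from this PR.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import take

index = pd.Index([10, 20, 30])
positions = np.array([0, -1, 2])

# Index.take with no fill_value treats -1 as an ordinary positional index,
# i.e. the last element.
plain = index.take(positions)

# take(..., allow_fill=True) treats -1 as "missing" and fills it with NA,
# which upcasts the integer values to float.
filled = take(index.to_numpy(), positions, allow_fill=True)

print(plain.tolist())
print(filled)
```

The difference matters for idxmin/idxmax because a row number of -1 is a natural way to mark a group with no valid result, and a plain positional take would silently map it to the last label.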
Seems reasonable, thanks for taking a look.
pandas/core/indexes/datetimelike.py
if isinstance(maybe_slice, slice):
    freq = self._data._get_getitem_freq(maybe_slice)
    result._data._freq = freq
if indices.ndim == 1:
This seems very weird. If indices.ndim > 1 then result.ndim > 1, and that can't be valid for an Index?
Indeed, I think I need to find another way to approach this. The following is for Int64, I figure something similar applies to datetimelike (but haven't checked).
import numpy as np
import pandas as pd

index = pd.Index([1, 2, 3], dtype="Int64")
# 2D indexer; note that -1 selects the last element rather than marking a fill
indices = np.array([[0, 1], [-1, 2]])
result = index.take(indices)
print(result)
# Index([[1, 2], [3, 3]], dtype='Int64')
print(result.values)
# <IntegerArray>
# [
# [1, 2],
# [3, 3]
# ]
# Shape: (2, 2), dtype: Int64
pandas/core/groupby/ops.py
elif how in ["idxmin", "idxmax"]:
    # The Cython implementation only produces the row number; we'll take
    # from the index using this in post processing
    out_dtype = "int64"
I think np.intp?
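The two-step approach described in the code comment (compute a row number per group, then take from the index in post-processing) can be sketched in plain NumPy and pandas. This is a rough illustration under assumptions: the Python loop stands in for the Cython kernel, the frame and variable names are made up, and `np.intp` is used for the row numbers as the review suggests.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "val": [1.0, 3.0, 2.0, 0.0]},
    index=["w", "x", "y", "z"],
)

# Step 1 (stand-in for the Cython kernel): for each group, find the row
# number of the maximum. np.intp is NumPy's native indexing integer type.
codes, uniques = pd.factorize(df["key"])
values = df["val"].to_numpy()
row_numbers = np.empty(len(uniques), dtype=np.intp)
for group in range(len(uniques)):
    rows = np.flatnonzero(codes == group)
    row_numbers[group] = rows[values[rows].argmax()]

# Step 2 (post-processing): translate row numbers into index labels.
labels = df.index.take(row_numbers)
print(dict(zip(uniques, labels)))
```

Keeping the kernel's output as row numbers means it never needs to know the index dtype; only the final take does.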
I'm not clear on how this fixes #10694. Isn't raising because of an empty sequence correct there?
I keep going back and forth over whether no observations should be NA or raise. While working on this it seemed like making this raise effectively makes |
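As background for the NA-versus-raise question, the non-groupby behaviour can be checked directly. The sketch below assumes a recent pandas and shows only the plain Series case; how groupby treats groups with no observations is the point under discussion and is deliberately not asserted here.

```python
import pandas as pd

empty = pd.Series([], dtype="float64")

# An empty Series has no position to report, so idxmax raises.
try:
    empty.idxmax()
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```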
@@ -564,9 +564,10 @@ def test_categorical_reducers(reduction_func, observed, sort, as_index, index_ki
values = expected["y"].values.tolist()
if index_kind == "single":
values = [np.nan if e == 4 else e for e in values]
expected["y"] = pd.Categorical(values, categories=[1, 2, 3])
Do the test changes indicate bugfixes?
Yea - I've added two whatsnew notes, one for this and the other for transform.
…dxmax_unobserved_cat
@@ -2063,7 +2063,7 @@ def get_categorical_invalid_expected():
 with pytest.raises(klass, match=msg):
     get_result()

- if op in ["min", "max"] and isinstance(columns, list):
+ if op in ["min", "max", "idxmin", "idxmax"] and isinstance(columns, list):
This change is covered by the whatsnew note added in #55268
…dxmax_unobserved_cat
# Conflicts:
#	doc/source/whatsnew/v2.2.0.rst
@jbrockmendel - friendly ping
Will prioritize this Monday AM
Couple of questions about reachability, otherwise LGTM
LGTM
Nice! Thanks @rhshadrach
Thanks for seeing this through @rhshadrach!
Adds Cython code for idxmin/idxmax in groupby.
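For orientation, this is the user-facing operation the PR speeds up. The example is a small sketch with made-up data; it shows the public `idxmin`/`idxmax` groupby methods only, not the new Cython code path itself.

```python
import pandas as pd

sales = pd.DataFrame(
    {"store": ["north", "north", "south", "south", "south"],
     "revenue": [120, 95, 80, 150, 110]},
    index=["mon", "tue", "wed", "thu", "fri"],
)

grouped = sales.groupby("store")["revenue"]

# Each result maps a group to the index label of its extreme row.
best_day = grouped.idxmax()
worst_day = grouped.idxmin()

print(best_day.to_dict())
print(worst_day.to_dict())
```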
Added ASVs (the min line is just noise I think)
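The benchmark code itself is not shown on this page. As a rough idea of the shape an ASV benchmark for this might take, here is a hypothetical sketch: the class name, data sizes, and parameters are assumptions and are not taken from the pandas benchmark suite. It relies only on the asv conventions of a `setup` method, `time_*` methods, and `params`/`param_names`.

```python
import numpy as np
import pandas as pd


class GroupByIdxSketch:
    # Hypothetical benchmark class; asv calls setup, then times each
    # time_* method once per entry in params.
    params = ["idxmin", "idxmax"]
    param_names = ["method"]

    def setup(self, method):
        rng = np.random.default_rng(0)
        n = 100_000
        self.df = pd.DataFrame(
            {"key": rng.integers(0, 100, size=n), "val": rng.standard_normal(n)}
        )

    def time_groupby_idx(self, method):
        # Only the call below is timed; its result is discarded.
        getattr(self.df.groupby("key")["val"], method)()
```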