Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Implement groupby idxmax/idxmin in Cython #54234

Merged
merged 58 commits into from
Oct 27, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
58 commits
Select commit Hold shift + click to select a range
5acec90
PERF: Implement groupby idxmax/idxmin in Cython
rhshadrach Jul 19, 2023
60407c5
Update docs
rhshadrach Jul 23, 2023
73927a7
Add ASVs
rhshadrach Jul 23, 2023
6958c95
mypy fixup
rhshadrach Jul 23, 2023
f1b9ad1
Refinements
rhshadrach Jul 25, 2023
ec6a33b
Revert
rhshadrach Jul 25, 2023
cede8b4
Rework
rhshadrach Jul 29, 2023
03f30f6
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Jul 29, 2023
9e4aabc
Refinements
rhshadrach Jul 30, 2023
3e3f14b
fixup
rhshadrach Jul 30, 2023
fa941b2
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Jul 31, 2023
bfc9cc9
Fixup, show stderr in ASVs
rhshadrach Jul 31, 2023
212247b
Remove idxmin/idxmax from numba ASVs
rhshadrach Aug 1, 2023
bb65902
Merge branch 'gb_idxmax_unobserved_cat' of https://github.com/rhshadr…
rhshadrach Aug 1, 2023
590b6d4
WIP
rhshadrach Aug 3, 2023
929d5e9
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Aug 12, 2023
8043f29
WIP
rhshadrach Aug 14, 2023
d5f328b
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Aug 14, 2023
056c590
Rework
rhshadrach Aug 14, 2023
c0c79ef
Rework
rhshadrach Aug 14, 2023
b2641a4
fixup
rhshadrach Aug 14, 2023
cbcca6d
whatsnew
rhshadrach Aug 14, 2023
03d4510
refinements
rhshadrach Aug 14, 2023
6e727b4
cleanup
rhshadrach Aug 18, 2023
b56d49f
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Aug 29, 2023
51432ec
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 8, 2023
6a4f51e
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 12, 2023
1268eac
Merge branch 'gb_idxmax_unobserved_cat' of https://github.com/rhshadr…
rhshadrach Sep 12, 2023
448581a
fixup type-hints in groupby.pyi
rhshadrach Sep 13, 2023
f9bb55e
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 13, 2023
ff32210
Use mask instead of sentinel
rhshadrach Sep 14, 2023
57d7b81
fixup
rhshadrach Sep 14, 2023
75bde4b
fixup
rhshadrach Sep 14, 2023
587a054
fixup
rhshadrach Sep 15, 2023
f1d2b5c
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 15, 2023
00e4347
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 20, 2023
6658a98
seen -> unobserved; add assert
rhshadrach Sep 20, 2023
1539925
Merge branch 'main' into gb_idxmax_unobserved_cat
rhshadrach Sep 20, 2023
dadd01e
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 24, 2023
9d7d082
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 24, 2023
0bfd131
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 8, 2023
0d9d54c
Rework
rhshadrach Oct 8, 2023
363212d
cleanup
rhshadrach Oct 8, 2023
52a3413
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 9, 2023
30bc4c7
Fixup
rhshadrach Oct 9, 2023
95f35a4
Merge branch 'main' into gb_idxmax_unobserved_cat
rhshadrach Oct 12, 2023
97a52f8
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 14, 2023
ff00e20
fixup
rhshadrach Oct 14, 2023
df282d8
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 17, 2023
3d2d8a0
Refinements
rhshadrach Oct 17, 2023
ad07653
fixup
rhshadrach Oct 17, 2023
b07e9ba
fixup
rhshadrach Oct 17, 2023
da8088c
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 17, 2023
38b3f38
Merge branch 'main' into gb_idxmax_unobserved_cat
rhshadrach Oct 20, 2023
5c416fe
WIP
rhshadrach Oct 24, 2023
75638a5
Avoid _maybe_mask_result
rhshadrach Oct 24, 2023
b666563
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 25, 2023
a8a5412
Add assert
rhshadrach Oct 25, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions asv_bench/benchmarks/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,8 @@
"ffill",
"first",
"head",
"idxmax",
"idxmin",
"last",
"median",
"nunique",
Expand Down Expand Up @@ -588,6 +590,8 @@ class GroupByCythonAgg:
"prod",
"min",
"max",
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
"idxmin",
"idxmax",
"mean",
"median",
"var",
Expand Down
4 changes: 2 additions & 2 deletions doc/source/user_guide/groupby.rst
Original file line number Diff line number Diff line change
Expand Up @@ -517,8 +517,8 @@ listed below, those with a ``*`` do *not* have a Cython-optimized implementation
:meth:`~.DataFrameGroupBy.count`;Compute the number of non-NA values in the groups
:meth:`~.DataFrameGroupBy.cov` * ;Compute the covariance of the groups
:meth:`~.DataFrameGroupBy.first`;Compute the first occurring value in each group
:meth:`~.DataFrameGroupBy.idxmax` *;Compute the index of the maximum value in each group
:meth:`~.DataFrameGroupBy.idxmin` *;Compute the index of the minimum value in each group
:meth:`~.DataFrameGroupBy.idxmax`;Compute the index of the maximum value in each group
:meth:`~.DataFrameGroupBy.idxmin`;Compute the index of the minimum value in each group
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
:meth:`~.DataFrameGroupBy.last`;Compute the last occurring value in each group
:meth:`~.DataFrameGroupBy.max`;Compute the maximum value in each group
:meth:`~.DataFrameGroupBy.mean`;Compute the mean of each group
Expand Down
4 changes: 3 additions & 1 deletion doc/source/whatsnew/v2.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -303,6 +303,7 @@ Performance improvements
- Performance improvement in :meth:`DataFrame.sort_index` and :meth:`Series.sort_index` when indexed by a :class:`MultiIndex` (:issue:`54835`)
- Performance improvement in :meth:`Index.difference` (:issue:`55108`)
- Performance improvement in :meth:`Series.duplicated` for pyarrow dtypes (:issue:`55255`)
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
- Performance improvement in :meth:`SeriesGroupBy.idxmax`, :meth:`SeriesGroupBy.idxmin`, :meth:`DataFrameGroupBy.idxmax`, :meth:`DataFrameGroupBy.idxmin` (:issue:`54234`)
- Performance improvement when indexing with more than 4 keys (:issue:`54550`)
- Performance improvement when localizing time to UTC (:issue:`55241`)

Expand Down Expand Up @@ -403,10 +404,11 @@ Plotting
Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^
- Bug in :class:`.Rolling` where duplicate datetimelike indexes are treated as consecutive rather than equal with ``closed='left'`` and ``closed='neither'`` (:issue:`20712`)
- Bug in :meth:`.DataFrameGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.SeriesGroupBy.idxmin`, and :meth:`.SeriesGroupBy.idxmax` would not retain :class:`.Categorical` dtype when the index was a :class:`.CategoricalIndex` that contained NA values (:issue:`54234`)
- Bug in :meth:`.DataFrameGroupBy.transform` and :meth:`.SeriesGroupBy.transform` when ``observed=False`` and ``f="idxmin"`` or ``f="idxmax"`` would incorrectly raise on unobserved categories (:issue:`54234`)
- Bug in :meth:`DataFrame.resample` not respecting ``closed`` and ``label`` arguments for :class:`~pandas.tseries.offsets.BusinessDay` (:issue:`55282`)
- Bug in :meth:`DataFrame.resample` where bin edges were not correct for :class:`~pandas.tseries.offsets.BusinessDay` (:issue:`55281`)
- Bug in :meth:`DataFrame.resample` where bin edges were not correct for :class:`~pandas.tseries.offsets.MonthBegin` (:issue:`55271`)
-

Reshaping
^^^^^^^^^
Expand Down
12 changes: 12 additions & 0 deletions pandas/_libs/groupby.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -181,6 +181,18 @@ def group_min(
mask: np.ndarray | None = ...,
result_mask: np.ndarray | None = ...,
) -> None: ...
def group_idxmin_idxmax(
out: npt.NDArray[np.intp],
counts: npt.NDArray[np.int64],
values: np.ndarray, # ndarray[groupby_t, ndim=2]
labels: npt.NDArray[np.intp],
min_count: int = ...,
is_datetimelike: bool = ...,
mask: np.ndarray | None = ...,
name: str = ...,
skipna: bool = ...,
result_mask: np.ndarray | None = ...,
) -> None: ...
def group_cummin(
out: np.ndarray, # groupby_t[:, ::1]
values: np.ndarray, # ndarray[groupby_t, ndim=2]
Expand Down
145 changes: 145 additions & 0 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1794,6 +1794,151 @@ cdef group_min_max(
)


@cython.wraparound(False)
@cython.boundscheck(False)
def group_idxmin_idxmax(
intp_t[:, ::1] out,
WillAyd marked this conversation as resolved.
Show resolved Hide resolved
int64_t[::1] counts,
ndarray[numeric_object_t, ndim=2] values,
const intp_t[::1] labels,
Py_ssize_t min_count=-1,
bint is_datetimelike=False,
const uint8_t[:, ::1] mask=None,
str name="idxmin",
bint skipna=True,
uint8_t[:, ::1] result_mask=None,
):
"""
Compute index of minimum/maximum of columns of `values`, in row groups `labels`.

This function only computes the row number where the minimum/maximum occurs, we'll
take the corresponding index value after this function.

Parameters
----------
out : np.ndarray[intp, ndim=2]
Array to store result in.
counts : np.ndarray[int64]
Input as a zeroed array, populated by group sizes during algorithm
values : np.ndarray[numeric_object_t, ndim=2]
Values to find column-wise min/max of.
labels : np.ndarray[np.intp]
Labels to group by.
min_count : Py_ssize_t, default -1
The minimum number of non-NA group elements, NA result if threshold
is not met.
is_datetimelike : bool
True if `values` contains datetime-like entries.
name : {"idxmin", "idxmax"}, default "idxmin"
Whether to compute idxmin or idxmax.
mask : ndarray[bool, ndim=2], optional
If not None, indices represent missing values,
otherwise the mask will not be used
skipna : bool, default True
Flag to ignore nan values during truth testing
result_mask : ndarray[bool, ndim=2], optional
If not None, these specify locations in the output that are NA.
Modified in-place.

Notes
-----
This method modifies the `out` parameter, rather than returning an object.
`counts` is modified to hold group sizes
"""
cdef:
Py_ssize_t i, j, N, K, lab
numeric_object_t val
numeric_object_t[:, ::1] group_min_or_max
bint uses_mask = mask is not None
bint isna_entry
bint compute_max = name == "idxmax"

assert name == "idxmin" or name == "idxmax"

# TODO(cython3):
# Instead of `labels.shape[0]` use `len(labels)`
if not len(values) == labels.shape[0]:
raise AssertionError("len(index) != len(labels)")

N, K = (<object>values).shape

if numeric_object_t is object:
group_min_or_max = np.empty((<object>out).shape, dtype=object)
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
else:
group_min_or_max = np.empty_like(out, dtype=values.dtype)
if N > 0 and K > 0:
# When N or K is zero, we never use group_min_or_max
group_min_or_max[:] = _get_min_or_max(
values[0, 0], compute_max, is_datetimelike
)

# When using transform, we need a valid value for take in the case
# a category is not observed; these values will be dropped
out[:] = 0

# TODO(cython3): De-duplicate once conditional-nogil is available
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
if numeric_object_t is object:
for i in range(N):
lab = labels[i]
if lab < 0:
continue

for j in range(K):
if not skipna and out[lab, j] == -1:
# Once we've hit NA there is no going back
continue
val = values[i, j]

if uses_mask:
isna_entry = mask[i, j]
else:
# TODO(cython3): use _treat_as_na here
isna_entry = checknull(val)

if isna_entry:
if not skipna:
out[lab, j] = -1
else:
if compute_max:
if val > group_min_or_max[lab, j]:
group_min_or_max[lab, j] = val
out[lab, j] = i
else:
if val < group_min_or_max[lab, j]:
group_min_or_max[lab, j] = val
out[lab, j] = i
else:
with nogil:
for i in range(N):
lab = labels[i]
if lab < 0:
continue

for j in range(K):
if not skipna and out[lab, j] == -1:
# Once we've hit NA there is no going back
continue
val = values[i, j]

if uses_mask:
isna_entry = mask[i, j]
else:
isna_entry = _treat_as_na(val, is_datetimelike)

if isna_entry:
if not skipna:
out[lab, j] = -1
else:
if compute_max:
if val > group_min_or_max[lab, j]:
group_min_or_max[lab, j] = val
out[lab, j] = i
else:
if val < group_min_or_max[lab, j]:
group_min_or_max[lab, j] = val
out[lab, j] = i


@cython.wraparound(False)
@cython.boundscheck(False)
def group_max(
Expand Down
16 changes: 13 additions & 3 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -2701,12 +2701,22 @@ def _groupby_op(
dtype = self.dtype
if how in ["sum", "prod", "cumsum", "cumprod", "skew"]:
raise TypeError(f"{dtype} type does not support {how} operations")
if how in ["min", "max", "rank"] and not dtype.ordered:
if how in ["min", "max", "rank", "idxmin", "idxmax"] and not dtype.ordered:
# raise TypeError instead of NotImplementedError to ensure we
# don't go down a group-by-group path, since in the empty-groups
# case that would fail to raise
raise TypeError(f"Cannot perform {how} with non-ordered Categorical")
if how not in ["rank", "any", "all", "first", "last", "min", "max"]:
if how not in [
"rank",
"any",
"all",
"first",
"last",
"min",
"max",
"idxmin",
"idxmax",
]:
if kind == "transform":
raise TypeError(f"{dtype} type does not support {how} operations")
raise TypeError(f"{dtype} dtype does not support aggregation '{how}'")
Expand All @@ -2716,7 +2726,7 @@ def _groupby_op(
if how == "rank":
assert self.ordered # checked earlier
npvalues = self._ndarray
elif how in ["first", "last", "min", "max"]:
elif how in ["first", "last", "min", "max", "idxmin", "idxmax"]:
npvalues = self._ndarray
result_mask = np.zeros(ngroups, dtype=bool)
else:
Expand Down
10 changes: 7 additions & 3 deletions pandas/core/arrays/masked.py
Original file line number Diff line number Diff line change
Expand Up @@ -1506,9 +1506,13 @@ def _groupby_op(
arity = op._cython_arity.get(op.how, 1)
result_mask = np.tile(result_mask, (arity, 1)).T

# res_values should already have the correct dtype, we just need to
# wrap in a MaskedArray
return self._maybe_mask_result(res_values, result_mask)
if op.how in ["idxmin", "idxmax"]:
# Result values are indexes to take, keep as ndarray
return res_values
else:
# res_values should already have the correct dtype, we just need to
# wrap in a MaskedArray
return self._maybe_mask_result(res_values, result_mask)


def transpose_homogeneous_masked_arrays(
Expand Down
20 changes: 18 additions & 2 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,10 @@
deprecate_nonkeyword_arguments,
doc,
)
from pandas.util._exceptions import find_stack_level
from pandas.util._exceptions import (
find_stack_level,
rewrite_warning,
)
from pandas.util._validators import (
validate_ascending,
validate_bool_kwarg,
Expand Down Expand Up @@ -11371,7 +11374,20 @@ def _get_data() -> DataFrame:
row_index = np.tile(np.arange(nrows), ncols)
col_index = np.repeat(np.arange(ncols), nrows)
ser = Series(arr, index=col_index, copy=False)
result = ser.groupby(row_index).agg(name, **kwds)
# GroupBy will raise a warning with SeriesGroupBy as the object,
# likely confusing users
with rewrite_warning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this context expensive? if so might make sense to only do it in idxmin/idxmax instead of in _reduce? (though this is only for axis=1 so NBD?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be very cheap:

size = 1000000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.randint(0, 100, size)})
df['a'] = df['a'].mask(df['a'] < 50, np.nan)
df['b'] = df['b'].mask(df['b'] < 50, np.nan)
%timeit df.idxmin(axis=1)

# Without context
# 111 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# With context
# 113 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In any case, if we were to try to move this any lower I think we'd need to detect who the caller is? We only want to replace this warning when groupby is called from _reduce.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - I see, use e.g. a nullcontext for other ops. I think this is cheap enough but can implement if you'd prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i had in mind was not having any context here but instead in (Series|DataFrame).idx(min|max). But if you say its cheap then its fine

target_message=(
f"The behavior of SeriesGroupBy.{name} with all-NA values"
),
target_category=FutureWarning,
new_message=(
f"The behavior of {type(self).__name__}.{name} with all-NA "
"values, or any-NA and skipna=False, is deprecated. In "
"a future version this will raise ValueError"
),
):
result = ser.groupby(row_index).agg(name, **kwds)
result.index = df.index
if not skipna and name not in ("any", "all"):
mask = df.isna().to_numpy(dtype=np.bool_).any(axis=1)
Expand Down
Loading
Loading