Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Implement groupby idxmax/idxmin in Cython #54234

Merged
merged 58 commits into from
Oct 27, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
58 commits
Select commit Hold shift + click to select a range
5acec90
PERF: Implement groupby idxmax/idxmin in Cython
rhshadrach Jul 19, 2023
60407c5
Update docs
rhshadrach Jul 23, 2023
73927a7
Add ASVs
rhshadrach Jul 23, 2023
6958c95
mypy fixup
rhshadrach Jul 23, 2023
f1b9ad1
Refinements
rhshadrach Jul 25, 2023
ec6a33b
Revert
rhshadrach Jul 25, 2023
cede8b4
Rework
rhshadrach Jul 29, 2023
03f30f6
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Jul 29, 2023
9e4aabc
Refinements
rhshadrach Jul 30, 2023
3e3f14b
fixup
rhshadrach Jul 30, 2023
fa941b2
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Jul 31, 2023
bfc9cc9
Fixup, show stderr in ASVs
rhshadrach Jul 31, 2023
212247b
Remove idxmin/idxmax from numba ASVs
rhshadrach Aug 1, 2023
bb65902
Merge branch 'gb_idxmax_unobserved_cat' of https://github.com/rhshadr…
rhshadrach Aug 1, 2023
590b6d4
WIP
rhshadrach Aug 3, 2023
929d5e9
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Aug 12, 2023
8043f29
WIP
rhshadrach Aug 14, 2023
d5f328b
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Aug 14, 2023
056c590
Rework
rhshadrach Aug 14, 2023
c0c79ef
Rework
rhshadrach Aug 14, 2023
b2641a4
fixup
rhshadrach Aug 14, 2023
cbcca6d
whatsnew
rhshadrach Aug 14, 2023
03d4510
refinements
rhshadrach Aug 14, 2023
6e727b4
cleanup
rhshadrach Aug 18, 2023
b56d49f
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Aug 29, 2023
51432ec
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 8, 2023
6a4f51e
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 12, 2023
1268eac
Merge branch 'gb_idxmax_unobserved_cat' of https://github.com/rhshadr…
rhshadrach Sep 12, 2023
448581a
fixup type-hints in groupby.pyi
rhshadrach Sep 13, 2023
f9bb55e
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 13, 2023
ff32210
Use mask instead of sentinel
rhshadrach Sep 14, 2023
57d7b81
fixup
rhshadrach Sep 14, 2023
75bde4b
fixup
rhshadrach Sep 14, 2023
587a054
fixup
rhshadrach Sep 15, 2023
f1d2b5c
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 15, 2023
00e4347
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 20, 2023
6658a98
seen -> unobserved; add assert
rhshadrach Sep 20, 2023
1539925
Merge branch 'main' into gb_idxmax_unobserved_cat
rhshadrach Sep 20, 2023
dadd01e
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 24, 2023
9d7d082
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Sep 24, 2023
0bfd131
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 8, 2023
0d9d54c
Rework
rhshadrach Oct 8, 2023
363212d
cleanup
rhshadrach Oct 8, 2023
52a3413
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 9, 2023
30bc4c7
Fixup
rhshadrach Oct 9, 2023
95f35a4
Merge branch 'main' into gb_idxmax_unobserved_cat
rhshadrach Oct 12, 2023
97a52f8
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 14, 2023
ff00e20
fixup
rhshadrach Oct 14, 2023
df282d8
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 17, 2023
3d2d8a0
Refinements
rhshadrach Oct 17, 2023
ad07653
fixup
rhshadrach Oct 17, 2023
b07e9ba
fixup
rhshadrach Oct 17, 2023
da8088c
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 17, 2023
38b3f38
Merge branch 'main' into gb_idxmax_unobserved_cat
rhshadrach Oct 20, 2023
5c416fe
WIP
rhshadrach Oct 24, 2023
75638a5
Avoid _maybe_mask_result
rhshadrach Oct 24, 2023
b666563
Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…
rhshadrach Oct 25, 2023
a8a5412
Add assert
rhshadrach Oct 25, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions doc/source/user_guide/groupby.rst
Original file line number Diff line number Diff line change
Expand Up @@ -511,8 +511,8 @@ listed below, those with a ``*`` do *not* have a Cython-optimized implementation
:meth:`~.DataFrameGroupBy.count`;Compute the number of non-NA values in the groups
:meth:`~.DataFrameGroupBy.cov` * ;Compute the covariance of the groups
:meth:`~.DataFrameGroupBy.first`;Compute the first occurring value in each group
:meth:`~.DataFrameGroupBy.idxmax` *;Compute the index of the maximum value in each group
:meth:`~.DataFrameGroupBy.idxmin` *;Compute the index of the minimum value in each group
:meth:`~.DataFrameGroupBy.idxmax`;Compute the index of the maximum value in each group
:meth:`~.DataFrameGroupBy.idxmin`;Compute the index of the minimum value in each group
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
:meth:`~.DataFrameGroupBy.last`;Compute the last occurring value in each group
:meth:`~.DataFrameGroupBy.max`;Compute the maximum value in each group
:meth:`~.DataFrameGroupBy.mean`;Compute the mean of each group
Expand Down
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v2.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -416,7 +416,7 @@ Performance improvements
- Performance improvement when doing various reshaping operations on :class:`arrays.IntegerArrays` & :class:`arrays.FloatingArray` by avoiding doing unnecessary validation (:issue:`53013`)
- Performance improvement when indexing with pyarrow timestamp and duration dtypes (:issue:`53368`)
- Performance improvement when passing an array to :meth:`RangeIndex.take`, :meth:`DataFrame.loc`, or :meth:`DataFrame.iloc` and the DataFrame is using a RangeIndex (:issue:`53387`)
-
- Performance improvement in :meth:`SeriesGroupBy.idxmax`, :meth:`SeriesGroupBy.idxmin`, :meth:`DataFrameGroupBy.idxmax`, :meth:`DataFrameGroupBy.idxmin` (PR#??)
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved

.. ---------------------------------------------------------------------------
.. _whatsnew_210.bug_fixes:
Expand Down Expand Up @@ -574,6 +574,7 @@ Groupby/resample/rolling
- Bug in :meth:`GroupBy.var` failing to raise ``TypeError`` when called with datetime64, timedelta64 or :class:`PeriodDtype` values (:issue:`52128`, :issue:`53045`)
- Bug in :meth:`DataFrameGroupby.resample` with ``kind="period"`` raising ``AttributeError`` (:issue:`24103`)
- Bug in :meth:`Resampler.ohlc` with empty object returning a :class:`Series` instead of empty :class:`DataFrame` (:issue:`42902`)
- Bug in :meth:`SeriesGroupBy.idxmax`, :meth:`SeriesGroupBy.idxmin`, :meth:`DataFrameGroupBy.idxmax`, :meth:`DataFrameGroupBy.idxmin` would fail if the groupings have unobserved categories (:issue:`10694`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im back to being confused here. why isn't raising correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now consistently raise on empty groups from unobserved categories. We warn on empty groups from NA values in parallel with DataFrame/Series.

- Bug in :meth:`SeriesGroupBy.nth` and :meth:`DataFrameGroupBy.nth` after performing column selection when using ``dropna="any"`` or ``dropna="all"`` would not subset columns (:issue:`53518`)
- Bug in :meth:`SeriesGroupBy.nth` and :meth:`DataFrameGroupBy.nth` raised after performing column selection when using ``dropna="any"`` or ``dropna="all"`` resulted in rows being dropped (:issue:`53518`)
- Bug in :meth:`SeriesGroupBy.sum` and :meth:`DataFrameGroupby.sum` summing ``np.inf + np.inf`` and ``(-np.inf) + (-np.inf)`` to ``np.nan`` (:issue:`53606`)
Expand Down
142 changes: 142 additions & 0 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1780,6 +1780,96 @@ cdef group_min_max(
)


@cython.wraparound(False)
@cython.boundscheck(False)
cdef group_idxmin_idxmax(
int64_t[:, ::1] out,
int64_t[::1] counts,
ndarray[numeric_t, ndim=2] values,
const intp_t[::1] labels,
Py_ssize_t min_count=-1,
bint is_datetimelike=False,
bint compute_max=True,
const uint8_t[:, ::1] mask=None,
uint8_t[:, ::1] result_mask=None,
):
"""
Compute index of minimum/maximum of columns of `values`, in row groups `labels`.

This function only computes the row number where the minimum/maximum occurs, we'll
take the corresponding index value after this function.

Parameters
----------
out : np.ndarray[int64_t, ndim=2]
WillAyd marked this conversation as resolved.
Show resolved Hide resolved
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
Array to store result in.
counts : np.ndarray[int64]
Input as a zeroed array, populated by group sizes during algorithm
values : array
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
Values to find column-wise min/max of.
labels : np.ndarray[np.intp]
Labels to group by.
min_count : Py_ssize_t, default -1
Unused. For compatibility with other Cython ops.
is_datetimelike : bool
True if `values` contains datetime-like entries.
compute_max : bint, default True
True to compute group-wise max, False to compute min
mask : ndarray[bool, ndim=2], optional
If not None, indices represent missing values,
otherwise the mask will not be used
result_mask : ndarray[bool, ndim=2], optional
If not None, these specify locations in the output that are NA.
Modified in-place.

Notes
-----
This method modifies the `out` parameter, rather than returning an object.
`counts` is modified to hold group sizes
"""
cdef:
Py_ssize_t i, j, N, K, lab
numeric_t val
numeric_t[:, ::1] group_min_or_max
bint uses_mask = mask is not None
bint isna_entry

# TODO(cython3):
# Instead of `labels.shape[0]` use `len(labels)`
if not len(values) == labels.shape[0]:
raise AssertionError("len(index) != len(labels)")

group_min_or_max = np.empty_like(out, dtype=values.dtype)
group_min_or_max[:] = _get_min_or_max(<numeric_t>0, compute_max, is_datetimelike)

N, K = (<object>values).shape

with nogil:
for i in range(N):
lab = labels[i]
if lab < 0:
continue

counts[lab] += 1
for j in range(K):
val = values[i, j]

if uses_mask:
isna_entry = mask[i, j]
else:
isna_entry = _treat_as_na(val, is_datetimelike)

if not isna_entry:
if compute_max:
if val > group_min_or_max[lab, j]:
group_min_or_max[lab, j] = val
out[lab, j] = i
else:
if val < group_min_or_max[lab, j]:
group_min_or_max[lab, j] = val
out[lab, j] = i


@cython.wraparound(False)
@cython.boundscheck(False)
def group_max(
Expand All @@ -1806,6 +1896,32 @@ def group_max(
)


@cython.wraparound(False)
@cython.boundscheck(False)
def group_idxmax(
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
int64_t[:, ::1] out,
int64_t[::1] counts,
ndarray[numeric_t, ndim=2] values,
const intp_t[::1] labels,
Py_ssize_t min_count=-1,
bint is_datetimelike=False,
const uint8_t[:, ::1] mask=None,
uint8_t[:, ::1] result_mask=None,
) -> None:
"""See group_idxmin_idxmax.__doc__"""
group_idxmin_idxmax(
out,
counts,
values,
labels,
min_count=min_count,
is_datetimelike=is_datetimelike,
compute_max=True,
mask=mask,
result_mask=result_mask,
)


@cython.wraparound(False)
@cython.boundscheck(False)
def group_min(
Expand All @@ -1832,6 +1948,32 @@ def group_min(
)


@cython.wraparound(False)
@cython.boundscheck(False)
def group_idxmin(
int64_t[:, ::1] out,
int64_t[::1] counts,
ndarray[numeric_t, ndim=2] values,
const intp_t[::1] labels,
Py_ssize_t min_count=-1,
bint is_datetimelike=False,
const uint8_t[:, ::1] mask=None,
uint8_t[:, ::1] result_mask=None,
) -> None:
"""See group_idxmin_idxmax.__doc__"""
group_idxmin_idxmax(
out,
counts,
values,
labels,
min_count=min_count,
is_datetimelike=is_datetimelike,
compute_max=False,
mask=mask,
result_mask=result_mask,
)


@cython.boundscheck(False)
@cython.wraparound(False)
cdef group_cummin_max(
Expand Down
14 changes: 12 additions & 2 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -2679,12 +2679,22 @@ def _groupby_op(
dtype = self.dtype
if how in ["sum", "prod", "cumsum", "cumprod", "skew"]:
raise TypeError(f"{dtype} type does not support {how} operations")
if how in ["min", "max", "rank"] and not dtype.ordered:
if how in ["min", "max", "rank", "idxmin", "idxmax"] and not dtype.ordered:
# raise TypeError instead of NotImplementedError to ensure we
# don't go down a group-by-group path, since in the empty-groups
# case that would fail to raise
raise TypeError(f"Cannot perform {how} with non-ordered Categorical")
if how not in ["rank", "any", "all", "first", "last", "min", "max"]:
if how not in [
"rank",
"any",
"all",
"first",
"last",
"min",
"max",
"idxmin",
"idxmax",
]:
if kind == "transform":
raise TypeError(f"{dtype} type does not support {how} operations")
raise TypeError(f"{dtype} dtype does not support aggregation '{how}'")
Expand Down
68 changes: 50 additions & 18 deletions pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1185,15 +1185,33 @@ def nsmallest(
def idxmin(
self, axis: Axis | lib.NoDefault = lib.no_default, skipna: bool = True
) -> Series:
result = self._op_via_apply("idxmin", axis=axis, skipna=skipna)
return result.astype(self.obj.index.dtype) if result.empty else result
if axis is not lib.no_default:
axis = self.obj._get_axis_number(axis)
self._deprecate_axis(axis, "idxmax")
if axis == 1:
result = self._op_via_apply("idxmax", axis=axis, skipna=skipna)
if result.empty:
result = result.astype(self.obj.index.dtype)
else:
alt = lambda x: x.idxmin(skipna=skipna)
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
result = self._idxmax_idxmin("idxmin", alt)
return result

@doc(Series.idxmax.__doc__)
def idxmax(
self, axis: Axis | lib.NoDefault = lib.no_default, skipna: bool = True
) -> Series:
result = self._op_via_apply("idxmax", axis=axis, skipna=skipna)
return result.astype(self.obj.index.dtype) if result.empty else result
if axis is not lib.no_default:
axis = self.obj._get_axis_number(axis)
self._deprecate_axis(axis, "idxmax")
if axis == 1:
result = self._op_via_apply("idxmax", axis=axis, skipna=skipna)
if result.empty:
result = result.astype(self.obj.index.dtype)
else:
alt = lambda x: x.idxmax(skipna=skipna)
result = self._idxmax_idxmin("idxmax", alt)
return result

@doc(Series.corr.__doc__)
def corr(
Expand Down Expand Up @@ -2195,14 +2213,21 @@ def idxmax(
else:
axis = self.axis

def func(df):
return df.idxmax(axis=axis, skipna=skipna, numeric_only=numeric_only)
if axis == 1:

func.__name__ = "idxmax"
result = self._python_apply_general(
func, self._obj_with_exclusions, not_indexed_same=True
)
return result.astype(self.obj.index.dtype) if result.empty else result
def func(df):
return df.idxmax(axis=axis, skipna=skipna, numeric_only=numeric_only)

func.__name__ = "idxmax"
result = self._python_apply_general(
func, self._obj_with_exclusions, not_indexed_same=True
)
if result.empty:
result = result.astype(self.obj.index.dtype)
else:
alt = lambda x: x.idxmax(skipna=skipna)
result = self._idxmax_idxmin("idxmax", alt, numeric_only)
return result

def idxmin(
self,
Expand Down Expand Up @@ -2290,14 +2315,21 @@ def idxmin(
else:
axis = self.axis

def func(df):
return df.idxmin(axis=axis, skipna=skipna, numeric_only=numeric_only)
if axis == 1:

func.__name__ = "idxmin"
result = self._python_apply_general(
func, self._obj_with_exclusions, not_indexed_same=True
)
return result.astype(self.obj.index.dtype) if result.empty else result
def func(df):
return df.idxmin(axis=axis, skipna=skipna, numeric_only=numeric_only)

func.__name__ = "idxmin"
result = self._python_apply_general(
func, self._obj_with_exclusions, not_indexed_same=True
)
if result.empty:
result = result.astype(self.obj.index.dtype)
else:
alt = lambda x: x.idxmin(skipna=skipna)
result = self._idxmax_idxmin("idxmin", alt, numeric_only)
return result

boxplot = boxplot_frame_groupby

Expand Down
Loading