BUG: groupby.idxmax/idxmin consistently raise on unobserved categorical #55268

rhshadrach · 2023-09-24T20:10:46Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Part of #10694, but doesn't close it fully.

When given a non-ordered category with unobserved categories, we currently raise about the unobserved categories. We should instead raise about the non-orderedness for consistency with min/max.
_python_apply_general does not return the correct dtype when there is a CategoricalIndex with NaN values.
_python_apply_general fails on a single grouping with unobserved categories even when we're calling it from transform, where the unobserved categories should have no impact.
_python_apply_general with an empty DataFrame with no numeric columns returns all the columns even when numeric_only=True.

All of these are fixed in #54234
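
For context on the "consistency with min/max" point in the description, here is a minimal sketch of how groupby min treats an unordered categorical column. The frame, the column names `key` and `val`, and the helper `min_error_type` are made up for illustration. It assumes a pandas version in which a non-ordered Categorical is rejected with a TypeError; the exact message text may differ between versions, so it is not shown.

```python
import pandas as pd

# Illustrative data: "val" is an unordered categorical with one unobserved
# category ("c"); "key" is a plain grouping column.
df = pd.DataFrame(
    {
        "key": ["x", "x", "y"],
        "val": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    }
)


def min_error_type(frame):
    """Return the name of the exception raised by groupby min, or None."""
    try:
        frame.groupby("key")["val"].min()
    except TypeError as err:
        return type(err).__name__
    return None


# Unordered categories have no defined minimum, so min is expected to raise.
unordered_error = min_error_type(df)

# Marking the same categories as ordered makes min well defined.
ordered = df.assign(val=df["val"].cat.as_ordered())
ordered_min = ordered.groupby("key")["val"].min()
```

The description asks that idxmin and idxmax complain about the same thing (non-orderedness) rather than about unobserved categories when both conditions hold.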

mroeschke

LGTM

rhshadrach · 2023-09-28T14:20:21Z

Friendly ping @WillAyd @jbrockmendel

WillAyd · 2023-09-29T13:33:42Z

pandas/core/groupby/groupby.py

-                    kwargs["engine"] = engine
-                    kwargs["engine_kwargs"] = engine_kwargs
-                result = getattr(self, func)(*args, **kwargs)
+                if func in ["idxmin", "idxmax"]:


Is it not possible to stick with the same engine pattern that is in place for this? At first glance I'm wondering what makes these different from min/max that requires branching here.

i think this is the same issue as i asked about somewhere else: _idxmax_idxmin accepts ignore_unobserved while idxmin/idxmax do not

The main difference in behavior is that min/max do not raise on unobserved categories, while idxmin and idxmax do.

Example - agg

df = pd.DataFrame({'a': pd.Categorical([1, 1, 2], categories=[1, 2, 3]), 'b': [3, 4, 5]})
gb = df.groupby('a', observed=False)
result = gb.min()
print(result)
#      b
# a
# 1  3.0
# 2  5.0
# 3  NaN

However, the fact that we don't do something special for min/max means that transform unnecessarily coerces to float:

Example - transform

df = pd.DataFrame({'a': pd.Categorical([1, 1, 2], categories=[1, 2, 3]), 'b': [3, 4, 5]})
gb = df.groupby('a', observed=False)
result = gb.transform('min')
print(result)
#      b
# 0  3.0
# 1  3.0
# 2  5.0

I consider this a bug in min/max with transform.

I've opened #55326 to track

Ah OK thanks - that is helpful context. So do you see the solution for min/max being the same as what you have for idxmin/idxmax here?

I think the broader issue is that we've wanted over time to move away from branching for function specializations within groupby. If that still holds true then I wonder what prevents us from sticking with the existing kwargs interface to solve both this PR and eventually solve min/max's issue

I'd like to explore different solutions for min/max and other aggregations, but I don't know what that could be at this time.

I don't understand what you're suggesting with kwargs.

Sorry I thought the engine / engine_kwargs were specialized arguments for each function implementation, but I see now those are meant for numba.

The numba functions are UDFs right? I'm assuming from the branch here that we would never want to pass numba arguments to _idxmax_idxmin

Correct - we never get here when using numba.

WillAyd · 2023-09-29T13:41:19Z

pandas/core/groupby/groupby.py

+            # an empty DataFrame with an index (possibly including unobserved) but no
+            # columns
+            data = self._obj_with_exclusions
+            if isinstance(data, DataFrame):


Suggested change

-            if isinstance(data, DataFrame):
+            if raise_err and isinstance(data, DataFrame):

I don't think we even need to go down this path if raise_err isn't True to begin with, right?

~~Yea - not raise_err, right?~~ 🤦

WillAyd · 2023-09-29T13:42:00Z

pandas/core/groupby/groupby.py

+            if isinstance(data, DataFrame):
+                if numeric_only:
+                    data = data._get_numeric_data()
+                raise_err &= len(data.columns) > 0


If the above is correct this can become simple assignment
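
The equivalence behind this remark can be checked with a small standalone sketch. The functions `original_form` and `suggested_form` are hypothetical stand-ins, not pandas code: `is_frame` replaces the `isinstance` check and `n_columns` replaces `len(data.columns)` after any numeric-only filtering.

```python
from itertools import product


def original_form(raise_err: bool, is_frame: bool, n_columns: int) -> bool:
    # Shape of the diff: enter the branch for every DataFrame and combine
    # the existing flag with the column check.
    if is_frame:
        raise_err &= n_columns > 0
    return raise_err


def suggested_form(raise_err: bool, is_frame: bool, n_columns: int) -> bool:
    # Shape of the suggestion: skip the branch when the flag is already
    # False. Inside the branch the flag is known to be True, so plain
    # assignment gives the same result as "&=".
    if raise_err and is_frame:
        raise_err = n_columns > 0
    return raise_err


# Every combination of flag, frame-or-not, and empty/non-empty columns.
cases = list(product([False, True], [False, True], [0, 2]))
mismatches = [c for c in cases if original_form(*c) != suggested_form(*c)]
```

An empty `mismatches` list means the two forms agree on every combination. The suggested form also avoids the column filtering work when no error would be raised anyway.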

rhshadrach · 2023-10-04T12:16:25Z

@WillAyd - are you good here?

WillAyd

minor comment/question but otherwise I think this looks good

WillAyd · 2023-10-04T16:43:01Z

pandas/core/groupby/groupby.py

+                if func in ["idxmin", "idxmax"]:
+                    func = cast(Literal["idxmin", "idxmax"], func)
+                    result = self._idxmax_idxmin(
+                        func, ignore_unobserved=True, axis=self.axis, **kwargs


We definitely don't need *args here right? Seems like something could get discarded compared to the other branch

Ack - thanks! Fixed and test added.
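
The concern about discarded positional arguments can be illustrated with a toy dispatcher. `ToyGroupBy` and its methods are hypothetical stand-ins rather than the pandas implementation, and the forwarding variant is only one plausible shape for the fix, which this thread does not show.

```python
from typing import Any


class ToyGroupBy:
    """Hypothetical stand-in for a groupby object; not the pandas class."""

    def _idxmax_idxmin(
        self, how: str, *args: Any, ignore_unobserved: bool = False, **kwargs: Any
    ):
        # Report what was received so the two dispatchers can be compared.
        return {"how": how, "args": args, "kwargs": kwargs}

    def dispatch_dropping_args(self, func: str, *args: Any, **kwargs: Any):
        # Mirrors the reviewed branch: positional arguments are not passed on.
        if func in ("idxmin", "idxmax"):
            return self._idxmax_idxmin(func, ignore_unobserved=True, **kwargs)
        return getattr(self, func)(*args, **kwargs)

    def dispatch_forwarding_args(self, func: str, *args: Any, **kwargs: Any):
        # Forwards positional arguments the same way the generic branch does.
        if func in ("idxmin", "idxmax"):
            return self._idxmax_idxmin(
                func, *args, ignore_unobserved=True, **kwargs
            )
        return getattr(self, func)(*args, **kwargs)


gb = ToyGroupBy()
dropped = gb.dispatch_dropping_args("idxmax", "positional", skipna=False)
forwarded = gb.dispatch_forwarding_args("idxmax", "positional", skipna=False)
```

In the first dispatcher the caller's positional argument silently disappears, while keyword arguments still arrive. That asymmetry with the generic `getattr` branch is what the review comment points at.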

WillAyd

Lgtm

jbrockmendel · 2023-10-09T16:21:58Z

pandas/core/groupby/groupby.py

+
+        Parameters
+        ----------
+        how: {"idxmin", "idxmax"}


nitpick i think missing space between "how" and colon

also usually but not always we have single-quotes inside these (no idea why and i genuinely dont care, extra since this is private)

Thanks - fixed in #54234

jbrockmendel · 2023-10-09T16:25:26Z

pandas/core/groupby/groupby.py

+                )
+        except ValueError as err:
+            name = "argmax" if how == "idxmax" else "argmin"
+            if f"attempt to get {name} of an empty sequence" in str(err):


would this be simpler if we just changed "arg" to "idx" in the cython method, with a comment as to why we are using an apparently-wrong message?

This arises from a call to NumPy's argmin in nanops.nanargmin:

pd.Series().idxmin() # ValueError: attempt to get argmin of an empty sequence

In #54234, this remains only for the axis=1 case, and so once that deprecation is enforced, this code will be removed entirely.
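
The pattern in the diff above (match on the text of a lower-level error, then raise something more relevant to the caller) can be sketched in plain Python. `arg_extreme` is a hypothetical stand-in for the NumPy-backed reduction and only reuses the message quoted above. The replacement message in `idx_extreme` is invented for illustration and is not the one pandas uses.

```python
def arg_extreme(values, how):
    """Stand-in for the low-level reduction; reports errors in 'arg' terms."""
    name = "argmax" if how == "idxmax" else "argmin"
    if len(values) == 0:
        raise ValueError(f"attempt to get {name} of an empty sequence")
    pick = max if how == "idxmax" else min
    # Position of the extreme value.
    return pick(range(len(values)), key=values.__getitem__)


def idx_extreme(values, how):
    """Wrapper that rephrases the empty-input error in 'idx' terms."""
    try:
        return arg_extreme(values, how)
    except ValueError as err:
        name = "argmax" if how == "idxmax" else "argmin"
        if f"attempt to get {name} of an empty sequence" in str(err):
            # Hypothetical wording, chosen only for this sketch.
            raise ValueError(f"cannot compute {how} of an empty group") from None
        # Any other ValueError is passed through untouched.
        raise


position = idx_extreme([3, 9, 4], "idxmax")

try:
    idx_extreme([], "idxmin")
except ValueError as err:
    empty_message = str(err)
```

Matching on message text is brittle because it breaks if the upstream wording changes. That fits the note above that this code path is expected to be removed once the axis=1 deprecation is enforced.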

rhshadrach added 2 commits September 24, 2023 15:43

BUG: groupby.idxmax/idxmin consistently raise on unobserved categorical

1325855

cleanups

e2dd88a

rhshadrach requested a review from mroeschke as a code owner September 24, 2023 20:10

rhshadrach added the Groupby, Categorical (Categorical Data Type), and Reduction Operations (sum, mean, min, max, etc.) labels Sep 24, 2023

fixups

5daaaed

rhshadrach added this to the 2.2 milestone Sep 24, 2023

rhshadrach requested review from WillAyd and jbrockmendel September 24, 2023 20:34

type-hint fixup

c730ddb

mroeschke approved these changes Sep 25, 2023

WillAyd reviewed Sep 29, 2023

jbrockmendel reviewed Sep 29, 2023

jbrockmendel reviewed Sep 29, 2023

rhshadrach and others added 4 commits September 29, 2023 14:28

simplify

2bcdd81

Merge branch 'gb_idxmin_idxmax_unobserved' of https://github.com/rhsh…

a97cd21

…adrach/pandas into gb_idxmin_idxmax_unobserved

fixup

d62297c

Merge branch 'main' into gb_idxmin_idxmax_unobserved

34f4116

WillAyd reviewed Oct 4, 2023

rhshadrach added 3 commits October 7, 2023 10:41

Merge branch 'main' of https://github.com/pandas-dev/pandas into gb_i…

2598821

…dxmin_idxmax_unobserved Conflicts: doc/source/whatsnew/v2.2.0.rst

Fix passing through *args

b2a36eb

fixup

a1954d6

rhshadrach requested a review from WillAyd October 7, 2023 17:45

WillAyd approved these changes Oct 8, 2023

rhshadrach merged commit 3fac6f2 into pandas-dev:main Oct 8, 2023
33 checks passed

rhshadrach deleted the gb_idxmin_idxmax_unobserved branch October 8, 2023 18:50

jbrockmendel reviewed Oct 9, 2023

rhshadrach mentioned this pull request Oct 9, 2023

PERF: Implement groupby idxmax/idxmin in Cython #54234

Merged

5 tasks
