BUG: groupby.idxmax/idxmin consistently raise on unobserved categorical #55268

Merged
merged 11 commits into from
Oct 8, 2023

Member

@rhshadrach rhshadrach commented Sep 24, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Part of #10694, but doesn't close it fully.

  • When given a non-ordered category with unobserved categories, we currently raise about the unobserved categories. We should instead raise about the non-orderedness for consistency with min/max.
  • _python_apply_general does not return the correct dtype when there is a CategoricalIndex with NaN values.
  • _python_apply_general fails on a single grouping with unobserved categories even when we call it from transform, where the unobserved categories should have no impact.
  • _python_apply_general with an empty DataFrame with no numeric columns returns all the columns even when numeric_only=True.

All of these are fixed in #54234

@rhshadrach rhshadrach added Groupby Categorical Categorical Data Type Reduction Operations sum, mean, min, max, etc. labels Sep 24, 2023
@rhshadrach rhshadrach added this to the 2.2 milestone Sep 24, 2023
Member

@mroeschke mroeschke left a comment

LGTM

@rhshadrach
Member Author

Friendly ping @WillAyd @jbrockmendel

kwargs["engine"] = engine
kwargs["engine_kwargs"] = engine_kwargs
result = getattr(self, func)(*args, **kwargs)
if func in ["idxmin", "idxmax"]:
Member

Is it not possible to stick with the same engine pattern that is in place for this? At first glance I'm wondering what makes these different from min/max that requires branching here.

Member

i think this is the same issue as i asked about somewhere else: _idxmax_idxmin accepts ignore_unobserved while idxmin/idxmax do not

Member Author

The main difference in behavior is that min/max do not raise on unobserved categories, while idxmin and idxmax do.

Example - agg
import pandas as pd

df = pd.DataFrame({'a': pd.Categorical([1, 1, 2], categories=[1, 2, 3]), 'b': [3, 4, 5]})
gb = df.groupby('a', observed=False)
result = gb.min()
print(result)
#      b
# a     
# 1  3.0
# 2  5.0
# 3  NaN

However, the fact that we don't do something special for min/max means that transform unnecessarily coerces to float:

Example - transform
import pandas as pd

df = pd.DataFrame({'a': pd.Categorical([1, 1, 2], categories=[1, 2, 3]), 'b': [3, 4, 5]})
gb = df.groupby('a', observed=False)
result = gb.transform('min')
print(result)
#      b
# 0  3.0
# 1  3.0
# 2  5.0

I consider this a bug in min/max with transform.
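
To illustrate why unobserved categories need not affect a transform, here is a minimal pure-Python sketch. It is not how pandas implements groupby; the helpers group_idxmax and transform_idxmax are hypothetical names introduced only for this illustration, and the data mirrors the examples above.

```python
def group_idxmax(keys, values, categories):
    """Aggregate: map every category to the row index of its largest value.

    Unobserved categories have no rows, so their entry stays None.
    """
    best = {cat: None for cat in categories}
    for i, (key, value) in enumerate(zip(keys, values)):
        if best[key] is None or value > values[best[key]]:
            best[key] = i
    return best


def transform_idxmax(keys, values, categories):
    """Transform: broadcast each group's result back to its own rows.

    Every row belongs to an observed group, so the None entries of
    unobserved categories are never looked up.
    """
    per_group = group_idxmax(keys, values, categories)
    return [per_group[key] for key in keys]


keys = [1, 1, 2]          # category 3 is declared but never observed
values = [3, 4, 5]
categories = [1, 2, 3]

agg = group_idxmax(keys, values, categories)
out = transform_idxmax(keys, values, categories)
print(agg)  # {1: 1, 2: 2, 3: None}
print(out)  # [1, 1, 2]
```

The aggregated result has a hole for category 3, but the transformed result does not, which is the intuition behind treating unobserved categories differently when the call originates from transform.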

Member Author

I've opened #55326 to track

Member

Ah OK thanks - that is helpful context. So do you see the solution for min/max being the same as what you have for idxmin/idxmax here?

I think the broader issue is that we've wanted over time to move away from branching for function specializations within groupby. If that still holds true then I wonder what prevents us from sticking with the existing kwargs interface to solve both this PR and eventually solve min/max's issue

Member Author

I'd like to explore different solutions for min/max and other aggregations, but I don't know what that could be at this time.

I don't understand what you're suggesting with kwargs.

Member

Sorry I thought the engine / engine_kwargs were specialized arguments for each function implementation, but I see now those are meant for numba.

The numba functions are UDFs right? I'm assuming from the branch here that we would never want to pass numba arguments to _idxmax_idxmin

Member Author

Correct - we never get here when using numba.

# an empty DataFrame with an index (possibly including unobserved) but no
# columns
data = self._obj_with_exclusions
if isinstance(data, DataFrame):
Member

Suggested change
- if isinstance(data, DataFrame):
+ if raise_err and isinstance(data, DataFrame):

I don't think we even need to go down this path if raise_err isn't True to begin with, right?

Member Author

@rhshadrach rhshadrach Sep 29, 2023

Yea - not raise_err, right? 🤦

if isinstance(data, DataFrame):
    if numeric_only:
        data = data._get_numeric_data()
    raise_err &= len(data.columns) > 0
Member

If the above is correct this can become simple assignment
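
The reviewer's point can be checked with a small pure-Python sketch. The two functions below are hypothetical simplifications of the snippet under review: n_columns stands for the number of columns left after any numeric_only filtering, and is_frame stands for the isinstance check. Once the branch is guarded by raise_err, the compound assignment has nothing left to combine with, so a plain assignment gives the same answer.

```python
from itertools import product


def raise_err_compound(raise_err, is_frame, n_columns):
    # Shape of the original snippet: unguarded branch, compound assignment.
    if is_frame:
        raise_err &= n_columns > 0
    return raise_err


def raise_err_guarded(raise_err, is_frame, n_columns):
    # Shape after the suggestion: guard on raise_err, simple assignment.
    if raise_err and is_frame:
        raise_err = n_columns > 0
    return raise_err


# Exhaustively compare both versions over a small input grid.
cases = list(product([False, True], [False, True], [0, 1, 3]))
agree = all(raise_err_compound(*c) == raise_err_guarded(*c) for c in cases)
print(agree)  # True
```

The guarded version also skips the column filtering entirely when no error would be raised, which is the saving the review comment is after.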

@rhshadrach
Member Author

@WillAyd - are you good here?

Member

@WillAyd WillAyd left a comment

minor comment/question but otherwise I think this looks good

if func in ["idxmin", "idxmax"]:
    func = cast(Literal["idxmin", "idxmax"], func)
    result = self._idxmax_idxmin(
        func, ignore_unobserved=True, axis=self.axis, **kwargs
Member

We definitely don't need *args here right? Seems like something could get discarded compared to the other branch

Member Author

Ack - thanks! Fixed and test added.
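
The concern about discarded positional arguments can be shown with a generic sketch. Nothing here is pandas code: generic, specialized, dispatch_lossy and dispatch_forwarding are hypothetical stand-ins, and forwarding *args is only one possible way to address the concern (the thread does not say which fix was applied).

```python
def generic(*args, **kwargs):
    # Stand-in for the ordinary getattr(self, func)(*args, **kwargs) path.
    return ("generic", args, kwargs)


def specialized(how, *args, ignore_unobserved=False, **kwargs):
    # Stand-in for a private helper that takes an extra keyword.
    return (how, args, kwargs, ignore_unobserved)


def dispatch_lossy(func, *args, **kwargs):
    if func in ("idxmin", "idxmax"):
        # Only **kwargs is forwarded: positional arguments vanish silently.
        return specialized(func, ignore_unobserved=True, **kwargs)
    return generic(*args, **kwargs)


def dispatch_forwarding(func, *args, **kwargs):
    if func in ("idxmin", "idxmax"):
        # Both branches now forward the same arguments.
        return specialized(func, *args, ignore_unobserved=True, **kwargs)
    return generic(*args, **kwargs)


lossy = dispatch_lossy("idxmax", "positional", skipna=False)
kept = dispatch_forwarding("idxmax", "positional", skipna=False)
print(lossy)  # ('idxmax', (), {'skipna': False}, True)
print(kept)   # ('idxmax', ('positional',), {'skipna': False}, True)
```

Because the lossy branch raises no error, a regression of this kind is only caught by a test that passes a positional argument through the special-cased path.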

…dxmin_idxmax_unobserved

# Conflicts:
#	doc/source/whatsnew/v2.2.0.rst
@rhshadrach rhshadrach requested a review from WillAyd October 7, 2023 17:45
Member

@WillAyd WillAyd left a comment

Lgtm

@rhshadrach rhshadrach merged commit 3fac6f2 into pandas-dev:main Oct 8, 2023
33 checks passed
@rhshadrach rhshadrach deleted the gb_idxmin_idxmax_unobserved branch October 8, 2023 18:50

Parameters
----------
how: {"idxmin", "idxmax"}
Member

nitpick i think missing space between "how" and colon

Member

also usually but not always we have single-quotes inside these (no idea why and i genuinely dont care, extra since this is private)

Member Author

Thanks - fixed in #54234

    )
except ValueError as err:
    name = "argmax" if how == "idxmax" else "argmin"
    if f"attempt to get {name} of an empty sequence" in str(err):
Member

would this be simpler if we just changed "arg" to "idx" in the cython method with a comment as to why we are using an apparently-wrong message?

Member Author

@rhshadrach rhshadrach Oct 9, 2023

This arises from a call to NumPy's argmin in nanops.nanargmin:

pd.Series().idxmin()
# ValueError: attempt to get argmin of an empty sequence

In #54234, this remains only for the axis=1 case, and so once that deprecation is enforced, this code will be removed entirely.
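
The message-matching pattern under discussion can be sketched without pandas or NumPy. In the sketch below, argmin_like is a hypothetical stand-in that raises the same text quoted above, group_idxmin is a hypothetical wrapper, and the replacement message is invented for illustration (it is not the wording pandas uses).

```python
def argmin_like(values):
    # Stand-in for the lower-level call; the message text is the one
    # quoted in the thread.
    if len(values) == 0:
        raise ValueError("attempt to get argmin of an empty sequence")
    return min(range(len(values)), key=lambda i: values[i])


def group_idxmin(values, how="idxmin"):
    try:
        return argmin_like(values)
    except ValueError as err:
        name = "argmax" if how == "idxmax" else "argmin"
        if f"attempt to get {name} of an empty sequence" in str(err):
            # Translate the low-level message into one about the public
            # method; this wording is hypothetical.
            raise ValueError(
                f"cannot compute {how} of an empty group (hypothetical message)"
            ) from None
        raise  # any other ValueError propagates unchanged


position = group_idxmin([3, 1, 2])
try:
    group_idxmin([])
except ValueError as err:
    message = str(err)
print(position)  # 1
print(message)   # cannot compute idxmin of an empty group (hypothetical message)
```

Matching on message text is brittle because it breaks if the upstream wording changes, which is consistent with the plan above to remove this code once the axis=1 deprecation is enforced.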
