-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: groupby.idxmax/idxmin consistently raise on unobserved categorical #55268
BUG: groupby.idxmax/idxmin consistently raise on unobserved categorical #55268
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Friendly ping @WillAyd @jbrockmendel |
kwargs["engine"] = engine | ||
kwargs["engine_kwargs"] = engine_kwargs | ||
result = getattr(self, func)(*args, **kwargs) | ||
if func in ["idxmin", "idxmax"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible to stick with the same engine pattern that is in place for this? At first glance I'm wondering what makes these different from min
/`max`` that requires branching here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is the same issue as i asked about somewhere else: _idxmax_idxmin accepts ignore_unobserved while idxmin/idxmax do not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main difference in behavior is that min/max do not raise on unobserved categories, while idxmin and idxmax do.
Example - agg
df = pd.DataFrame({'a': pd.Categorical([1, 1, 2], categories=[1, 2, 3]), 'b': [3, 4, 5]})
gb = df.groupby('a', observed=False)
result = gb.min()
print(result)
# b
# a
# 1 3.0
# 2 5.0
# 3 NaN
However, the fact that we don't do something special for min/max means that transform unnecessarily coerces to float:
Example - transform
df = pd.DataFrame({'a': pd.Categorical([1, 1, 2], categories=[1, 2, 3]), 'b': [3, 4, 5]})
gb = df.groupby('a', observed=False)
result = gb.transform('min')
print(result)
# b
# 0 3.0
# 1 3.0
# 2 5.0
I consider this a bug in min/max with transform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened #55326 to track
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks - that is helpful context. So do you see the solution for min/max being the same as what you have for idxmin/idxmax here?
I think the broader issue is that we've wanted over time to move away from branching for function specializations within groupby. If that still holds true then I wonder what prevents us from sticking with the existing kwargs interface to solve both this PR and eventually solve min/max's issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to explore different solutions for min/max and other aggregations, but I don't know what that could be at this time.
I don't understand what you're suggesting with kwargs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I thought the engine / engine_kwargs were specialized arguments for each function implementation, but I see now those are meant for numba.
The numba functions are UDFs right? I'm assuming from the branch here that we would never want to pass numba arguments to _idxmax_idxmin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - we never get here when using numba.
pandas/core/groupby/groupby.py
Outdated
# an empty DataFrame with an index (possibly including unobserved) but no | ||
# columns | ||
data = self._obj_with_exclusions | ||
if isinstance(data, DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(data, DataFrame): | |
if raise_err and isinstance(data, DataFrame): |
I don't think we even need to go down this path if raise_err
isn't True to being with right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea - 🤦not raise_err
, right?
pandas/core/groupby/groupby.py
Outdated
if isinstance(data, DataFrame): | ||
if numeric_only: | ||
data = data._get_numeric_data() | ||
raise_err &= len(data.columns) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the above is correct this can become simple assignment
@WillAyd - are you good here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment/question but otherwise I think this looks good
pandas/core/groupby/groupby.py
Outdated
if func in ["idxmin", "idxmax"]: | ||
func = cast(Literal["idxmin", "idxmax"], func) | ||
result = self._idxmax_idxmin( | ||
func, ignore_unobserved=True, axis=self.axis, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely don't need *args here right? Seems like something could get discarded compared to the other branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack - thanks! Fixed and test added.
…dxmin_idxmax_unobserved � Conflicts: � doc/source/whatsnew/v2.2.0.rst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
|
||
Parameters | ||
---------- | ||
how: {"idxmin", "idxmax"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick i think missing space between "how" and colon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also usually but not always we have single-quotes inside these (no idea why and i genuinely dont care, extra since this is private)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - fixed in #54234
) | ||
except ValueError as err: | ||
name = "argmax" if how == "idxmax" else "argmin" | ||
if f"attempt to get {name} of an empty sequence" in str(err): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this be simpler if we just said changed "arg" to "idx" in the cython method with a comment as to why we are using an apparently-wrong message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This arises from a call to NumPy's argmin in nanops.nanargmin:
pd.Series().idxmin()
# ValueError: attempt to get argmin of an empty sequence
In #54234, this remains only for the axis=1
case, and so once that deprecation is enforced, this code will be removed entirely.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Part of #10694, but doesn't close it fully.
numeric_only=True
.All of these are fixed in #54234