-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/PERF: groupby.transform with unobserved categories #58084
BUG/PERF: groupby.transform with unobserved categories #58084
Conversation
Is there an issue linked with this? |
No clue. |
this is a work in progress for issue #55326 , i have added the issue number |
ok, the pr implementation is completed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! In addition to the issue highlighted below, I think it might be a better approach to compute the result using only observed data for transforms. Not only would that fix this issue, but it would also give a good performance gain. This is on my radar to look into and may not work out, but I think it should be tried first before other approaches. If you would like to give this a shot, please feel free!
pandas/core/groupby/ops.py
Outdated
if remove_nan: | ||
mask = np.zeros(shape=values.shape, dtype=bool) | ||
result_mask = np.zeros(shape=(1, ngroups), dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't just ignore mask
, e.g. this gives the wrong result
data = pd.array([pd.NA, 2, 3, 4], dtype="Int64")
df = DataFrame({"key": ["a", "a", "b", "b"], "col": data})
grouped = df.groupby("key", observed=False)
print(grouped.transform("min"))
# col
# 0 1
# 1 1
# 2 3
# 3 3
pandas/core/groupby/groupby.py
Outdated
@@ -3089,6 +3139,7 @@ def min( | |||
min_count: int = -1, | |||
engine: Literal["cython", "numba"] | None = None, | |||
engine_kwargs: dict[str, bool] | None = None, | |||
**kwargs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try very hard to avoid adding kwargs to a method for internal use.
be71a4d
to
898fd12
Compare
8c1cef0
to
baa1b28
Compare
HI thank you for the pr review, I have changed my implementation to temporarily set observed to true (and respective groupers), so that transform will return the correct result. I have initially tried to change the result of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking close to what I was envisioning, though more attributes appear to need to be modified than I was hoping. This introduces fragility (e.g. adding a new cached attribute could break things) and possibly hard to detect bugs (issues that would only show up if you reuse a groupby instance with two different operations in a certain order). It's still the best way I see to solve it.
pandas/core/groupby/groupby.py
Outdated
grouper, exclusions, obj = get_grouper( | ||
self.orig_obj, | ||
self.keys, | ||
level=self.level, | ||
sort=self.sort, | ||
observed=True, | ||
dropna=self.dropna, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll want to cache this on the groupby instance - we do not want to have to recompute it if the groupby is reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved, the group by init now accepts observed_grouper, observed_exclusions
params
pandas/core/groupby/groupby.py
Outdated
com.temp_setattr(self, "observed", True), | ||
com.temp_setattr(self, "_grouper", grouper), | ||
com.temp_setattr(self, "exclusions", exclusions), | ||
com.temp_setattr(self, "obj", obj, condition=obj_has_not_changed), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we unconditionally set obj
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved, removed setting obj to obj
af75b3a
to
30013ee
Compare
73a6fef
to
3b9d27b
Compare
Thank you for the review, I have made the changes as requested |
pandas/core/groupby/groupby.py
Outdated
"observed_grouper", | ||
"observed_exclusions", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this, I recommend adding it as a cached method on the BaseGrouper
class in ops.py
.
@cache_readonly
def observed_grouper(self):
if all(ping._observed for ping in self.groupings):
return self
grouper = BaseGrouper(...)
return grouper
For this to work, you also need to do the same to Grouping:
@cache_readonly
def observed_grouping(self):
if self._observed:
return self
grouping = Grouping(...)
return grouping
and use the observed_groupings in the BaseGrouper call above. For BinGrouper, I think you can just always return self
(doesn't change behavior on to observed=True/False).
Also, you can ignore exclusions
- this is independent of the grouping data stored in BaseGrouper/Grouping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah my bad, I have made the changes as suggested
Thank you for the review, i have made the changes as suggested |
Thanks for the changes @undermyumbrella1 - this is looking good! I have some minor refactor/style requests, but I'd like to get another eye here before any more work is done. @mroeschke - would you be able to take a look? In addition to the issue linked in the OP, this is fixing a regression caused by #55738:
While it's undesirable to swap out the grouper as is done here, I do not see any better way. There may be more efficient ways of computed the observed codes / result_index, but that can be readily built upon this later on. |
Thank you for the review, I have updated the pr according to comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few style requests, otherwise looks great!
|
||
|
||
# GH#58084 | ||
def test_min_multiple_unobserved_categories_no_type_coercion(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems redundant to me - I think the above test is sufficient here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
|
||
|
||
# GH#58084 | ||
def test_min_float32_multiple_unobserved_categories_no_type_coercion(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you instead parametrize test_min_one_unobserved_category_no_type_coercion
. Something like
@pytest.mark.parametrize("dtype", ["int32", "float32"])
def test_min_one_unobserved_category_no_type_coercion(dtype):
...
df["B"] = df["B"].astype(dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
categories=[ | ||
1, | ||
"randomcat", | ||
100, | ||
333, | ||
"cat43543", | ||
-4325466, | ||
54665, | ||
-546767, | ||
"432945", | ||
767076, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a need for so many here - can you make it 1-3 categories (so the test is more compact).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
pandas/core/groupby/generic.py
Outdated
@@ -2044,6 +2044,7 @@ def _gotitem(self, key, ndim: int, subset=None): | |||
elif ndim == 1: | |||
if subset is None: | |||
subset = self.obj[key] | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert this line addition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still appears in the diff of this PR.
49f5a1e
to
f3a3f63
Compare
64aa8cd
to
58e759f
Compare
Thank you for the review, I have updated the pr according to comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good - just some unintentional changes to core/generic.py
and core/groupby/generic.py
- I think you deleted a line from the former instead of the latter 😄
Also - a note about force pushing. Force pushing on your PR is okay, but do know it can make review a little harder. Namely, when you force push the "Show changes since your last review" option no longer works.
pandas/core/groupby/generic.py
Outdated
@@ -2044,6 +2044,7 @@ def _gotitem(self, key, ndim: int, subset=None): | |||
elif ndim == 1: | |||
if subset is None: | |||
subset = self.obj[key] | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still appears in the diff of this PR.
pandas/core/generic.py
Outdated
@@ -2055,7 +2055,6 @@ def __setstate__(self, state) -> None: | |||
object.__setattr__(self, "_attrs", attrs) | |||
flags = state.get("_flags", {"allows_duplicate_labels": True}) | |||
object.__setattr__(self, "_flags", Flags(self, **flags)) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert this line removal. Shouldn't have any diff in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
Thank you for the review, I have updated the pr according to comments. Noted on force pushing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @undermyumbrella1 - very nice! |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.