
BUG/PERF: groupby.transform with unobserved categories #58084

Conversation

undermyumbrella1 (Contributor) commented Mar 30, 2024

asishm (Contributor) commented Mar 30, 2024

Is there an issue linked with this?

Aloqeely (Member):

> Is there an issue linked with this?

No clue.
@undermyumbrella1 I'd appreciate an explanation of what this change accomplishes. And please make sure all the code tests pass

undermyumbrella1 (Contributor, Author) commented Mar 31, 2024

This is a work in progress for issue #55326; I have added the issue number.

undermyumbrella1 (Contributor, Author):

OK, the PR implementation is complete.

rhshadrach (Member) left a comment

Thanks for the PR! In addition to the issue highlighted below, I think it might be a better approach to compute the result using only observed data for transforms. Not only would that fix this issue, but it would also give a good performance gain. This is on my radar to look into and may not work out, but I think it should be tried first before other approaches. If you would like to give this a shot, please feel free!
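For context, the dtype coercion described in the linked issue can be reproduced with a minimal sketch (illustrative names, not code from this PR). Since transform emits one row per input row, unobserved categories should not leak into the result:

```python
import numpy as np
import pandas as pd

# Category "c" exists in the dtype but never appears in the data.
df = pd.DataFrame(
    {
        "key": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "val": np.array([1, 2, 3], dtype="int64"),
    }
)

# transform returns one row per input row, so the unobserved category
# "c" should not influence the values or, ideally, the result dtype.
out = df.groupby("key", observed=False)["val"].transform("min")
print(out.tolist())  # [1, 1, 3]
```

The issue reports that on affected versions the result dtype gets coerced (e.g. away from int64) because the unobserved group contributes a missing value to the intermediate reduction; computing on observed data only would avoid that entirely.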

Comment on lines 396 to 398:

    if remove_nan:
        mask = np.zeros(shape=values.shape, dtype=bool)
        result_mask = np.zeros(shape=(1, ngroups), dtype=bool)

Member:

We can't just ignore mask; e.g. this gives the wrong result:

data = pd.array([pd.NA, 2, 3, 4], dtype="Int64")
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "col": data})
grouped = df.groupby("key", observed=False)

print(grouped.transform("min"))
#    col
# 0    1
# 1    1
# 2    3
# 3    3
# (incorrect: group "a" holds [NA, 2], so its min should be 2)

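The reason the mask can't be dropped: behind a masked array, the slot under pd.NA holds an arbitrary stale sentinel value. A NumPy-only sketch (not the actual cython kernel) of a group minimum with and without the mask:

```python
import numpy as np

values = np.array([1, 2, 3, 4])               # 1 is a stale sentinel behind pd.NA
mask = np.array([True, False, False, False])  # True marks missing slots
group_ids = np.array([0, 0, 1, 1])            # groups "a", "a", "b", "b"

# Ignoring the mask lets the sentinel win: group 0's min becomes 1.
wrong = [int(values[group_ids == g].min()) for g in (0, 1)]
# Respecting the mask excludes missing slots: group 0's min is 2.
right = [int(values[(group_ids == g) & ~mask].min()) for g in (0, 1)]
print(wrong, right)  # [1, 3] [2, 3]
```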
pandas/tests/groupby/transform/test_transform.py (resolved)
@@ -3089,6 +3139,7 @@ def min(
        min_count: int = -1,
        engine: Literal["cython", "numba"] | None = None,
        engine_kwargs: dict[str, bool] | None = None,
        **kwargs,

Member:

I think we should try very hard to avoid adding kwargs to a method for internal use.
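One way to honor that without touching the public signature (a generic sketch with hypothetical names, not pandas internals) is to confine internal-only flags to a private helper:

```python
class GroupByLike:
    """Illustrative stand-in, not the pandas class."""

    def min(self, min_count: int = -1):
        # Public API stays clean: no internal-only parameters exposed.
        return self._reduce("min", min_count=min_count, _observed_only=False)

    def _reduce(self, how: str, min_count: int, _observed_only: bool):
        # Internal flags live only on the private method.
        return (how, min_count, _observed_only)


result = GroupByLike().min()
print(result)  # ('min', -1, False)
```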

@mroeschke mroeschke added Groupby Categorical Categorical Data Type Apply Apply, Aggregate, Transform, Map labels Apr 9, 2024
@undermyumbrella1 undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from be71a4d to 898fd12 Compare April 17, 2024 09:02
@undermyumbrella1 undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 8c1cef0 to baa1b28 Compare April 17, 2024 16:09
undermyumbrella1 (Contributor, Author) commented Apr 17, 2024

Hi, thank you for the PR review. I have changed my implementation to temporarily set observed to True (and the respective groupers), so that transform returns the correct result.

I initially tried to change the result of getattr(self, func)(*args, **kwargs) by using grouped reduce to map each result block to the out_dtype determined in _cython_operation. However, this implementation turned out to be far too complicated, as the out_dtype, out_shape, and views of the original value block are determined by the entire nested sequence of method calls; extracting that logic proved complicated.
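The "temporarily set observed" step relies on pandas' internal pandas.core.common.temp_setattr context manager (used later in this review). Roughly, it behaves like this self-contained sketch (the real helper also accepts a condition argument):

```python
from contextlib import contextmanager


@contextmanager
def temp_setattr(obj, attr, value):
    """Temporarily set obj.attr, restoring the original value on exit."""
    old = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield obj
    finally:
        setattr(obj, attr, old)


class GB:
    observed = False


gb = GB()
with temp_setattr(gb, "observed", True):
    inside = gb.observed   # True while the wrapped operation runs
outside = gb.observed      # False again afterwards
print(inside, outside)  # True False
```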

rhshadrach (Member) left a comment

This is looking close to what I was envisioning, though more attributes appear to need to be modified than I was hoping. This introduces fragility (e.g. adding a new cached attribute could break things) and possibly hard to detect bugs (issues that would only show up if you reuse a groupby instance with two different operations in a certain order). It's still the best way I see to solve it.

Comment on lines 1890 to 1897:

    grouper, exclusions, obj = get_grouper(
        self.orig_obj,
        self.keys,
        level=self.level,
        sort=self.sort,
        observed=True,
        dropna=self.dropna,
    )

Member:

I think we'll want to cache this on the groupby instance - we do not want to have to recompute it if the groupby is reused.

undermyumbrella1 (Contributor, Author):

Resolved; the groupby init now accepts observed_grouper and observed_exclusions params.

    com.temp_setattr(self, "observed", True),
    com.temp_setattr(self, "_grouper", grouper),
    com.temp_setattr(self, "exclusions", exclusions),
    com.temp_setattr(self, "obj", obj, condition=obj_has_not_changed),

Member:

Why can't we unconditionally set obj here?

undermyumbrella1 (Contributor, Author):

Resolved; removed the setting of obj.

@undermyumbrella1 undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from af75b3a to 30013ee Compare April 20, 2024 08:38
@undermyumbrella1 undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 73a6fef to 3b9d27b Compare April 20, 2024 09:48
undermyumbrella1 (Contributor, Author):

Thank you for the review, I have made the changes as requested

Comment on lines 591 to 592:

    "observed_grouper",
    "observed_exclusions",

Member:

Instead of this, I recommend adding it as a cached method on the BaseGrouper class in ops.py.

@cache_readonly
def observed_grouper(self):
    if all(ping._observed for ping in self.groupings):
        return self
    grouper = BaseGrouper(...)
    return grouper

For this to work, you also need to do the same to Grouping:

@cache_readonly
def observed_grouping(self):
    if self._observed:
        return self
    grouping = Grouping(...)
    return grouping

and use the observed_groupings in the BaseGrouper call above. For BinGrouper, I think you can just always return self (its behavior doesn't depend on observed=True/False).

Also, you can ignore exclusions - this is independent of the grouping data stored in BaseGrouper/Grouping.
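The cache_readonly pattern suggested above maps directly onto functools.cached_property; a minimal stand-in (hypothetical simplified classes, not the real pandas code):

```python
from functools import cached_property


class Grouping:
    """Simplified stand-in for the pandas Grouping class."""

    def __init__(self, observed: bool):
        self._observed = observed

    @cached_property
    def observed_grouping(self) -> "Grouping":
        if self._observed:
            # Already observed: reuse self, nothing to recompute.
            return self
        return Grouping(observed=True)


g = Grouping(observed=False)
obs = g.observed_grouping
print(obs._observed)               # True
print(obs is g.observed_grouping)  # True: cached after first access
```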

undermyumbrella1 (Contributor, Author):

Ah my bad, I have made the changes as suggested

undermyumbrella1 (Contributor, Author):

Thank you for the review, I have made the changes as suggested.

rhshadrach (Member):

Thanks for the changes @undermyumbrella1 - this is looking good! I have some minor refactor/style requests, but I'd like to get another eye here before any more work is done.

@mroeschke - would you be able to take a look? In addition to the issue linked in the OP, this is fixing a regression caused by #55738:

N = 10**3
data = {
    "a1": pd.Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
    "a2": pd.Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
    "b": np.random.random(N),
}
df = pd.DataFrame(data)
%timeit df.groupby(["a1", "a2"], observed=False)["b"].transform("sum")
# 6.83 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- main
# 687 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR

While it's undesirable to swap out the grouper as is done here, I do not see any better way. There may be more efficient ways of computing the observed codes / result_index, but that can readily be built upon this later on.
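On the closing note about computing the observed codes / result_index: one simple approach (a sketch, not what this PR does) is np.unique with return_inverse, which yields the observed categories and the remapped codes in a single pass:

```python
import numpy as np

# Codes index into a large category set; only 3 categories are observed.
codes = np.array([5, 2, 5, 9])

observed, remapped = np.unique(codes, return_inverse=True)
print(observed.tolist())  # [2, 5, 9]   -> observed category positions
print(remapped.tolist())  # [1, 0, 1, 2] -> codes into the observed-only set
```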

undermyumbrella1 (Contributor, Author):

Thank you for the review, I have updated the PR according to the comments.

mroeschke (Member) left a comment

Looks OK to me

rhshadrach (Member) left a comment

A few style requests, otherwise looks great!



    # GH#58084
    def test_min_multiple_unobserved_categories_no_type_coercion():

Member:

This seems redundant to me - I think the above test is sufficient here.

undermyumbrella1 (Contributor, Author):

resolved



    # GH#58084
    def test_min_float32_multiple_unobserved_categories_no_type_coercion():

Member:

Can you instead parametrize test_min_one_unobserved_category_no_type_coercion. Something like

@pytest.mark.parametrize("dtype", ["int32", "float32"])
def test_min_one_unobserved_category_no_type_coercion(dtype):
    ...
    df["B"] = df["B"].astype(dtype)

undermyumbrella1 (Contributor, Author):

resolved

Comment on lines 1650 to 1660:

    categories=[
        1,
        "randomcat",
        100,
        333,
        "cat43543",
        -4325466,
        54665,
        -546767,
        "432945",
        767076,

Member:

I don't think there is a need for so many here - can you make it 1-3 categories (so the test is more compact).

undermyumbrella1 (Contributor, Author):

resolved

@@ -2044,6 +2044,7 @@ def _gotitem(self, key, ndim: int, subset=None):
        elif ndim == 1:
            if subset is None:
                subset = self.obj[key]

Member:

Can you revert this line addition

undermyumbrella1 (Contributor, Author):

resolved

Member:

This still appears in the diff of this PR.

@undermyumbrella1 undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 49f5a1e to f3a3f63 Compare May 2, 2024 03:13
@undermyumbrella1 undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 64aa8cd to 58e759f Compare May 2, 2024 03:25
undermyumbrella1 (Contributor, Author):

Thank you for the review, I have updated the PR according to the comments.

rhshadrach (Member) left a comment

Looking really good - just some unintentional changes to core/generic.py and core/groupby/generic.py - I think you deleted a line from the former instead of the latter 😄

Also - a note about force pushing. Force pushing on your PR is okay, but do know it can make review a little harder. Namely, when you force push the "Show changes since your last review" option no longer works.

@@ -2044,6 +2044,7 @@ def _gotitem(self, key, ndim: int, subset=None):
        elif ndim == 1:
            if subset is None:
                subset = self.obj[key]

Member:

This still appears in the diff of this PR.

@@ -2055,7 +2055,6 @@ def __setstate__(self, state) -> None:
        object.__setattr__(self, "_attrs", attrs)
        flags = state.get("_flags", {"allows_duplicate_labels": True})
        object.__setattr__(self, "_flags", Flags(self, **flags))

Member:

Can you revert this line removal. Shouldn't have any diff in this file.

undermyumbrella1 (Contributor, Author):

resolved

undermyumbrella1 (Contributor, Author):

Thank you for the review, I have updated the PR according to the comments. Noted on force pushing.

rhshadrach (Member) left a comment

lgtm

@rhshadrach rhshadrach added Bug Performance Memory or execution speed performance labels May 8, 2024
@rhshadrach rhshadrach changed the title Use mask to create result_mask that filters nan categories BUG/PERF: Use mask to create result_mask that filters nan categories May 8, 2024
@rhshadrach rhshadrach changed the title BUG/PERF: Use mask to create result_mask that filters nan categories BUG/PERF: groupby.transform with unobserved categories May 8, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone May 8, 2024
@rhshadrach rhshadrach merged commit 8d543ba into pandas-dev:main May 8, 2024
52 checks passed
rhshadrach (Member):

Thanks @undermyumbrella1 - very nice!


Successfully merging this pull request may close these issues.

BUG: groupby.transform with a reducer and unobserved categories coerces dtype
5 participants