BUG/ENH: Use pyarrow grouped aggregation functions for pyarrow-backed groupby ops #55234
Conversation
generally looks good, will run ASVs tomorrow.
pandas/core/arrays/arrow/array.py (Outdated)

        pa.Scalar or pa.ChunkedArray
        """
        pa_type = self._pa_array.type
        cast_kwargs = {"safe": False}
Why not just pass safe=False?
I see - more or less a carryover from the previous implementation. Seems a little odd to do in the new one.
yep, it was carryover - I've updated it.
I'm seeing a somewhat more drastic performance regression:
hmmm, that's too bad. Here are some of the grouped reductions that fail the test added here on main:
    res = (
        pa.Table.from_pydict({"id": ids, "val": arr})
        .group_by("id")
IIUC this is re-calculating the codes we already have in our GroupBy object. I see the appeal, but am wary of this pattern. I think the long-term solution is to get pyarrow to expose something to which we can pass the already-computed codes?
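For readers unfamiliar with the term: the "codes" here are the per-row integer group ids that the GroupBy object has already computed via factorization. A minimal illustration (not the GroupBy internals themselves, just the concept):

```python
import pandas as pd

# factorize assigns each row an integer group id ("code") plus the
# unique labels; GroupBy holds exactly this kind of result already
codes, uniques = pd.factorize(["b", "a", "b", "a", "a"])
print(codes)    # one group id per row
print(uniques)  # one label per group
```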
Yep, that's fair. Also, it would be nice to pass all aggregations at once to pyarrow - see the last comment in the OP.
Longer term, there are potentially big performance gains if all arrow-backed aggregations can be passed to pyarrow at once. At the moment, we're iterating column by column. Here is a small example showing the impact:
Any idea what is driving the perf impact? Does the pyarrow version do something in parallel?
I can't tell from reading the docs whether anything is done in parallel. It could be simply (as you've pointed out) that pyarrow is only computing the group codes once when done in a single batch and recomputing them when done iteratively.
cc @rhshadrach, @jbrockmendel, @mroeschke. Should I close this given the performance concerns, or do you think we should retain some of it, e.g. using it only for pyarrow decimal types?
I'm going to rerun my benchmarks in the next day or so to be sure. Do we have a good understanding of where the perf regressions are specifically coming from?
Here are two suspicions:
This is behavior from this branch (not main) - note the output type:
This is consistent with pyarrow non-grouped behavior:
#53158 would get us part of the way there for that. Getting the rest of the way there would require more gymnastics than I think we should do.
Might be worth discussing separately, but my knee-jerk reaction is to avoid that value-dependent behavior. The main perf upside seems to be for decimal, which is a pretty low priority. I'd like to look for ways to address the bugs identified here without calling the pa.Table methods, and longer run try to get pyarrow to expose something that lets us pass the codes.
        return data_to_reduce

    def _maybe_cast_reduction_result(
IIUC at least some of the bugfix comes from adding this casting to _groupby_op. Could you split off from this PR one that implements _values_for_reduction and _maybe_cast_reduction_result?
Same results approximately.
Thanks for the reviews. I'll close this and maybe split off some of the smaller pieces.
Use pyarrow's TableGroupBy functionality for pyarrow-backed groupby aggregations. This has a few benefits over the current approach of round-tripping through non-arrow dtypes:
Performance impact is mixed. I think the consistency benefits mentioned above outweigh the performance impact here. If others disagree, we could continue using the pandas implementation for specific aggregations.
Note: These timings are from a slow laptop. It would be great if someone could run this ASV to confirm (@rhshadrach possibly?)
asv continuous -f 1.1 upstream/main arrow-groupby -b groupby.GroupByAggregateArrowDtypes
Longer term, there are potentially big performance gains if all arrow-backed aggregations can be passed to pyarrow at once. At the moment, we're iterating column by column. Here is a small example showing the impact: