API: groupby aggregation with apply does not drop groupby-column #22542

Closed
h-vetinari opened this issue Aug 30, 2018 · 6 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

@h-vetinari
Contributor

h-vetinari commented Aug 30, 2018

The docs for groupby say (http://pandas.pydata.org/pandas-docs/stable/groupby.html):

Note:
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
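
A minimal sketch of the as_index behaviour that the note describes, using a small made-up frame (the data and the variable names are illustrative assumptions, not taken from the docs):

```python
import pandas as pd

# Illustrative data: one grouping column ('id') and two value columns
df = pd.DataFrame({
    "id": [10, 10, 11, 12, 12],
    "x": [1, 2, 3, 4, 5],
    "y": [10, 20, 30, 40, 50],
})

# Default (as_index=True): 'id' becomes the index and is not a column
as_index_true = df.groupby("id").sum()

# as_index=False: 'id' is returned as a regular column
as_index_false = df.groupby("id", as_index=False).sum()

print(list(as_index_true.columns))
print(as_index_true.index.name)
print(list(as_index_false.columns))
```

With the default, the group keys end up only in the index; with as_index=False they come back as a column alongside the aggregated values.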

The section implies that this is about the builtin aggregations and the aggregate functionality. However, I very often need to run complicated functions on the groups themselves, so apply is my bread and butter (this is part of a larger issue: groupby.apply behaves inconsistently in several ways).

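import numpy as np
import pandas as pd

N = 10
# empty frame with an 'id' column and three value columns
df = pd.DataFrame(index=range(N), columns=['id', 'x', 'y', 'z'])
df.loc[:, ['x', 'y', 'z']] = np.arange(N*3).reshape(N, 3)
# random group ids in {10, 11, 12}; the output below is from one particular run
df.id = np.random.randint(0, int(N/3), (N,)) + 10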
df
#    id   x   y   z
# 0  12   0   1   2
# 1  12   3   4   5
# 2  11   6   7   8
# 3  10   9  10  11
# 4  12  12  13  14
# 5  12  15  16  17
# 6  12  18  19  20
# 7  11  21  22  23
# 8  10  24  25  26
# 9  10  27  28  29

For something like sum, the groupby-column gets dropped, as described:

df.groupby('id').sum()
#      x   y   z
# id            
# 10  60  63  66
# 11  27  29  31
# 12  48  53  58

But when the same function is used in apply, the result is different: the groupby column is not removed, and the dtype changes as well.

df.groupby('id', as_index=True).apply(lambda gr: gr.sum())
#       id     x     y     z
# id                        
# 10  30.0  60.0  63.0  66.0
# 11  22.0  27.0  29.0  31.0
# 12  60.0  48.0  53.0  58.0
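
One possible way to sidestep the difference, sketched under assumptions: select the value columns explicitly before calling apply, so the function never sees the grouping column. The frame below is illustrative (not the random one from the report), and whether apply passes the grouping column to the function has varied between pandas versions, so this is a sketch rather than a statement about any particular release:

```python
import pandas as pd

# Illustrative data (hypothetical, fixed values instead of random ids)
df = pd.DataFrame({
    "id": [12, 12, 11, 10, 12],
    "x": [0, 3, 6, 9, 12],
    "y": [1, 4, 7, 10, 13],
})

# Builtin aggregation: 'id' is moved to the index
builtin = df.groupby("id").sum()

# Restrict the groups to the value columns before apply,
# so the lambda only ever receives 'x' and 'y'
via_apply = df.groupby("id")[["x", "y"]].apply(lambda gr: gr.sum())

print(builtin)
print(via_apply)
```

With the explicit column selection, both results should contain only the value columns, with the group keys in the index.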

Ideally, I'd like to make the behaviour of groupby.apply more consistent in a number of cases, and this is one of them.

@h-vetinari h-vetinari changed the title from "API: groupby custom aggregation behaves differently than builtins" to "API: groupby aggregation with apply does not drop groupby-column" on Aug 30, 2018
@WillAyd
Member

WillAyd commented Aug 30, 2018

Related to #20420 - we generally have a few inconsistencies in apply that need to be cleaned up

@h-vetinari
Contributor Author

@WillAyd

Related to #20420 - we generally have a few inconsistencies in apply that need to be cleaned up

Started collecting some of them in #22545

@jreback
Contributor

jreback commented Aug 30, 2018

rather than opening new issues pls look at open existing ones

@h-vetinari
Contributor Author

@jreback

rather than opening new issues pls look at open existing ones

I did (https://github.com/pandas-dev/pandas/issues?page=2&q=is%3Aissue+is%3Aopen+apply+label%3AGroupby&utf8=%E2%9C%93), but did not find much - guess I did not go back far enough in time - sorry.

Going over them a second time, I did overlook #13217 and #15290, and possibly #18103, which is somewhat related. I don't think there's anything as comprehensive as what I'm trying to summarize in #22545, but #13056 is a start.

@simonjayhawkins
Member

closing as duplicate of #13217. ping me to reopen if I'm missing something.

@simonjayhawkins simonjayhawkins added the "Duplicate Report" label (Duplicate issue or pull request) and removed the "Groupby" label on Apr 24, 2020
@h-vetinari
Contributor Author

Fine with me.


4 participants