API/DOC: clean up DataFrame.groupby.apply #22545

Open
h-vetinari opened this issue Aug 30, 2018 · 4 comments
Labels
API Design Apply Apply, Aggregate, Transform, Map Docs Groupby

Comments

@h-vetinari
Contributor

h-vetinari commented Aug 30, 2018

I work with df.groupby(...).apply() very often, and many aspects of its output are confusing (and sometimes wrong), particularly what happens to the index of the result. v0.23 cleaned up large parts of the apply API, but a lot remains.

Ideally, the documentation would contain a matrix (not necessarily in the following form), which the API would then implement, along these lines:

For as_index=True:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |    Series     |    groupby-columns    |      n/a       |  none
Series            |   DataFrame   |    groupby-columns    |     dropped    |  index (union) of Series
DataFrame         |   DataFrame   |   gb-cols + df.index  |     dropped    |  columns (union) of DFs
np.ndarray 1-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
np.ndarray 2-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
Index             |  MultiIndex?  |   gb-cols + output    |      n/a       |  n/a

For as_index=False:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |   DataFrame?  |      RangeIndex       |      n/a       |  gb-cols + output?
Series            |   DataFrame   |      RangeIndex       |      kept      |  gb-cols + index of Series?
DataFrame         |   DataFrame   |  to discuss / raise?  |      kept      |  gb-cols + columns of DFs
np.ndarray 1-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
np.ndarray 2-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
Index             |    Series?    |  to discuss / raise?  |      n/a       |  n/a
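
If the settled rows of the proposed as_index=True matrix were ever used to drive parametrized tests, they could be written down as plain data. The following is only a sketch of that idea: the names `Expected`, `PROPOSED_AS_INDEX_TRUE` and `expected_result_type` are hypothetical, and only the three rows without a question mark or a "to discuss" entry are included.

```python
from typing import NamedTuple


class Expected(NamedTuple):
    """One row of the proposed matrix (hypothetical container)."""
    result_type: str
    index_levels: str
    groupby_cols: str
    columns: str


# Copied from the proposed as_index=True table; undecided rows are left out.
PROPOSED_AS_INDEX_TRUE = {
    "scalar": Expected("Series", "groupby-columns", "n/a", "none"),
    "Series": Expected("DataFrame", "groupby-columns", "dropped",
                       "index (union) of Series"),
    "DataFrame": Expected("DataFrame", "gb-cols + df.index", "dropped",
                          "columns (union) of DFs"),
}


def expected_result_type(output_kind):
    """Look up the proposed result type for a given function output kind."""
    return PROPOSED_AS_INDEX_TRUE[output_kind].result_type
```

A test suite could iterate over this mapping and compare each entry against what apply actually returns.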

Currently, the behaviour is much more complicated, inconsistent, and in places wrong. I'm trying to fill in corresponding tables with the current behaviour and some issue xrefs, but they are far from complete:

For as_index=True:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |    Series     |    groupby-columns    |      n/a       |  none
Series (same idx) |   DataFrame   |    groupby-columns    |     kept?!     |  index of Series
Series (diff idx) |    Series?!   |  gb-cols + output.idx |      n/a       |  none?!
group as-is       |   DataFrame   |    original index?!   |     kept?!     |  original columns
group selection   |   DataFrame   |  gb-cols + output.idx |     kept?!     |  original columns
DataFrame         |   DataFrame   |  gb-cols + output.idx |      n/a       |  columns (union) of DFs
np.ndarray 1-dim  |    Series?!   |   groupby-columns     |      n/a       |  none
np.ndarray 2-dim  |    Series?!   |   groupby-columns     |      n/a       |  none
Index             |    Series?!   |   groupby-columns     |      n/a       |  none #22541

For as_index=False:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |    Series     |      RangeIndex       |      n/a       |  none
Series (same idx) |   DataFrame   |      RangeIndex       |     kept       |  index of Series
Series (diff idx) |    Series?!   | RngIdx + output.idx?! |      n/a       |  none?!
group as-is       |   DataFrame   |    original index?!   |     kept       |  original columns
group selection   |   DataFrame   | RngIdx + output.idx?! |     kept       |  original columns
DataFrame         |   DataFrame   | RngIdx + output.idx?! |      n/a       |  columns (union) of DFs
np.ndarray 1-dim  |    Series?!   |      RangeIndex       |      n/a       |  none
np.ndarray 2-dim  |    Series?!   |      RangeIndex       |      n/a       |  none
Index             |    Series?!   |      RangeIndex       |      n/a       |  none #22541
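
Rows like the ones above can be checked against an installed pandas with a small probe. This is a minimal sketch, not the script used to fill the tables: the toy frame, the helper name `describe_apply` and the three sample functions are assumptions made for illustration. It selects `[["val"]]` before calling apply so that the grouping column is not part of each group, which means the "groupby-cols" column of the tables is not exercised here. Results for as_index=False in particular may differ between pandas versions.

```python
import pandas as pd


def describe_apply(func, as_index=True):
    """Run func through groupby.apply on a toy frame and summarise the result."""
    df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 2, 3, 4]})
    # Selecting [["val"]] keeps the grouping column out of each group.
    result = df.groupby("key", as_index=as_index)[["val"]].apply(func)
    return {
        "type": type(result).__name__,
        "index_levels": result.index.nlevels,
        "index_names": list(result.index.names),
        # A Series has no .columns, so fall back to an empty list.
        "columns": list(getattr(result, "columns", [])),
    }


# Hypothetical sample functions, one per kind of function output.
cases = {
    "scalar": lambda g: g["val"].sum(),
    "Series (same idx)": lambda g: pd.Series(
        {"lo": g["val"].min(), "hi": g["val"].max()}
    ),
    "group selection": lambda g: g.head(1),
}

for as_index in (True, False):
    for name, func in cases.items():
        print(as_index, name, describe_apply(func, as_index=as_index))
```

Printing the summaries rather than hard-coding them keeps the probe usable for comparing versions side by side.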

Some xrefs: #20420, #22541, #22542, #22546

@gfyoung
Member

gfyoung commented Sep 1, 2018

@h-vetinari : Very thorough of you to do this! Keep at it!

@h-vetinari
Contributor Author

@gfyoung Thanks. Could you maybe add the tags for groupby and API Design as well, please? The groupby-issues especially are so numerous that it's easier to only search within the tagged ones...

@jbrockmendel
Member

@h-vetinari a lot of cleanup in/around groupby.apply has occurred recently. can you see if any of the bugs/inconsistencies here have been addressed?

@jocelynjyl

Similar to #20420, I see inconsistent output with groupby+apply+diff. If the groupby results in exactly 2 records, the output from applying diff is a transposed DataFrame. Otherwise, if there are more than 2 records in the groupby, the output is a Series.

I can open a separate issue with a code snippet if needed, but I think my issue might be resolved once #20420 is fixed.


4 participants