Deprecate: level parameter for aggregations in DataFrame and Series #39983

Closed
devin-petersohn opened this issue Feb 23, 2021 · 12 comments · Fixed by #40869
Labels
Deprecate Functionality to remove in pandas Groupby Numeric Operations Arithmetic, Comparison, and Logical operations
Comments

@devin-petersohn
Contributor

For sum, prod, count, etc., there is a level parameter that is internally rewritten into a groupby.

I propose that the level parameter for these aggregations be deprecated and instead recommend to users that they use the groupby syntax.

The functions that would be affected are:

  • any
  • all
  • count
  • sum
  • prod
  • max
  • min
  • mean
  • median
  • skew
  • kurt
  • sem
  • var
  • std
  • mad

I believe this list is comprehensive.
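As a minimal sketch of the recommended spelling, assuming a small made-up frame with a two-level index (the names a, b and value are illustrative, not from the issue), the groupby form looks like this:

```python
import pandas as pd

# Hypothetical data: a frame with a two-level MultiIndex named "a" and "b".
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1), ("y", 2)], names=["a", "b"]
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

# Deprecated spelling discussed in this issue: df.sum(level="a")
# Recommended spelling: group on the index level, then aggregate.
by_a = df.groupby(level="a").sum()
print(by_a)
```

The same pattern should apply to the other reductions in the list, for example df.groupby(level="a").mean().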

@devin-petersohn
Contributor Author

I'd be happy to do this, but it will take me a little time. Are these docs up to date? From what I found here, the docs don't cover how to deprecate an argument. Are there special instructions for handling that, or a past commit I should use as a guideline?

cc @jreback

@jreback
Contributor

jreback commented Feb 23, 2021

The instructions you linked are accurate.

@rhshadrach rhshadrach added Deprecate Functionality to remove in pandas Groupby Numeric Operations Arithmetic, Comparison, and Logical operations labels Feb 23, 2021
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Feb 23, 2021
@jorisvandenbossche
Member

I am personally fine with deprecating this keyword (I never use(d) it myself in practice).

Pinging some more people since it's quite a prominent keyword affecting many functions @pandas-dev/pandas-core @pandas-dev/pandas-triage

@toobaz
Member

toobaz commented Feb 25, 2021

I would be +1 if we were removing a seldom-used keyword in exchange for a significant simplification in the code; but my understanding is that we would be removing syntactic sugar which is implemented in a relatively straightforward way. I'm not sure this is worth the potential disruption to anybody who is using (and liking) this. I myself would be tempted to use it... now that I know it exists ;-)

@datapythonista
Member

+1 to deprecate. Seems like a nice API cleanup, and I think the Zen of Python makes a lot of sense when it says that there should be one, and preferably only one, obvious way to do things.

@jnothman
Contributor

2c: I never noticed these, so groupby is probably the "one right way to do it". However, the bloat in pandas IMO lies more in a redundancy of verbs than of parameters, so I'm not convinced that you gain a whole lot by this.

One key question is: how clear is df.sum(level='a') to the novice reader?

I also note that the docs are currently wrong in pd.DataFrame.sum etc.: "count along a particular level, collapsing into a Series". It actually collapses into a DataFrame. The documented type is also inaccurate, since you can pass a list of levels. So it seems that consistently documenting this kind of syntactic sugar can be hard.
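To illustrate the two points about the result type and list-valued levels, here is a small sketch written with the groupby spelling. The three-level index and the column name value are made up for the example:

```python
import pandas as pd

# Hypothetical data: a frame with a three-level MultiIndex ("a", "b", "c").
index = pd.MultiIndex.from_product(
    [["x", "y"], [1, 2], ["p", "q"]], names=["a", "b", "c"]
)
df = pd.DataFrame({"value": range(8)}, index=index)

# Grouping on a list of levels collapses only the remaining level ("c").
result = df.groupby(level=["a", "b"]).sum()

# The result is still a DataFrame, indexed by the levels that were kept.
print(type(result).__name__)
print(result)
```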

@toobaz
Member

toobaz commented Feb 27, 2021

I also note that the docs are currently wrong in pd.DataFrame.sum

Right. I think the best way to document syntactic sugar is typically just to describe what it replaces, e.g. "equivalent to df.groupby(level).sum()". Then the type description should clearly coincide with the one from groupby.

@phofl
Member

phofl commented Apr 2, 2021

#40660 is a good argument for deprecating this.

@TheNeuralBit
Contributor

#40788 looks like another

@phofl
Member

phofl commented Apr 5, 2021

This is actually a duplicate of #40660

@jreback jreback modified the milestones: Contributions Welcome, 1.3 Apr 12, 2021
@geoffrey-eisenbarth
Contributor

Apologies if this is the wrong place to discuss this; please let me know and I'll gladly move it. I only recently updated to 1.3 and noticed the FutureWarning related to this change. In light of this change, I would like to request that groupby.sum() be modified to take the skipna parameter. See below for my use case.

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame(
  data=1,
  columns=['Col_1', 'Col_2', 'Col_3'],
  index=pd.MultiIndex.from_product(
    iterables=[
      ['001', '002', '003', '004', '005'],
      pd.date_range('2001', '2005', freq='A'),
    ],
    names=['region', 'date'],
  ),
)

# Set some data equal to NaN to demonstrate issue below
df.loc['002', :] = np.nan

regions = {
  '001': 'Region A',
  '002': 'Region A',
  '003': 'Region A',
  '004': 'Region B',
  '005': 'Region B',
}

Prior to 1.3, I would use the following to sum up data from smaller regions ('001', '002', etc.) to bigger regions (A, B):

(
  df.groupby(regions, level='region')
    .apply(lambda g: g.sum(level='date', skipna=False))
)

which, as of 1.3, I suppose should become something like

(
  df.groupby(regions, level='region')
    .apply(lambda g: (
       g.groupby(level='date').sum(skipna=False)
    ))
)

However, groupby(...).sum(skipna=False) raises TypeError: sum() got an unexpected keyword argument 'skipna', which means that the equivalent code is now

(
  df.groupby(regions, level='region')
    .apply(lambda g: (
      g.groupby(level='date')
       .apply(lambda h: h.sum(skipna=False))
    ))
)

Am I missing something? Is there an easier way to do this? Thanks!
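One possible way to flatten the nested apply calls, offered as a sketch rather than an officially recommended pattern, is to group once on both the mapped region label and the date, and pass a lambda to agg so that skipna=False reaches the per-column sum. The data below is a reduced, made-up version of the setup above (integer years instead of a date range), and region_labels and dates are names introduced only for this example:

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical version of the setup: three regions, two dates.
index = pd.MultiIndex.from_product(
    [["001", "002", "003"], [2001, 2002]], names=["region", "date"]
)
df = pd.DataFrame({"Col_1": [1.0] * len(index)}, index=index)
df.loc[("002", 2001), "Col_1"] = np.nan

regions = {"001": "Region A", "002": "Region A", "003": "Region B"}

# Build one grouping key per row from each index level.
region_labels = df.index.get_level_values("region").map(regions)
dates = df.index.get_level_values("date")

# A single groupby on both keys; the lambda receives each column of each
# group as a Series, so Series.sum(skipna=False) propagates NaN.
result = df.groupby([region_labels, dates]).agg(lambda s: s.sum(skipna=False))
print(result)
```

The lambda runs in Python for every group and column, so this is likely slower than a built-in groupby reduction on large data.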

@mzeitlin11
Member

I think this is a reasonable request. There's #15675 for more context (and an open but stalled PR at #41399).
