
Sum of pd.DataFrame.groupby.sum containing NaN should return NaN ? #15674

Closed
flipdazed opened this issue Mar 13, 2017 · 7 comments
Labels
Groupby · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · Usage Question

Comments

@flipdazed

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
d = {'l':  ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

Problem description

When a grouped DataFrame contains an np.nan value, the output is not aligned with numpy.sum or pandas.Series.sum.

NaN is returned when the skipna=False flag is passed to pd.Series.sum (and also pd.DataFrame.sum):

In [235]: df.v.sum(skipna=False)
Out[235]: nan

However, this behavior is not reflected in the pandas.DataFrame.groupby object

In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0

and cannot be forced by applying the np.sum method directly

In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0

See this StackOverflow post for a workaround.
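A workaround along those lines (a sketch of the general approach, not necessarily the exact StackOverflow answer): aggregate with a lambda that calls Series.sum(skipna=False) per group.

```python
import pandas as pd
import numpy as np

# the issue's toy data
d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# per-group sum that propagates NaN instead of skipping it
strict_sums = df.groupby('l')['v'].apply(lambda s: s.sum(skipna=False))
# left  -> -3.0 (all values present)
# right -> NaN  (one value missing)
```

The lambda forces each group through Series.sum, where skipna is honoured, instead of the groupby's own Cython sum path.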

Expected Output

In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan

and

In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

@jreback
Contributor

jreback commented Mar 13, 2017

np.sum is translated directly to the pandas sum, so this is as expected.

If you really, really want this behavior of np.sum (this is not useful in any way IMHO):

In [15]: df.groupby('l')['v'].apply(lambda x: np.sum(np.array(x)))['right']
Out[15]: nan

or

In [18]: df.groupby('l')['v'].apply(lambda x: x.sum(skipna=False))
Out[18]: 
l
left    -3.0
right    NaN

The pass-through skipna parameter is not implemented at the moment on groupby, so I will make an issue for that (I thought we had one).

@jreback added the Groupby and Missing-data labels Mar 13, 2017
@jreback added this to the No action milestone Mar 13, 2017
@jreback closed this as completed Mar 13, 2017
@jreback
Contributor

jreback commented Mar 13, 2017

xref #15675

@flipdazed
Author

flipdazed commented Mar 13, 2017

This is not useful in any way IMHO.

Consider the following:

  • merge two DataFrames together from an SQL database
  • carry out data manipulation to get an aggregated figure per some index values, one that depends on all values being present

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN; missing data is an extremely common occurrence in industry.

Either way, I agree with your assessment that this is inconsistent with the current implementation, as skipna should be a kwarg. Many thanks for creating the new issue.
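The scenario described above can be sketched as follows (hypothetical tables and column names, not code from the thread):

```python
import pandas as pd
import numpy as np

# hypothetical tables standing in for two SQL extracts
prices = pd.DataFrame({'id': [1, 2, 3], 'price': [10.0, 20.0, 30.0]})
fills = pd.DataFrame({'id': [1, 2], 'qty': [5.0, 7.0]})  # id 3 is missing

# left join introduces NaN for the missing row
merged = prices.merge(fills, on='id', how='left')
merged['notional'] = merged['price'] * merged['qty']

# the default sum silently drops the missing row
total = merged['notional'].sum()               # 190.0, NaN skipped
# forcing NaN propagation flags the incomplete data instead
strict = merged['notional'].sum(skipna=False)  # nan
```

With skipna=False the incomplete join surfaces as NaN rather than as a silently understated total, which is the behavior the comment argues for.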

@jreback
Contributor

jreback commented Mar 13, 2017

@flipdazed pandas propagates NaN values on purpose. Generally on aggregations you want to skip them. If you don't, there are many options (e.g. look at .filter on groupby, or simply .fillna), but that is far less common than simply aggregating.

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN; missing data is an extremely common occurrence in industry.

You really think so? It sounds like you don't handle missing data at all.
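The .filter and .fillna options mentioned above can be sketched like this (illustrative usage against the issue's toy data, not code from the thread):

```python
import pandas as pd
import numpy as np

# the issue's toy data, reduced to the two relevant columns
df = pd.DataFrame({'l': ['left', 'right'] * 3,
                   'v': [-1, 1, -1, 1, -1, np.nan]})

# Option 1 -- .filter: drop any group containing a missing value entirely
complete = df.groupby('l').filter(lambda g: g['v'].notna().all())
filtered_sums = complete.groupby('l')['v'].sum()  # only 'left' remains

# Option 2 -- .fillna: state explicitly what a missing value is worth
filled_sums = df.fillna({'v': 0}).groupby('l')['v'].sum()
```

Both make the missing-data decision explicit before aggregating, instead of relying on the aggregation to propagate NaN.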

@jorisvandenbossche
Member

You really think so?

Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.

@jreback
Contributor

jreback commented Mar 13, 2017

Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.

Sure, but in the vast, vast majority of cases you want to skip NaN. It's in fact very uncommon to assume ALL data is valid; that is my point.

@ajosanchez

It would be nice to have a keyword and get those NaNs back in this case:
groupby(['x']).resample('D').sum()

When I resample after a groupby I need an aggregating function. It would be nice not to get zeros when I increase the resolution, so that I can use .fillna(method='ffill').
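A sketch of that situation (hypothetical data; the min_count keyword shown here requires pandas >= 0.22, which postdates this thread):

```python
import pandas as pd
import numpy as np

# hypothetical data: group 'a' has daily values with a one-day gap
df = pd.DataFrame(
    {'x': ['a', 'a'], 'v': [1.0, 3.0]},
    index=pd.to_datetime(['2017-03-01', '2017-03-03']))

# by default the empty upsampled bin sums to 0
dense = df.groupby('x').resample('D')['v'].sum()

# with min_count=1 the empty bin comes back as NaN instead,
# which a forward fill (.ffill, the modern spelling of
# .fillna(method='ffill')) can then fill
sparse = df.groupby('x').resample('D')['v'].sum(min_count=1)
filled = sparse.ffill()
```

min_count turns "fewer than this many valid values in the bin" into NaN, which is exactly the hook the comment asks for in the resample-after-groupby case.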
