
Sum of pd.DataFrame.groupby.sum containing NaN should return NaN ? #15674

Closed
flipdazed opened this issue Mar 13, 2017 · 7 comments
Labels
Groupby · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · Usage Question

Comments

@flipdazed

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
d = {'l':  ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

Problem description

When a grouped DataFrame contains an np.nan value, the output is not aligned with numpy.sum or pandas.Series.sum.

NaN is returned when the skipna=False flag is passed to pd.Series.sum (and also pd.DataFrame.sum):

In [235]: df.v.sum(skipna=False)
Out[235]: nan

However, this behavior is not reflected in the pandas.DataFrame.groupby object

In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0

and cannot be forced by applying the np.sum method directly

In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0

See this StackOverflow post for a workaround.
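A workaround along those lines (a sketch of the general approach, not necessarily the exact StackOverflow answer): aggregate with a lambda that calls Series.sum(skipna=False) per group.

```python
import pandas as pd
import numpy as np

# the issue's toy data
d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# per-group sum that propagates NaN instead of skipping it
strict_sums = df.groupby('l')['v'].apply(lambda s: s.sum(skipna=False))
# left  -> -3.0 (all values present)
# right -> NaN  (one value missing)
```

The lambda forces each group through Series.sum, where skipna is honoured, instead of the groupby's own Cython sum path.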

Expected Output

In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan

and

In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

@jreback
Contributor

jreback commented Mar 13, 2017

np.sum is translated directly to the pandas sum, so this is as expected.

If you really, really want this behavior of np.sum (this is not useful in any way IMHO):

In [15]: df.groupby('l')['v'].apply(lambda x: np.sum(np.array(x)))['right']
Out[15]: nan

or

In [18]: df.groupby('l')['v'].apply(lambda x: x.sum(skipna=False))
Out[18]: 
l
left    -3.0
right    NaN

The pass-through skipna parameter is not implemented at the moment on groupby, so I will make an issue for that (I thought we had one).

@jreback added the Groupby and Missing-data labels Mar 13, 2017
@jreback added this to the No action milestone Mar 13, 2017
@jreback closed this as completed Mar 13, 2017
@jreback
Contributor

jreback commented Mar 13, 2017

xref #15675

@flipdazed
Author

flipdazed commented Mar 13, 2017

This is not useful in any way IMHO.

Consider the following:

  • merge two DataFrames together from an SQL database
  • carry out data manipulation to get an aggregated figure per some index values, one that depends on all values being present

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN; missing data is an extremely common occurrence in industry.

Either way, I agree with your assessment that this is inconsistent with the current implementation, as skipna should be a kwarg. Many thanks for creating the new issue.
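The scenario described above can be sketched as follows (hypothetical tables and column names, not code from the thread):

```python
import pandas as pd
import numpy as np

# hypothetical tables standing in for two SQL extracts
prices = pd.DataFrame({'id': [1, 2, 3], 'price': [10.0, 20.0, 30.0]})
fills = pd.DataFrame({'id': [1, 2], 'qty': [5.0, 7.0]})  # id 3 is missing

# left join introduces NaN for the missing row
merged = prices.merge(fills, on='id', how='left')
merged['notional'] = merged['price'] * merged['qty']

# the default sum silently drops the missing row
total = merged['notional'].sum()               # 190.0, NaN skipped
# forcing NaN propagation flags the incomplete data instead
strict = merged['notional'].sum(skipna=False)  # nan
```

With skipna=False the incomplete join surfaces as NaN rather than as a silently understated total, which is the behavior the comment argues for.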

@jreback
Contributor

jreback commented Mar 13, 2017

@flipdazed pandas propagates NaN values on purpose. Generally on aggregations you want to skip them. If you don't, there are many options (e.g. look at .filter on groupby, or simply .fillna), but that is far less common than simply aggregating.

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN; missing data is an extremely common occurrence in industry.

You really think so? It sounds like you don't handle missing data at all.
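The .filter and .fillna options mentioned above can be sketched like this (illustrative usage against the issue's toy data, not code from the thread):

```python
import pandas as pd
import numpy as np

# the issue's toy data, reduced to the two relevant columns
df = pd.DataFrame({'l': ['left', 'right'] * 3,
                   'v': [-1, 1, -1, 1, -1, np.nan]})

# Option 1 -- .filter: drop any group containing a missing value entirely
complete = df.groupby('l').filter(lambda g: g['v'].notna().all())
filtered_sums = complete.groupby('l')['v'].sum()  # only 'left' remains

# Option 2 -- .fillna: state explicitly what a missing value is worth
filled_sums = df.fillna({'v': 0}).groupby('l')['v'].sum()
```

Both make the missing-data decision explicit before aggregating, instead of relying on the aggregation to propagate NaN.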

@jorisvandenbossche
Member

You really think so?

Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.

@jreback
Contributor

jreback commented Mar 13, 2017

Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.

Sure, but in the vast, vast majority of cases you want to skip NaN. It's in fact very uncommon to assume ALL data is valid; that is my point.

@ajosanchez

It would be nice to have a keyword and get those NaNs back in this case:
groupby(['x']).resample('D').sum()

When I resample after a groupby I need an aggregating function. It would be nice not to get zeros when I increase the resolution, so that I can use .fillna(method='ffill').
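A sketch of that situation (hypothetical data; the min_count keyword shown here requires pandas >= 0.22, which postdates this thread):

```python
import pandas as pd
import numpy as np

# hypothetical data: group 'a' has daily values with a one-day gap
df = pd.DataFrame(
    {'x': ['a', 'a'], 'v': [1.0, 3.0]},
    index=pd.to_datetime(['2017-03-01', '2017-03-03']))

# by default the empty upsampled bin sums to 0
dense = df.groupby('x').resample('D')['v'].sum()

# with min_count=1 the empty bin comes back as NaN instead,
# which a forward fill (.ffill, the modern spelling of
# .fillna(method='ffill')) can then fill
sparse = df.groupby('x').resample('D')['v'].sum(min_count=1)
filled = sparse.ffill()
```

min_count turns "fewer than this many valid values in the bin" into NaN, which is exactly the hook the comment asks for in the resample-after-groupby case.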
