DataFrame.groupby().sum() treating Nan as 0.0 #20824
Comments
I think you want
In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a b
data1 2 NaN
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64 |
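The difference between the two calls above can be reproduced with a minimal sketch. The frame below is an assumption (the original code sample is not shown in the thread); it is built only so that the group ('data1', 2) contains a single NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the thread's original frame is not shown, so this one is
# constructed only so that group ('data1', 2) holds nothing but a NaN.
df = pd.DataFrame({
    "a": ["data1", "data2", "data3"],
    "b": [2, 3, 4],
    "c": [np.nan, 4.0, 4.0],
})

grouped = df.groupby(["a", "b"])["c"]

# Default: NaN is skipped, so an all-NaN group sums to 0.0.
default_sum = grouped.sum()

# min_count=1: a group needs at least one non-NaN value, otherwise the result is NaN.
strict_sum = grouped.sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Note that min_count only affects groups with too few valid values; a group that mixes NaN with real numbers still sums the real numbers.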
This is a bit surprising
In [23]: df.groupby(['a', 'b']).c.sum(min_count=1, skipna=False)
Out[23]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
Something strange w/ the |
Thanks! Did not think of removing skipna=False. skipna behavior should be consistent. |
I think there are two intertwined issues
In [27]: df.groupby(['a', 'b']).c.sum(min_count=1, foo=1)
Out[27]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
so passing |
#15675 for the skipna part. |
String values trigger the fallback too:
I think the fallback is happening here (https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py):
That |
Does this issue still need to be resolved? If so, I'd like to look into this. |
Yes please.
|
As was mentioned, the fallback was occurring when df.groupby().sum() was called with the skipna flag. This happened because the _cython_agg_general function did not accept the argument, which has now been fixed by PR #26179. The fallback still occurs with strings in the df; however, this seems to be a deeper issue stemming from the _aggregate() call in groupby/ops.py (line 572), which is what converts the NaN to a zero. |
I'm using latest v1.0.1 but still see this issue. Also the
|
Still an issue in v1.0.3
df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'), 'col2': (np.NaN, 2, np.NaN, 3)})
df_1
  col1  col2
0    a   NaN
1    a   2.0
2    b   NaN
3    c   3.0
df_2 = df_1.groupby('col1').agg(sum_col2=('col2', 'sum'), mean_col2=('col2', 'mean'))
df_2
      sum_col2  mean_col2
col1
a          2.0        2.0
b          0.0        NaN
c          3.0        3.0
np.mean([np.NaN])
nan
np.sum([np.NaN])
nan
np.mean([np.NaN, 2])
nan
np.sum([np.NaN, 2])
nan
Therefore, I would expect df_2 to be
      sum_col2  mean_col2
col1
a          NaN        NaN
b          NaN        NaN
c          3.0        3.0
Same unexpected result with
df_3 = df_1.groupby('col1').agg(sum_col2=('col2', np.sum), mean_col2=('col2', np.mean))
Also, the min_count=1 suggestion does not solve the problem. For example,
df_4 = pd.DataFrame({
    'col1': ('a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'e', 'e'),
    'col2': (np.NaN, 2, np.NaN, 3, 4, 5, np.NaN, 6, np.NaN, np.NaN)
})
df_5 = df_4.groupby('col1').sum(min_count=1)
df_5
      col2
col1
a      2.0
b      NaN
c      3.0
d      9.0
e      6.0
where I would expect df_5 to be
      col2
col1
a      NaN
b      NaN
c      3.0
d      NaN
e      NaN
There are also problems with std, but that seems more confusing.
pd.__version__
'1.0.3'
np.__version__
'1.18.4' |
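For readers who want the NaN-propagating result described above, one possible approach is to aggregate each group with Series.sum(skipna=False) through a user-defined function. This is a sketch only, not a fix proposed in the thread; the helper name nan_propagating_sum is hypothetical, and going through agg with a Python function is likely slower than the built-in sum.

```python
import numpy as np
import pandas as pd

# Same shape of data as df_4 in the comment above.
df_4 = pd.DataFrame({
    "col1": ("a", "a", "b", "c", "d", "d", "d", "e", "e", "e"),
    "col2": (np.nan, 2, np.nan, 3, 4, 5, np.nan, 6, np.nan, np.nan),
})

def nan_propagating_sum(values: pd.Series) -> float:
    """Hypothetical helper: sum one group, returning NaN if any value is NaN."""
    return values.sum(skipna=False)

# agg calls the helper once per group with that group's values as a Series.
strict = df_4.groupby("col1")["col2"].agg(nan_propagating_sum)
print(strict)
```

Only group c, which has no missing values, keeps a numeric sum; every group containing at least one NaN yields NaN.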
I might try a Pull Request to solve this. I assign numpy.inf to NaN values in my columns and then execute whatever function (prod,mean,sum) with groupby. Then, I assign numpy.nan to everything that resulted in numpy.inf. There's an example I posted in this stackoverflow discussion: |
If anyone else comes across this issue, FWIW I employ the following solution pending a Pandas bug fix:
|
The actual improvement to add |
Not sure what common tricky case is referring to here, is it just specifying |
Specifying |
@y4n9squared - that issue is #15675. If that is all that remains here, this issue can be closed. This is why I asked my question about documentation. |
Code Sample, a copy-pastable example if possible
Problem description
The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN and have sum() return NaN?
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None