DOC: Highlight the difference between DataFrame/pd.Series/numpy ops when there are NA values #56939

JoostvanPinxten · 2024-01-18T08:19:53Z

Edit[rhshadrach]: The original report about inconsistent skipna arguments in groupby is captured in #15675. For this issue, it suffices to add notes to the documentation about the default incompatibility between pandas and NumPy as described below.

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
from IPython.display import display  # display() is only implicit inside Jupyter/IPython

np_series = [
    np.array(['A','A','A', 'B', 'B']), 
    np.array([1, 2, 3, 4, 5]), 
    np.array([np.nan, 2, 3, 4, 5])
]

# NumPy behavior
display(np.std(np_series[1], ddof=1)) # 1.5811388300841898
display(np.std(np_series[2], ddof=1)) # nan

# Pandas DataFrame behavior
df = pd.DataFrame(dict(zip(['Group', 'Val1', 'Val2'], np_series)))
display(df[['Val1', 'Val2']].std().to_frame())
display(df[['Val1', 'Val2']].std(skipna=False).to_frame())

# Pandas Series behavior
display(pd.Series(np_series[1]).std())
display(pd.Series(np_series[2]).std())
display(pd.Series(np_series[2]).std(skipna=False)) # equivalent to the default numpy behavior when called with ddof=1

# GroupBy behavior
display(df.groupby('Group').std()) # no nans reported, no option to pass skipna parameter

Issue Description

I've wasted quite some time trying to find out why in some edge cases (on a large dataset) a different value was reported by DataFrame.std vs numpy.std. The posts and documentation I've found focus on the ddof parameter only. It turned out that pandas skips NaN values by default, so it reported fewer NaN results than NumPy.

Expected Behavior

ChatGPT hinted to me that the Pandas Series has a skipna=True argument. This differs from numpy.std, which does NOT skip NaN values by default and does not even have such an optional argument.

I would suggest to:

Add documentation to all DataFrame, Series and GroupBy classes regarding this different default behavior.
Add the skipna=True argument to the Series.std and GroupBy.std methods

I can contribute this patch if this is indeed the right way to go.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-372.19.1.el8_6.x86_64
Version : #1 SMP Tue Aug 2 16:19:42 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.3
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : 1.1.0
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
snappy : None
sqlalchemy : 1.4.49
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None


rhshadrach · 2024-01-18T21:50:24Z

Thanks for the report. Agreed on adding a mention of skipna=False to the docs of Series/DataFrame methods where we talk about being compatible with NumPy. The inconsistency of the presence of skipna in groupby is not desirable. I'd be in favor of adding it everywhere (as opposed to removing it everywhere), but I think this needs some more discussion.

cc @jbrockmendel @mroeschke @jorisvandenbossche

rhshadrach · 2024-01-18T22:35:05Z

Related: #15675

jorisvandenbossche · 2024-01-18T23:34:19Z

The current docstring has a note:

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)

That could definitely be updated to mention that this holds only if there are no missing values (or that you have to specify skipna=False), or more generally to warn more clearly about the differences from np.std.
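
A minimal sketch of the point made in this comment, using the same values as the report: setting ddof=0 alone is not enough to reproduce numpy.std once a NaN is present, because pandas still skips it. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.array([np.nan, 2, 3, 4, 5])
s = pd.Series(values)

numpy_result = np.std(values)                  # ddof=0, NaN propagates -> nan
pandas_default = s.std()                       # ddof=1, NaN skipped
pandas_ddof_only = s.std(ddof=0)               # ddof matches NumPy, NaN still skipped
pandas_matching = s.std(ddof=0, skipna=False)  # both settings aligned -> nan

# Without missing values, ddof=0 alone is sufficient.
clean = np.array([1, 2, 3, 4, 5])
clean_matches = np.isclose(pd.Series(clean).std(ddof=0), np.std(clean))

print(numpy_result, pandas_default, pandas_ddof_only, pandas_matching, clean_matches)
```

So a docstring note aimed at NumPy parity would need to name both ddof=0 and skipna=False, or state that the data must be free of missing values.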

Ahmedniz1 · 2024-01-19T13:18:21Z

Is this task available, or is it in discussion as of now?
I'd like to take the task once the decision is taken about adding the skipna=False option.
Thanks

JoostvanPinxten · 2024-01-19T13:55:56Z

display(np.mean([np.nan,2,3,4,5])) # --> nan
display(pd.Series([np.nan,2,3,4,5]).mean()) # --> 3.5

The mean function has the same skipna=True issue as std.

And the GroupBy also does not support the skipna argument

I think you may want to add these in a consistent manner over all GroupBy methods...?
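
Until groupby reductions accept skipna directly, one possible workaround is to route the reduction through the Series method inside agg. This is only a sketch, assuming the pandas version from the report (2.0.3); the lambda-based call is typically slower than the built-in groupby std/mean, and the column and variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "A", "B", "B"],
    "Val2": [np.nan, 2, 3, 4, 5],
})
grouped = df.groupby("Group")["Val2"]

# Built-in reduction: NaN in group A is silently skipped.
default_std = grouped.std()

# Workaround: call the Series method per group with skipna=False,
# so any group containing a NaN reports NaN.
strict_std = grouped.agg(lambda s: s.std(skipna=False))
strict_mean = grouped.agg(lambda s: s.mean(skipna=False))

print(default_std, strict_std, strict_mean, sep="\n")
```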

JoostvanPinxten · 2024-01-19T13:59:51Z

What is the stance on being compatible with NumPy? Should Pandas follow that where possible?

NB: I would argue that changing the default setting of the parameter should be considered a breaking change in the API. I presume that a lot of scripts already assume that skipna=True is indeed the default behavior now.

rhshadrach · 2024-01-19T21:30:34Z

@Ahmedniz1

Is this task available, or is it in discussion as of now?

I think adding skipna to groupby operations needs more discussion. The addition to the documentation of skipna in Series/DataFrame methods for compatibility with NumPy would be very much welcome!

@JoostvanPinxten

What is the stance on being compatible with NumPy? Should Pandas follow that where possible?

I think that is too strong. I view compatibility with NumPy as a beneficial feature, but it must be weighed against other impacts of changes. In this instance, I am against changing the default of skipna to agree with NumPy, at least just for the sake of compatibility.

torext · 2024-01-22T14:45:14Z

NB: I would argue that changing the default setting of the parameter should be considered a breaking change in the API. I presume that a lot of scripts already assume that skipna=True is indeed the default behavior now.

I very much agree with this. Pandas is already a sort of convenience wrapper around NumPy, adding additional abstractions like coordinate names etc. Having NaNs ignored by default when computing the standard deviation seems desirable in most cases, and NumPy even has a separate function for handling NaNs in this case (np.nanstd). IMO it's safe to just assume that Pandas implements by default the more convenient of two options.

Please do not yet again change default behaviour just because someone was bored on a Sunday afternoon and had an epiphany; people write massive libraries based on Pandas and its default behaviours are often taken into account for the sake of code-readability and -brevity. skipna=False is documented and anyone who wants that level of strictness can easily enable it already that way.

I do agree that the documentation giving instructions for replicating NumPy behaviour is incomplete; that's IMO all that should be fixed here.
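
For reference, a short sketch of the np.nanstd comparison mentioned in this comment: with ddof=1 it gives the same number as the pandas default on the data from the report. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([np.nan, 2, 3, 4, 5])

nan_aware = np.nanstd(values, ddof=1)     # NumPy's NaN-ignoring variant
pandas_default = pd.Series(values).std()  # skipna=True, ddof=1 by default

print(nan_aware, pandas_default)
```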

jbrockmendel · 2024-01-22T16:41:10Z

+1 for adding the argument to groupby.std for consistency.

rhshadrach · 2024-01-26T20:11:37Z

I've updated #15675 for tracking adding skipna; reworking this as purely a documentation issue.

JoostvanPinxten · 2024-01-28T08:32:17Z

Sounds good to me. I hope to have some time tonight, but I am also OK if someone else picks this up.

Ahmedniz1 · 2024-01-28T16:32:39Z

Could you tell me where exactly in the documentation the difference should be highlighted?
I'd love to take this up.

JoostvanPinxten · 2024-02-14T12:18:43Z

I only know the resulting documentation, not where the sources are. There are quite a few places where e.g. std is used. A few samples:

GroupBy std
DataFrameGroupBy std
SeriesGroupBy std

It should contain something similar to:
DataFrame std

Regarding the skipna argument: "Exclude NA/null values. If an entire row/column is NA, the result will be NA."
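
A small sketch of what that sentence describes, assuming a float column that contains only NaN; the frame and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, np.nan, np.nan],  # entirely NA
})

# skipna=True is the default; with nothing left to reduce, "b" still yields NaN.
result = frame.std()
print(result)
```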

Regarding the note about ddof and consistency with numpy, see an example here:

But there are other operations where skipna is also relevant and perhaps non-default in numpy. I do not know how to best approach this in the code-base though.

JoostvanPinxten added the Bug and Needs Triage labels Jan 18, 2024

JoostvanPinxten changed the title ~~BUG: std method of DataFrame does not accept skipna as optional kwarg, but Series does~~ BUG: inconsistent skipna arguments for std method of DataFrame/pd.Series/numpy Jan 18, 2024

rhshadrach added the Missing-data and Needs Discussion labels and removed the Needs Triage label Jan 18, 2024

This was referenced Jan 24, 2024

REGR: groupby.idxmin/idxmax wrong result on extreme values #57046

Merged

API: Add skipna to groupby ops #57095

Closed

rhshadrach added the Docs label and removed the Bug and Needs Discussion labels Jan 26, 2024

rhshadrach changed the title ~~BUG: inconsistent skipna arguments for std method of DataFrame/pd.Series/numpy~~ DOC: Highlight the difference between DataFrame/pd.Series/numpy ops when there are NA values Jan 26, 2024
