DOC: Highlight the difference between DataFrame/pd.Series/numpy ops when there are NA values #56939

Open
3 tasks done
JoostvanPinxten opened this issue Jan 18, 2024 · 13 comments
Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@JoostvanPinxten

JoostvanPinxten commented Jan 18, 2024

Edit[rhshadrach]: The original report about inconsistent skipna arguments in groupby is captured in #15675. For this issue, it suffices to add notes to the documentation about the default incompatibility between pandas and NumPy as described below.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
from IPython.display import display  # display() is built in only inside Jupyter/IPython

np_series = [
    np.array(['A','A','A', 'B', 'B']), 
    np.array([1, 2, 3, 4, 5]), 
    np.array([np.nan, 2, 3, 4, 5])
]

# NumPy behavior
display(np.std(np_series[1], ddof=1)) # 1.5811388300841898
display(np.std(np_series[2], ddof=1)) # nan

# Pandas DataFrame behavior
df = pd.DataFrame(dict(zip(['Group', 'Val1', 'Val2'], np_series)))
display(df[['Val1', 'Val2']].std().to_frame())
display(df[['Val1', 'Val2']].std(skipna=False).to_frame())

# Pandas Series behavior
display(pd.Series(np_series[1]).std())
display(pd.Series(np_series[2]).std())
display(pd.Series(np_series[2]).std(skipna=False)) # equivalent to the default numpy behavior when called with ddof=1

# GroupBy behavior
display(df.groupby('Group').std()) # no nans reported, no option to pass skipna parameter

Issue Description

I wasted quite some time trying to find out why, in some edge cases on a large dataset, DataFrame.std and numpy.std reported different values. The posts and documentation I found focus only on the ddof parameter. The actual cause was missing-value handling: pandas skips NaN values by default, so it reported fewer NaN results than NumPy.

Expected Behavior

ChatGPT hinted to me that the pandas Series.std method has a skipna=True argument. This differs from numpy.std, which does NOT skip NaN values by default and has no such optional argument.

I would suggest to:

  1. Add documentation to all DataFrame, Series and GroupBy classes regarding this different default behavior.
  2. Add the skipna=True argument to the Series.std and GroupBy.std methods

I can contribute this patch if this is indeed the right way to go.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-372.19.1.el8_6.x86_64
Version : #1 SMP Tue Aug 2 16:19:42 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.3
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : 1.1.0
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
snappy : None
sqlalchemy : 1.4.49
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@JoostvanPinxten JoostvanPinxten added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 18, 2024
@JoostvanPinxten JoostvanPinxten changed the title BUG: std method of DataFrame does not accept skipna as optional kwarg, but Series does BUG: inconsistent skipna arguments for std method of DataFrame/pd.Series/numpy Jan 18, 2024
@rhshadrach
Member

rhshadrach commented Jan 18, 2024

Thanks for the report. Agreed on mentioning skipna=False in the documentation of Series/DataFrame methods where we talk about being compatible with NumPy. The inconsistent presence of skipna in groupby is not desirable. I'd be in favor of adding it everywhere (as opposed to removing it everywhere), but I think this needs some more discussion.

cc @jbrockmendel @mroeschke @jorisvandenbossche

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 18, 2024
@rhshadrach
Member

Related: #15675

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 18, 2024

The current docstring has a note:

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)

That could definitely be updated to mention that this holds only if there are no missing values (or that you also have to specify skipna=False), or more generally to warn more clearly about the differences from np.std.
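To make the gap in that docstring note concrete, here is a minimal sketch. The variable names are made up for illustration and the sample array is the one from the report. It shows that ddof=0 alone matches np.std only when the data has no missing values:

```python
import numpy as np
import pandas as pd

values = np.array([np.nan, 2.0, 3.0, 4.0, 5.0])
s = pd.Series(values)

# ddof=0 alone is not enough here: pandas skips the NaN by default,
# so the result is computed over 2, 3, 4, 5 only.
pandas_ddof0 = s.std(ddof=0)

# NumPy propagates the NaN.
numpy_default = np.std(values)

# Adding skipna=False makes pandas propagate the NaN as well.
pandas_strict = s.std(ddof=0, skipna=False)

# Without missing values, ddof=0 is sufficient.
clean = values[1:]
pandas_clean = pd.Series(clean).std(ddof=0)
numpy_clean = np.std(clean)

print(pandas_ddof0, numpy_default, pandas_strict)
print(pandas_clean, numpy_clean)
```

So a complete note would presumably need to mention both ddof=0 and skipna=False.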

@Ahmedniz1
Copy link

Ahmedniz1 commented Jan 19, 2024

Is this task available, or is it in discussion as of now?
I'd like to take the task once the decision is taken about adding the skipna=False option.
Thanks

@JoostvanPinxten
Author

JoostvanPinxten commented Jan 19, 2024

display(np.mean([np.nan,2,3,4,5])) # --> nan
display(pd.Series([np.nan,2,3,4,5]).mean()) # --> 3.5

The mean function has the same skipna=True issue as std.

GroupBy also does not support the skipna argument.

I think you may want to add these in a consistent manner over all GroupBy methods...?
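Until skipna is available on the GroupBy methods, one possible workaround (a sketch, not an official recommendation) is to pass a callable to agg so that the Series-level skipna argument can be used per group. The column names mirror the report; strict_std and strict_mean are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "A", "B", "B"],
    "Val2": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

grouped = df.groupby("Group")["Val2"]

# Default GroupBy behavior: NaN values are skipped within each group.
default_std = grouped.std()

# Workaround: call the Series method per group with skipna=False,
# so a group containing a NaN yields NaN.
strict_std = grouped.agg(lambda s: s.std(skipna=False))
strict_mean = grouped.agg(lambda s: s.mean(skipna=False))

print(default_std)
print(strict_std)
print(strict_mean)
```

A Python-level callable is typically slower than the built-in GroupBy reductions, so this is better suited to checking results than to large datasets.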

@JoostvanPinxten
Author

JoostvanPinxten commented Jan 19, 2024

What is the stance on being compatible with NumPy? Should Pandas follow that where possible?

NB: I would argue that changing the default setting of the parameter should be considered a breaking change in the API. I presume that a lot of scripts already assume that skipna=True is indeed the default behavior now.

@rhshadrach
Member

@Ahmedniz1

Is this task available, or is it in discussion as of now?

I think adding skipna to groupby operations needs more discussion. The addition to the documentation of skipna in Series/DataFrame methods for compatibility with NumPy would be very much welcome!

@JoostvanPinxten

What is the stance on being compatible with NumPy? Should Pandas follow that where possible?

I think that is too strong. I view compatibility with NumPy as a beneficial feature, but it must be weighed against other impacts of changes. In this instance, I am against changing the default of skipna to agree with NumPy, at least just for the sake of compatibility.

@torext
Contributor

torext commented Jan 22, 2024

NB: I would argue that changing the default setting of the parameter should be considered a breaking change in the API. I presume that a lot of scripts already assume that skipna=True is indeed the default behavior now.

I very much agree with this. Pandas is already a sort of convenience wrapper around NumPy, adding additional abstractions like coordinate names etc. Having NaNs ignored by default when computing the standard deviation seems desirable in most cases, and even NumPy has a separate function for handling NaNs in this case (np.nanstd). IMO it's safe to just assume that Pandas implements by default the more convenient of two options.
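As a quick illustration of the np.nanstd point, the following sketch (variable names are illustrative) compares the pandas default with NumPy's NaN-aware function. Note that ddof=1 still has to be passed to NumPy, since its default is ddof=0:

```python
import numpy as np
import pandas as pd

values = np.array([np.nan, 2.0, 3.0, 4.0, 5.0])

# pandas default: skipna=True, ddof=1
pandas_default = pd.Series(values).std()

# NumPy's NaN-aware counterpart; ddof must be set explicitly to match pandas.
numpy_nan_aware = np.nanstd(values, ddof=1)

print(pandas_default, numpy_nan_aware)
```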

Please do not yet again change default behaviour just because someone was bored on a Sunday afternoon and had an epiphany; people write massive libraries based on Pandas and its default behaviours are often taken into account for the sake of code-readability and -brevity. skipna=False is documented and anyone who wants that level of strictness can easily enable it already that way.

I do agree that the documentation giving instructions for replicating NumPy behaviour is incomplete; that's IMO all that should be fixed here.

@jbrockmendel
Member

+1 for adding the argument to groupby.std for consistency.

@rhshadrach
Member

I've updated #15675 for tracking adding skipna; reworking this as purely a documentation issue.

@rhshadrach rhshadrach added Docs and removed Bug Needs Discussion Requires discussion from core team before further action labels Jan 26, 2024
@rhshadrach rhshadrach changed the title BUG: inconsistent skipna arguments for std method of DataFrame/pd.Series/numpy DOC: Highlight the difference between DataFrame/pd.Series/numpy ops when there are NA values Jan 26, 2024
@JoostvanPinxten
Author

Sounds good to me. I hope to have some time tonight, but I am also OK if someone else picks this up.

@Ahmedniz1

Could you tell me where exactly in the documentation the difference should be highlighted?
I'd love to take this up.

@JoostvanPinxten
Author

I only know the resulting documentation, not where the sources are. There are quite a few places where e.g. std is used. A few samples:

GroupBy std
DataFrameGroupBy std
SeriesGroupBy std

It should contain something similar to:
DataFrame std

Regarding the skipna argument: "Exclude NA/null values. If an entire row/column is NA, the result will be NA."
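The second sentence of that description can be checked with a small sketch. The column names are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "partly_missing": [np.nan, 2.0, 3.0],
    "all_missing": [np.nan, np.nan, np.nan],
})

# skipna=True is the default: NaN values are excluded per column,
# but a column with nothing left to aggregate still yields NaN.
result = df.std()

print(result)
```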

Regarding the note about ddof and consistency with numpy, see an example here:
[image not preserved]

But there are other operations where skipna is also relevant and perhaps non-default in numpy. I do not know how to best approach this in the code-base though.
