DOC: Highlight the difference between DataFrame/pd.Series/numpy ops when there are NA values #56939
Comments
Thanks for the report. Agreed on adding the |
Related: #15675
The current docstring has a note:
That could indeed definitely be updated to mention this is only the case if there are no missing values (or that you have to specify |
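To illustrate the point about the docstring note, here is a minimal sketch, assuming the note refers to passing `ddof=0` to match NumPy's default. The sample values are made up for illustration. It suggests that the equivalence holds only when there are no missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
clean = pd.Series([1.0, 2.0, 3.0, 4.0])
with_na = pd.Series([1.0, np.nan, 3.0, 4.0])

# Without missing values, ddof=0 makes pandas agree with NumPy's default.
clean_matches = np.isclose(clean.std(ddof=0), np.std(clean.to_numpy()))

# With a missing value, pandas skips it (skipna=True by default),
# while np.std propagates the NaN.
pandas_result = with_na.std(ddof=0)
numpy_result = np.std(with_na.to_numpy())

print(clean_matches, pandas_result, numpy_result)
```

So `ddof=0` alone is not enough to reproduce `numpy.std` once the data contains NaN.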
Is this task available, or is it in discussion as of now?
display(np.mean([np.nan, 2, 3, 4, 5]))  # --> nan
display(pd.Series([np.nan, 2, 3, 4, 5]).mean())  # --> 3.5
The mean function has the same And the I think you may want to add these in a consistent manner over all
What is the stance on being compatible with NumPy? Should pandas follow NumPy where possible? NB: I would argue that changing the default value of the parameter should be considered a breaking API change. I presume that a lot of scripts already assume that skipna=True is the default behavior.
I think adding
I think that is too strong. I view compatibility with NumPy a beneficial feature, but must be weighed against other impacts of changes. In this instance, I am against changing the default of |
I very much agree with this. Pandas is already a sort of convenience wrapper around NumPy, adding additional abstractions like coordinate names etc. Having NaNs ignored by default when computing the standard deviation seems desirable in most cases, and even NumPy has a separate function for handling NaNs in this case ( Please do not yet again change default behaviour just because someone was bored on a Sunday afternoon and had an epiphany; people write massive libraries based on Pandas, and its default behaviours are often relied upon for the sake of code readability and brevity. I do agree that the documentation giving instructions for replicating NumPy behaviour is incomplete; that's IMO all that should be fixed here.
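For reference, a rough sketch of NumPy's NaN-aware reductions next to the pandas defaults. It reuses the values from the snippet earlier in this thread. `np.nanmean` and `np.nanstd` are picked here as examples of NumPy's NaN-aware functions, and `ddof=1` is passed explicitly on the assumption that the goal is to match the pandas default:

```python
import numpy as np
import pandas as pd

values = np.array([np.nan, 2.0, 3.0, 4.0, 5.0])
series = pd.Series(values)

plain_mean = np.mean(values)         # NaN propagates
nan_aware_mean = np.nanmean(values)  # NaN is ignored
pandas_mean = series.mean()          # skipna=True by default

# pandas uses ddof=1 for std by default, NumPy uses ddof=0,
# so ddof is set explicitly on the NumPy side.
nan_aware_std = np.nanstd(values, ddof=1)
pandas_std = series.std()

print(plain_mean, nan_aware_mean, pandas_mean)
print(nan_aware_std, pandas_std)
```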
+1 for adding the argument to groupby.std for consistency.
I've updated #15675 for tracking adding skipna; reworking this as purely a documentation issue.
Sounds good to me. I hope to have some time tonight, but I am also OK if someone else picks this up.
Could someone tell me where exactly in the documentation the difference should be highlighted?
I only know the resulting documentation, not where the sources are. There are quite a few places where e.g. std is used. A few samples: GroupBy std
It should contain something similar to:
Regarding the skipna argument: "Exclude NA/null values. If an entire row/column is NA, the result will be NA."
Regarding the note about ddof and consistency with numpy, see an example here:
But there are other operations where skipna is also relevant and perhaps non-default in numpy. I do not know how to best approach this in the code-base though.
Edit[rhshadrach]: The original report about inconsistent skipna arguments in groupby is captured in #15675. For this issue, it suffices to add notes to the documentation about the default incompatibility between pandas and NumPy as described below.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I've wasted quite some time trying to find out why in some edge cases (on a large dataset) a different value was reported by DataFrame.std vs numpy.std. The posts and documentation I've found focus on the DDOF parameter only. It turned out to be related to fewer NaN values being reported.
Expected Behavior
ChatGPT hinted to me that the pandas Series.std has a skipna argument that defaults to True. This differs from numpy.std, which does NOT skip NaN values by default and has no such optional argument at all.
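A minimal sketch of the discrepancy described above, using a made-up two-column DataFrame (this is not the original reproducible example). It assumes the aim is to show the defaults side by side, plus one way to make pandas reproduce NumPy's behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" contains one missing value.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, np.nan, 3.0, 4.0]}
)

# pandas defaults: skipna=True, ddof=1 -> a finite value for every column.
pandas_default = df.std()

# NumPy defaults: ddof=0 and no NaN skipping -> NaN for column "b".
numpy_default = np.std(df.to_numpy(), axis=0)

# Asking pandas for NumPy-like behaviour needs both arguments.
pandas_like_numpy = df.std(skipna=False, ddof=0)

print(pandas_default)
print(numpy_default)
print(pandas_like_numpy)
```

Going the other way, `np.nanstd(..., ddof=1)` should line up with the pandas default.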
I would suggest to:
I can contribute this patch if this is indeed the right way to go.
Installed Versions
INSTALLED VERSIONS
commit : 0f43794
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-372.19.1.el8_6.x86_64
Version : #1 SMP Tue Aug 2 16:19:42 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.3
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : 1.1.0
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
snappy : None
sqlalchemy : 1.4.49
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None