BUG: Numpy magic leading to inconsistent behaviour due to pandas default to skipna in sum. #53939

CompRhys · 2023-06-29T21:57:35Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
np.mean(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
np.sum(((np.nan, np.nan), (np.nan, np.nan)), axis=1)
np.mean(((np.nan, np.nan), (np.nan, np.nan)), axis=1)


output:
```python
>>> np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
0    0.0
1    0.0
dtype: float64
>>> np.mean(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
0   NaN
1   NaN
dtype: float64
>>> np.sum(((np.nan, np.nan), (np.nan, np.nan)), axis=1)
array([nan, nan])
>>> np.mean(((np.nan, np.nan), (np.nan, np.nan)), axis=1)
array([nan, nan])



### Issue Description

Surfaced a long running bug due to pandas overriding numpy but returning different results.

### Expected Behavior

Pandas should not use magic to override numpy operations where it doesn't give the same answer by default.

### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.10.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.5.0
Version          : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.3
numpy            : 1.23.5
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3.1
Cython           : None
pytest           : 7.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.1
html5lib         : None
pymysql          : 1.0.2
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.7.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           : None
fastparquet      : 2022.12.0
fsspec           : 2023.5.0
gcsfs            : 2023.5.0
matplotlib       : 3.6.2
numba            : 0.56.4
numexpr          : None
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.8.1
snappy           : None
sqlalchemy       : 1.4.44
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : 2023.3

</details>

The text was updated successfully, but these errors were encountered:

CompRhys · 2023-06-30T02:03:33Z

Potential fix would be to allow us to set either skipna or min_count globally from https://pandas.pydata.org/docs/user_guide/options.html#

attack68 · 2023-06-30T08:35:22Z

This isnt the same operation, so claiming it should give the same answer by default is debatable.
I would argue that:
np.sum(ndarray) and np.sum(pandas.DataFrame) are entirely able to give different answers dependent upon the data structure's own definition and operations on missing data.

Whilst your single example is localised in this case there may be other downstream reasons why your solution, which might be suitable here, is unsuitable in other cases, with what might appear to be as obvious cases local to them.

Of course the explicit solution is to do:

>>> np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).to_numpy(), axis=1)
array([nan, nan])

I would use this becuase it avoids the issue and the code is more statically typed so is more likely to be robust.

CompRhys · 2023-06-30T13:34:00Z

I think this is a very poor API decision then given that it is not possible to access the kwargs of the pandas method here due and it is not made clear to the user anywhere in the pandas documentation that this will happen.

attack68 · 2023-06-30T14:53:32Z

numpy.sum is a numpy method. It doesnt accept pandas kwargs. There is a pandas.DataFrame.sum method which accepts kwargs and behaves differently, or you can use numpy.sum(DataFrame.to_numpy()).

May I ask what you you have against using the alternative provided?

CompRhys · 2023-06-30T15:08:58Z

I already applied the fix you suggested before this issue once we understood what had happened it's easy to cast to array and then reconstruct the dataframe after. The issue is about ensuring that others don't see this non intuitive interaction. The code is not using pd.DataFrame().sum() it is explicitly using the np.sum() function with expectation that it will apply the np.sum() function. If pandas wants to apply magic such that numpy functions acting on pandas objects return pandas objects great, but if that magic changes the behaviour and the kwargs to control it are impossible to access then and this isn't documented anywhere in an obvious manner then this is obviously a bug in the API.

attack68 · 2023-06-30T15:50:33Z

Suppose your suggested change was implemented. Next week an issue will be raised that reads:

Hi, when I do this, I get the following:

>>> pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).sum()
0    0.0
1    0.0
dtype: float64

But when I do this, I get something else:

np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))))
array([nan, nan])

SInce the pandas docs identify the skipna=True default and they specifically state that this is the same as numpy.sum
Isnt this obviously a bug that it is not the same, or is the documentation is not correct?

I'm happy to reserve any judgement, I just disagree with you about the "obvious-ness" of this being a bug. I think np.sum should adopt the defaults of the pd.DataFrame.sum method, as currently in force.

Documentation can always be better. Pandas is volunteer and free software, and if you feel it can be improved for the benefit of others please do submit a PR.

CompRhys · 2023-06-30T16:39:47Z

pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).sum(axis=1)
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)

for the first line I would look at these docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html, for the second line I would look at these docs https://numpy.org/doc/stable/reference/generated/numpy.sum.html. You're telling me that 2 is an alias for 1 but where is it documented as such? I would think it's much more common for someone using numpy functions to see that the DataFrame is an array_like in the sense of numpy docs and so it is a valid input to the numpy function therefore I should expect the numpy function to apply. If the function is "equivalent" it doesn't make sense that the following should be invalid given the numpy docs?

np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1, dtype=float)

if you expect that to error then should this also error?

np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1, min_count=1)

My suggested change was to allow us a global option to adjust the default behaviour causing these supposedly "equivalent" methods to differ. The problem is that the snippet directly above won't take kwargs to either pandas or numpy ops. I fail to see how that would cause any follow-up issues? It would simply be included in the docs that you could set skipna or min_count globally in order to ensure that when pandas grabs the hooks on these numpy function you get the expected result.

CompRhys · 2023-10-27T00:22:30Z

BUMP

CompRhys · 2024-04-05T17:07:24Z

@rhshadrach Could you take a look at this report also, the fact that the pandas hooks into the numpy function and then doesn't replicate default behaviour has just caused a bunch of issues for someone on our team. There is no way I can find to get pandas to not skipna if using the numpy command on the df. I believe this matches the non-intuitive behaviour that you also agreed with in #58015

rhshadrach · 2024-04-06T14:36:12Z

Pandas should not use magic to override numpy operations where it doesn't give the same answer by default.

the fact that the pandas hooks into the numpy function

I do not understand these. You are calling a NumPy function, pandas has no control over its behavior. NumPy looks for a sum attribute on the object you pass and calls it.

https://github.com/numpy/numpy/blob/e59c074842e3f73483afa5ddef031e856b9fd313/numpy/_core/fromnumeric.py#L75

Unless your contention is that pandas should not have a DataFrame.sum method, it seems to me this is not in pandas control.

If I'm understanding the conversation right, in #53939 (comment) you state that you can use DataFrame.sum instead of np.sum. This seems like the correct solution to me. I am against the complexity and surprising behavior of having a global option to underride default arguments of pandas methods. Can you detail why you cannot use DataFrame.sum?

rhshadrach · 2024-04-06T14:37:20Z

I believe this matches the non-intuitive behaviour that you also agreed with in #58015

I don't see how this is at all related. Can you explain how this matches?

CompRhys · 2024-04-06T15:59:04Z

I think my issue is misplaced, the issue appears to be that the sum_dispatcher doesn't allow passthrough kwargs in numpy and the root issue I have is that I find the skipna=True default surprising given that it leads to silent failure.

This relates to the other issue in so far as the numerical output using pandas is different than using pure numpy for have is ostensibly the same code. This is problematic in our codebase as we have functions that operate on both numpy arrays and pandas dataframes. I guess it just means that's an anti-pattern that we need to eliminate.

CompRhys added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 29, 2023

CompRhys closed this as completed Apr 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Numpy magic leading to inconsistent behaviour due to pandas default to skipna in sum. #53939

BUG: Numpy magic leading to inconsistent behaviour due to pandas default to skipna in sum. #53939

CompRhys commented Jun 29, 2023

CompRhys commented Jun 30, 2023

attack68 commented Jun 30, 2023

CompRhys commented Jun 30, 2023

attack68 commented Jun 30, 2023

CompRhys commented Jun 30, 2023 •

edited

Loading

attack68 commented Jun 30, 2023

CompRhys commented Jun 30, 2023 •

edited

Loading

CompRhys commented Oct 27, 2023

CompRhys commented Apr 5, 2024

rhshadrach commented Apr 6, 2024 •

edited

Loading

rhshadrach commented Apr 6, 2024

CompRhys commented Apr 6, 2024

BUG: Numpy magic leading to inconsistent behaviour due to pandas default to skipna in sum. #53939

BUG: Numpy magic leading to inconsistent behaviour due to pandas default to skipna in sum. #53939

Comments

CompRhys commented Jun 29, 2023

Pandas version checks

Reproducible Example

CompRhys commented Jun 30, 2023

attack68 commented Jun 30, 2023

CompRhys commented Jun 30, 2023

attack68 commented Jun 30, 2023

CompRhys commented Jun 30, 2023 • edited Loading

attack68 commented Jun 30, 2023

CompRhys commented Jun 30, 2023 • edited Loading

CompRhys commented Oct 27, 2023

CompRhys commented Apr 5, 2024

rhshadrach commented Apr 6, 2024 • edited Loading

rhshadrach commented Apr 6, 2024

CompRhys commented Apr 6, 2024

CompRhys commented Jun 30, 2023 •

edited

Loading

CompRhys commented Jun 30, 2023 •

edited

Loading

rhshadrach commented Apr 6, 2024 •

edited

Loading