Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Numpy magic leading to inconsistent behaviour due to pandas default to skipna in sum. #53939

Closed
3 tasks done
CompRhys opened this issue Jun 29, 2023 · 12 comments
Closed
3 tasks done
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@CompRhys
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
np.mean(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
np.sum(((np.nan, np.nan), (np.nan, np.nan)), axis=1)
np.mean(((np.nan, np.nan), (np.nan, np.nan)), axis=1)


output:
```python
>>> np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
0    0.0
1    0.0
dtype: float64
>>> np.mean(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
0   NaN
1   NaN
dtype: float64
>>> np.sum(((np.nan, np.nan), (np.nan, np.nan)), axis=1)
array([nan, nan])
>>> np.mean(((np.nan, np.nan), (np.nan, np.nan)), axis=1)
array([nan, nan])


### Issue Description

Surfaced a long running bug due to pandas overriding numpy but returning different results.

### Expected Behavior

Pandas should not use magic to override numpy operations where it doesn't give the same answer by default.

### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.10.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.5.0
Version          : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.3
numpy            : 1.23.5
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3.1
Cython           : None
pytest           : 7.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.1
html5lib         : None
pymysql          : 1.0.2
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.7.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           : None
fastparquet      : 2022.12.0
fsspec           : 2023.5.0
gcsfs            : 2023.5.0
matplotlib       : 3.6.2
numba            : 0.56.4
numexpr          : None
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.8.1
snappy           : None
sqlalchemy       : 1.4.44
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : 2023.3

</details>
@CompRhys CompRhys added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 29, 2023
@CompRhys
Copy link
Author

Potential fix would be to allow us to set either skipna or min_count globally from https://pandas.pydata.org/docs/user_guide/options.html#

@attack68
Copy link
Contributor

This isnt the same operation, so claiming it should give the same answer by default is debatable.
I would argue that:
np.sum(ndarray) and np.sum(pandas.DataFrame) are entirely able to give different answers dependent upon the data structure's own definition and operations on missing data.

Whilst your single example is localised in this case there may be other downstream reasons why your solution, which might be suitable here, is unsuitable in other cases, with what might appear to be as obvious cases local to them.

Of course the explicit solution is to do:

>>> np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).to_numpy(), axis=1)
array([nan, nan])

I would use this becuase it avoids the issue and the code is more statically typed so is more likely to be robust.

@CompRhys
Copy link
Author

I think this is a very poor API decision then given that it is not possible to access the kwargs of the pandas method here due and it is not made clear to the user anywhere in the pandas documentation that this will happen.

@attack68
Copy link
Contributor

numpy.sum is a numpy method. It doesnt accept pandas kwargs. There is a pandas.DataFrame.sum method which accepts kwargs and behaves differently, or you can use numpy.sum(DataFrame.to_numpy()).

May I ask what you you have against using the alternative provided?

@CompRhys
Copy link
Author

CompRhys commented Jun 30, 2023

I already applied the fix you suggested before this issue once we understood what had happened it's easy to cast to array and then reconstruct the dataframe after. The issue is about ensuring that others don't see this non intuitive interaction. The code is not using pd.DataFrame().sum() it is explicitly using the np.sum() function with expectation that it will apply the np.sum() function. If pandas wants to apply magic such that numpy functions acting on pandas objects return pandas objects great, but if that magic changes the behaviour and the kwargs to control it are impossible to access then and this isn't documented anywhere in an obvious manner then this is obviously a bug in the API.

@attack68
Copy link
Contributor

Suppose your suggested change was implemented. Next week an issue will be raised that reads:

Hi, when I do this, I get the following:

>>> pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).sum()
0    0.0
1    0.0
dtype: float64

But when I do this, I get something else:

np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))))
array([nan, nan])

SInce the pandas docs identify the skipna=True default and they specifically state that this is the same as numpy.sum
Isnt this obviously a bug that it is not the same, or is the documentation is not correct?

Screenshot 2023-06-30 at 17 45 49

I'm happy to reserve any judgement, I just disagree with you about the "obvious-ness" of this being a bug. I think np.sum should adopt the defaults of the pd.DataFrame.sum method, as currently in force.

Documentation can always be better. Pandas is volunteer and free software, and if you feel it can be improved for the benefit of others please do submit a PR.

@CompRhys
Copy link
Author

CompRhys commented Jun 30, 2023

pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).sum(axis=1)
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)

for the first line I would look at these docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html, for the second line I would look at these docs https://numpy.org/doc/stable/reference/generated/numpy.sum.html. You're telling me that 2 is an alias for 1 but where is it documented as such? I would think it's much more common for someone using numpy functions to see that the DataFrame is an array_like in the sense of numpy docs and so it is a valid input to the numpy function therefore I should expect the numpy function to apply. If the function is "equivalent" it doesn't make sense that the following should be invalid given the numpy docs?

np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1, dtype=float)

if you expect that to error then should this also error?

np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1, min_count=1)

My suggested change was to allow us a global option to adjust the default behaviour causing these supposedly "equivalent" methods to differ. The problem is that the snippet directly above won't take kwargs to either pandas or numpy ops. I fail to see how that would cause any follow-up issues? It would simply be included in the docs that you could set skipna or min_count globally in order to ensure that when pandas grabs the hooks on these numpy function you get the expected result.

@CompRhys
Copy link
Author

BUMP

@CompRhys
Copy link
Author

CompRhys commented Apr 5, 2024

@rhshadrach Could you take a look at this report also, the fact that the pandas hooks into the numpy function and then doesn't replicate default behaviour has just caused a bunch of issues for someone on our team. There is no way I can find to get pandas to not skipna if using the numpy command on the df. I believe this matches the non-intuitive behaviour that you also agreed with in #58015

@rhshadrach
Copy link
Member

rhshadrach commented Apr 6, 2024

Pandas should not use magic to override numpy operations where it doesn't give the same answer by default.

the fact that the pandas hooks into the numpy function

I do not understand these. You are calling a NumPy function, pandas has no control over its behavior. NumPy looks for a sum attribute on the object you pass and calls it.

https://github.com/numpy/numpy/blob/e59c074842e3f73483afa5ddef031e856b9fd313/numpy/_core/fromnumeric.py#L75

Unless your contention is that pandas should not have a DataFrame.sum method, it seems to me this is not in pandas control.

If I'm understanding the conversation right, in #53939 (comment) you state that you can use DataFrame.sum instead of np.sum. This seems like the correct solution to me. I am against the complexity and surprising behavior of having a global option to underride default arguments of pandas methods. Can you detail why you cannot use DataFrame.sum?

@rhshadrach
Copy link
Member

I believe this matches the non-intuitive behaviour that you also agreed with in #58015

I don't see how this is at all related. Can you explain how this matches?

@CompRhys
Copy link
Author

CompRhys commented Apr 6, 2024

I think my issue is misplaced, the issue appears to be that the sum_dispatcher doesn't allow passthrough kwargs in numpy and the root issue I have is that I find the skipna=True default surprising given that it leads to silent failure.

This relates to the other issue in so far as the numerical output using pandas is different than using pure numpy for have is ostensibly the same code. This is problematic in our codebase as we have functions that operate on both numpy arrays and pandas dataframes. I guess it just means that's an anti-pattern that we need to eliminate.

@CompRhys CompRhys closed this as completed Apr 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

3 participants