BUG: Numpy magic leading to inconsistent behaviour due to pandas default to skipna in sum. #53939
Comments
Potential fix would be to allow us to set either |
This isn't the same operation, so claiming it should give the same answer by default is debatable. While your single example is localised, there may be downstream reasons why your solution, which might be suitable here, is unsuitable in other cases that seem just as obvious to the people they affect. Of course, the explicit solution is to do:
>>> np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).to_numpy(), axis=1)
array([nan, nan])
I would use this because it avoids the issue, and the code is more statically typed, so it is more likely to be robust. |
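For readers following along, here is a minimal sketch of the three calls being compared. The variable names are illustrative only, and the results assume a reasonably recent pandas and NumPy, where `np.sum` on a DataFrame delegates to `DataFrame.sum` with its default `skipna=True`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan)))

# pandas method: NaNs are skipped by default, so all-NaN rows sum to 0.0.
pandas_result = df.sum(axis=1)

# NumPy function on the DataFrame: delegates to the pandas method,
# so it also skips NaNs and returns a Series.
numpy_on_df = np.sum(df, axis=1)

# NumPy function on the underlying array: NaNs propagate.
numpy_on_array = np.sum(df.to_numpy(), axis=1)

print(pandas_result.tolist(), numpy_on_df.tolist(), numpy_on_array.tolist())
```

The last call is the workaround suggested above: converting with `to_numpy()` first means the pandas method is never involved.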
I think this is a very poor API decision then, given that it is not possible to access the kwargs of the pandas method here, and it is not made clear to the user anywhere in the pandas documentation that this will happen. |
May I ask what you have against using the alternative provided? |
I already applied the fix you suggested before opening this issue; once we understood what had happened, it was easy to cast to an array and then reconstruct the DataFrame afterwards. The issue is about ensuring that others don't run into this non-intuitive interaction. The code is not using |
Suppose your suggested change was implemented. Next week an issue will be raised that reads:
Hi, when I do this, I get the following:
>>> pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).sum()
0 0.0
1 0.0
dtype: float64
But when I do this, I get something else:
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))))
array([nan, nan])
Since the pandas docs identify the I'm happy to reserve any judgement, I just disagree with you about the "obviousness" of this being a bug. I think np.sum should adopt the defaults of the Documentation can always be better. Pandas is volunteer and free software, and if you feel it can be improved for the benefit of others, please do submit a PR. |
pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))).sum(axis=1)
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1)
For the first line I would look at these docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html, and for the second line at these docs: https://numpy.org/doc/stable/reference/generated/numpy.sum.html. You're telling me that 2 is an alias for 1, but where is it documented as such? I would think it's much more common for someone using numpy functions to see that the DataFrame is an array_like in the sense of the numpy docs, and is therefore a valid input to the numpy function, so I should expect the numpy function's behaviour to apply. If the functions are "equivalent", it doesn't make sense that the following should be invalid given the numpy docs:
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1, dtype=float)
If you expect that to error, then should this also error?
np.sum(pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan))), axis=1, min_count=1)
My suggested change was to allow us a global option to adjust the default behaviour that causes these supposedly "equivalent" methods to differ. The problem is that the snippet directly above won't pass kwargs through to either the pandas or the numpy operation. I fail to see how that would cause any follow-up issues. It would simply be included in the docs that you could set |
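A small sketch of the keyword problem described above, under the assumption that `np.sum` rejects keywords outside its own signature with a `TypeError` (which is what current NumPy releases do as far as I know). The `rejected` flag is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(((np.nan, np.nan), (np.nan, np.nan)))

# np.sum has no min_count parameter, so the keyword is rejected
# by NumPy and never reaches the pandas method.
try:
    np.sum(df, axis=1, min_count=1)
    rejected = False
except TypeError:
    rejected = True

# Calling the pandas method directly exposes both options,
# and either one makes the all-NaN rows come out as NaN.
via_min_count = df.sum(axis=1, min_count=1)
via_skipna = df.sum(axis=1, skipna=False)

print(rejected, via_min_count.tolist(), via_skipna.tolist())
```

So the NaN-propagating result is reachable, but only by calling `DataFrame.sum` explicitly rather than going through `np.sum`.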
BUMP |
@rhshadrach Could you take a look at this report as well? The fact that pandas hooks into the numpy function and then doesn't replicate its default behaviour has just caused a bunch of issues for someone on our team. I can find no way to get pandas not to skip NaNs when using the numpy function on the df. I believe this matches the non-intuitive behaviour that you also agreed with in #58015 |
I do not understand these. You are calling a NumPy function, pandas has no control over its behavior. NumPy looks for a Unless your contention is that pandas should not have a If I'm understanding the conversation right, in #53939 (comment) you state that you can use |
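To illustrate the delegation being described, here is a toy sketch that does not involve pandas at all. `Reducible` is a hypothetical class invented for this example; the behaviour shown relies on my understanding that `np.sum` calls a non-ndarray object's own `sum` method (passing `axis` and `out`) when one exists.

```python
import numpy as np


class Reducible:
    """Hypothetical non-ndarray container with its own sum method."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def sum(self, axis=None, out=None, **kwargs):
        # Mimic a "skip NaN by default" reduction, as pandas does.
        return np.nansum(self.values, axis=axis)


obj = Reducible([[np.nan, 1.0], [np.nan, np.nan]])

# NumPy finds obj.sum and calls it, so NaNs are skipped here.
delegated = np.sum(obj, axis=1)

# On the raw ndarray, NumPy's own reduction runs and NaNs propagate.
plain = np.sum(obj.values, axis=1)

print(delegated, plain)
```

The point of the sketch is that the result of `np.sum(obj)` is decided by whatever `obj.sum` does, which is why the defaults of the object's method, not NumPy's, apply.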
I don't see how this is at all related. Can you explain how this matches? |
I think my issue is misplaced. The issue appears to be that the sum_dispatcher doesn't allow passthrough kwargs in numpy, and the root issue I have is that I find the This relates to the other issue insofar as the numerical output using pandas is different from that of pure numpy for what is ostensibly the same code. This is problematic in our codebase, as we have functions that operate on both numpy arrays and pandas dataframes. I guess it just means that's an anti-pattern that we need to eliminate. |
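For codebases that do accept both arrays and DataFrames, one possible way to sidestep the difference is to normalise the input before reducing. The helper below, `nan_propagating_sum`, is a hypothetical name and only a sketch: it assumes purely numeric data and deliberately returns a plain ndarray, so any index or column labels are dropped.

```python
import numpy as np
import pandas as pd


def nan_propagating_sum(obj, axis=None):
    """Hypothetical helper: sum that never skips NaN, for arrays or DataFrames."""
    # Converting to an ndarray first means NumPy's own reduction runs,
    # regardless of whether obj defines its own sum method.
    return np.sum(np.asarray(obj, dtype=float), axis=axis)


data = ((np.nan, 1.0), (2.0, 3.0))
from_array = nan_propagating_sum(np.array(data), axis=1)
from_frame = nan_propagating_sum(pd.DataFrame(data), axis=1)

print(from_array, from_frame)
```

If the labels matter, the alternative is to branch on the input type and call `DataFrame.sum(axis=..., skipna=False)` for pandas objects.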
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example