DEPS: Test NEP 50 #55739
Conversation
@seberg could use your insight into this behavior:

```python
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: np._set_promotion_state("weak")

In [11]: lib.maybe_convert_objects(np.array([np.int64(0)], dtype=object))
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Cell In[11], line 1
----> 1 lib.maybe_convert_objects(np.array([np.int64(0)], dtype=object))

File ~/pandas/_libs/lib.pyx:2601, in pandas._libs.lib.maybe_convert_objects()
   2599
   2600     if ((seen.uint_ and seen.sint_) or
-> 2601             val > oUINT64_MAX or val < oINT64_MIN):
   2602         seen.object_ = True
   2603         break

OverflowError: Python int too large to convert to C long
```

IIRC (lines 10 to 14 in 984d755) being cast to NumPy ints. (Lines 143 to 145 in 984d755) I am not exactly sure, but is
@mroeschke the problem is promotion: normally, promotion converts the Python int to the existing dtype in binary ops. That said, the NumPy dev branch has additional logic in place to allow comparisons to always succeed.
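A minimal illustration of that comparison logic, assuming NumPy 2.x (or the weak promotion state):

```python
import numpy as np

# 2**70 does not fit in int64, but under NEP 50 the comparison still
# succeeds instead of raising an OverflowError.
print(np.array([1], dtype="int64") < 2**70)  # [ True]
```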
OK great! This is definitely more convenient behavior.
Also, if we have a dependency (pytables) that uses non-NEP 50 compatible behavior and we make pandas NEP 50 compatible, would you recommend using
OK, I went and dug into a couple of tests, and it looks like we might be failing some because of the new rules, such as uint64 * int64 = float64. I guess we could fix it by trying to cast to the Series/Index dtype inside arithmetic operations.
Just to note, that rule is not new (everyone hates it). What is new, though, is the handling of 0-D uint64 or int64 values: previously they were cast based on their value. You still get that behavior, but only if the operand is a Python integer.
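A small sketch of the distinction, assuming NumPy 2.x semantics:

```python
import numpy as np

arr = np.array([1, 2], dtype="uint64")

# A Python int is weakly typed under NEP 50 and adopts the array's dtype.
print((arr * 2).dtype)            # uint64

# A NumPy scalar is strongly typed, so uint64 * int64 has no common
# integer type and promotes to float64 (the old, much-disliked rule).
print((arr * np.int64(2)).dtype)  # float64
```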
That makes sense; the failing test I was looking at was doing something like
This is kind of a funky case. Normally, I think it might be OK to just follow the NEP 50 changes (and change the tests), since I think users will also expect Index/Series arithmetic to follow numpy semantics. As Sebastian mentions above, you already get float64 if you multiply by something like `np.array([3], dtype="int64")`.

The only issue, though, is that the Arrow-backed Indexes will still follow the old behavior of trying to safe-cast, so we'd be diverging in terms of behavior there, and I'd really like to keep everything consistent. Here's a little demonstration of the differences:
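A minimal sketch of that divergence, assuming pyarrow is installed; the int64 result for the pyarrow case is as reported in the next comment:

```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# NumPy under NEP 50: uint64 * int64 promotes to float64.
print((np.array([1, 2], dtype="uint64") * np.int64(3)).dtype)  # float64

# pyarrow applies its own promotion rules to the same operation and is
# reported in this thread to return int64 rather than float64.
result = pc.multiply(pa.array([1, 2], type=pa.uint64()),
                     pa.scalar(3, type=pa.int64()))
print(result.type)  # int64 (as reported)
```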
@pandas-dev/pandas-core thoughts?
There are a lot of options here, all with different merits, and I'm not sure there will be a clear solution. Are those pyarrow results documented behavior? Even uint64 * int64 resulting in int64 is surprising to me; I would have expected it to follow C type promotion rules and end up with uint64.
OK, added more tests. Looks like the one case (that I've caught) where my hack doesn't work is 0, so I special-cased that separately.
```python
with pytest.raises(ValueError, match=msg):
if np_version_gt2:
    msg = "The elements provided in the data cannot all be casted to the dtype"
    err = OverflowError
```
should we be catching and re-raising?
Not sure if it's worth it. I like the new error from numpy better anyways.
Do we need to preserve backwards compatibility here?
I am OK with this; I am not sure I see the value in re-raising as a ValueError when the OverflowError is more descriptive of what went wrong to begin with.
Agreed, OverflowError seems More Correct; my main opinion is that we should be consistent (no idea what the status quo is consistency-wise).
Yeah, that's a good point. I wouldn't mind generally not replacing OverflowError with ValueError; they are both part of the stdlib, but the former is more descriptive. I am unaware of the history behind why that pattern might have been implemented.
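For reference, a minimal sketch of the re-raise pattern being discussed (the helper name is hypothetical, not pandas API):

```python
import numpy as np

def coerce_to_int64(values):
    # Hypothetical helper: the backwards-compatible pattern wraps numpy's
    # OverflowError in a ValueError; the reviewers above lean toward
    # simply letting the more descriptive OverflowError propagate.
    try:
        return np.array(values, dtype="int64")
    except OverflowError as err:
        raise ValueError("values cannot all be cast to int64") from err
```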
```python
        names=["CL0", "CL1"],
    ),
    "left",
    "multiindex_1",
),
(
    MultiIndex.from_tuples(list(zip(range(4), np.mod(range(4), 2)))),
    MultiIndex.from_arrays([np.arange(4), np.mod(range(4), 2)]),
```
Might need to pass dtype to arange, otherwise this will be different on 32-bit builds.
I guess it won't make a difference in this case.
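A small sketch of the concern; `np.arange` without a dtype uses the platform default integer:

```python
import numpy as np

# On 32-bit builds the default integer is int32, so an explicit dtype
# keeps the test data identical across platforms.
idx = np.arange(4, dtype="int64")
print(idx.dtype)  # int64 everywhere
```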
pandas/core/dtypes/cast.py
```python
    # Special case 0, our trick will not work for it since
    # np.min_scalar_type(-0) will still give something unsigned
    right = left_dtype
elif right > 0 and not np.issubdtype(left_dtype, np.unsignedinteger):
```
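The special case exists because `np.min_scalar_type` only picks a signed type for strictly negative values; a quick illustration:

```python
import numpy as np

print(np.min_scalar_type(-3))  # int8: negative values force a signed type
print(np.min_scalar_type(-0))  # uint8: -0 == 0 is non-negative, stays unsigned
```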
I think this would be simpler if you just did:

```python
elif right > 0 and right <= 2 ** (8 * left_dtype.itemsize - 1) - 1:
    # use signed dtype
```

That would cut down on some of the branching.
Hm, do you know a way to programmatically convert from the unsigned dtype to the signed one? I think that was why I went with my hack in the first place.
I think you could create the new type by doing something like `f"i{left_dtype.itemsize}"`.
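A standalone sketch of that suggestion:

```python
import numpy as np

left_dtype = np.dtype("uint32")
# "i" plus the itemsize names the signed counterpart: "i4" -> int32
signed_dtype = np.dtype(f"i{left_dtype.itemsize}")
print(signed_dtype)  # int32
```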
Nice! Updated the code.
lgtm
lgtm - nice work!
Will put this in for the RC. Happy to address review comments in a follow-up. Thanks for the reviews everyone!
```diff
@@ -92,7 +92,7 @@ jobs:
   - name: "Numpy Dev"
     env_file: actions-311-numpydev.yaml
     pattern: "not slow and not network and not single_cpu"
     test_args: "-W error::DeprecationWarning -W error::FutureWarning"
```
I think we still need to check for a DeprecationWarning here.
```diff
@@ -30,9 +30,9 @@ authors = [
 license = {file = 'LICENSE'}
 requires-python = '>=3.9'
 dependencies = [
 "numpy>=1.22.4,<2; python_version<'3.11'",
```
Shouldn't we also unpin numpy in our ci/deps/ files?
```python
    right = left_dtype
elif (
    not np.issubdtype(left_dtype, np.unsignedinteger)
    and 0 < right <= 2 ** (8 * right_dtype.itemsize - 1) - 1
```
`2 ** (8 * right_dtype.itemsize - 1) - 1` can be replaced with `np.iinfo(right_dtype).max`, I think.
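The two expressions agree when `right_dtype` is a signed integer dtype (for an unsigned dtype, `np.iinfo(...).max` is one bit larger), e.g.:

```python
import numpy as np

dt = np.dtype("int64")
assert 2 ** (8 * dt.itemsize - 1) - 1 == np.iinfo(dt).max  # both 2**63 - 1
```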
Follow-up PR created @ #56682
* DEPS: Test NEP 50
* Use Python floats in test_maybe_promote_float_with_float
* Refactor test_to_html_multiindex to allow tests to collect
* Suppress DeprecationWarning for now
* Use old invocation
* Use Python ints in _range.py functions
* Address test_constructor
* Fix test_constructor_coercion_signed_to_unsigned
* Fix test_constructor_coercion_signed_to_unsigned
* Cast numpy scalars as python scalars before arith ops
* add xfail reason to TestCoercionFloat32
* only set promotion state for numpy > 2.0
* order was backwards
* Version promotion state call
* fix timedelta tests
* go for green
* fix non npdev too?
* fixes
* adjust xfail condition
* go for green
* add tests
* add negative numbers test
* updates
* fix accidental changes
* more
* simplify
* linter

Co-authored-by: Thomas Li <[email protected]>
"numpy>=1.23.2,<2; python_version=='3.11'", | ||
"numpy>=1.26.0,<2; python_version>='3.12'", | ||
"numpy>=1.22.4; python_version<'3.11'", | ||
"numpy>=1.23.2; python_version=='3.11'", | ||
"numpy>=1.26.0; python_version>='3.12'", |
This PR removed the numpy<2 pin again, which is good for the development branch (and nightly wheels), but shouldn't we still keep the pin for the release branch?
Unless we are planning to build the pandas 2.2.0 wheels with nightly numpy (which isn't the case right now), the upcoming pandas 2.2.0 release is still not ABI-compatible with numpy 2.0.
I think this is right by my understanding of packaging around the NumPy 2 release. We should keep the pin on until wheels can be built against NumPy 2.0.0rc1; then we can build against NumPy>=2.0.0rc1 and have a runtime dependency of NumPy 1.2x+.
Slightly off topic, but do we know what prevents us from using build isolation with meson? That would seemingly make this issue easier to manage as well.
> This PR removed the numpy<2 pin again, which is good for the development branch (and nightly wheels), but shouldn't we still keep the pin for the release branch?
> Unless we are planning to build the pandas 2.2.0 wheels with nightly numpy (which isn't the case right now), the upcoming pandas 2.2.0 release is still not ABI-compatible with numpy 2.0.
Yeah, I just misjudged the timing for the numpy 2.0 release. (Originally, I think their tracker issue said that the RC would happen in January, which we could then build against to have compat with both numpy 1 and 2.) I'll put up a PR to put back the pin.
> Slightly off topic, but do we know what prevents us from using build isolation with meson? That would seemingly make this issue easier to manage as well.
Sorry, not quite following here. IIUC, build isolation means building the package in a separate environment from the one it's installed into.

We currently just disable it for editable builds (otherwise the auto-rebuilding wouldn't work), and also on regular CI (since there's no point in installing the build tools again when they're already installed in the conda environment). I'm pretty sure cibuildwheel might use it, though.
> since there's no point in installing the build tools again when they're already installed in the conda environment
The benefit depends on whether you consider the conda environment to be a build environment or a runtime environment. The advantage of an isolated build environment is that you can build your project against a particular version like oldest-supported-numpy for binary compatibility but choose a different version at runtime.
I haven't closely tracked the particular case of NumPy 2.0, but it sounds like that could be useful based on @bashtage's comment #55739 (comment).
closes #56563