
DEPS: Test NEP 50 #55739

Merged
merged 36 commits into pandas-dev:main from compat/nep50 on Dec 22, 2023
Conversation

@mroeschke (Member) commented Oct 27, 2023

closes #56563

@mroeschke added the Compat (pandas objects compatibility with Numpy or Python functions) label on Oct 27, 2023
@mroeschke (Member Author) commented:

@seberg could use your insight into this behavior:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: np._set_promotion_state("weak")

In [4]: from pandas._libs import lib

In [11]: lib.maybe_convert_objects(np.array([np.int64(0)], dtype=object))
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Cell In[11], line 1
----> 1 lib.maybe_convert_objects(np.array([np.int64(0)], dtype=object))

File ~/pandas/_libs/lib.pyx:2601, in pandas._libs.lib.maybe_convert_objects()
   2599 
   2600                 if ((seen.uint_ and seen.sint_) or
-> 2601                         val > oUINT64_MAX or val < oINT64_MIN):
   2602                     seen.object_ = True
   2603                     break

OverflowError: Python int too large to convert to C long

IIRC oUINT64_MAX and oINT64_MIN are originally C ints

INT64_MIN,
UINT8_MAX,
UINT16_MAX,
UINT32_MAX,
UINT64_MAX,

being cast and boxed as Python int objects.

pandas/pandas/_libs/lib.pyx

Lines 143 to 145 in 984d755

object oINT64_MAX = <int64_t>INT64_MAX
object oINT64_MIN = <int64_t>INT64_MIN
object oUINT64_MAX = <uint64_t>UINT64_MAX

I am not exactly sure, but is val a NumPy int64 being cast here?

@seberg (Contributor) commented Oct 28, 2023:

@mroeschke the problem is promotion: normally, promotion will convert the Python int to the other operand's dtype in binary ops. That said, the NumPy dev branch has additional logic in place to allow comparisons to always succeed.

@mroeschke (Member Author) commented:

> That said, the NumPy dev branch has additional logic in place to allow comparisons to always succeed.

OK great! This is definitely more convenient behavior.
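
For illustration, a sketch of the behavior being described, assuming NumPy 2.x dev semantics (under the 1.x weak-promotion preview the same comparison raises OverflowError, as in the traceback above):

>>> import numpy as np
>>> print(np.int64(0) > 2**64 - 1)  # out-of-range Python int; the comparison succeeds
False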

@mroeschke (Member Author) commented:

Also, if we have a dependency (pytables) that relies on non-NEP 50 compatible behavior and we make pandas NEP 50 compatible, would you recommend using _set/get_promotion_state around the calls that rely on the old promotion rules?
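
One possible shape for that wrapping, a minimal sketch using the private NumPy APIs _get/_set_promotion_state (these exist only on the NumPy versions that support both rule sets, and the wrapped call below is hypothetical):

import contextlib
import numpy as np

@contextlib.contextmanager
def legacy_promotion():
    # Temporarily restore legacy promotion rules around a call into a
    # dependency that is not yet NEP 50 compatible, restoring the
    # previous state even if the call raises.
    old = np._get_promotion_state()
    np._set_promotion_state("legacy")
    try:
        yield
    finally:
        np._set_promotion_state(old)

# hypothetical usage:
# with legacy_promotion():
#     read_with_pytables()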

@lithomas1 (Member) commented:

OK, I went and dug into a couple of tests, and it looks like we might be failing some of them because of the new rules, such as uint64 * int64 = float64.

I guess we could fix it by trying to cast to the Series/Index dtype inside arithmetic operations.

@seberg (Contributor) commented Nov 6, 2023:

> the new rules such as uint64 * int64 = float64

Just to note, that rule is not new (everyone hates it). What is new, though, is the handling of a 0-D uint64 or int64 object: previously it was cast by value. You still get that behavior, but only if it is a Python integer.
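
A quick sketch of that distinction, assuming NumPy 2.x semantics:

>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype="uint64")
>>> (a * 3).dtype            # Python int: still converted, stays uint64
dtype('uint64')
>>> (a * np.int64(3)).dtype  # 0-D NumPy scalar: previously cast by value, now keeps int64
dtype('float64')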

@lithomas1 (Member) commented Nov 8, 2023:

> Just to note, that rule is not new (everyone hates it). What is new, though, is the handling of a 0-D uint64 or int64 object: previously it was cast by value. You still get that behavior, but only if it is a Python integer.

That makes sense; the failing test I was looking at was doing something like

pd.Index([1,2,3], dtype="uint64") * np.int64(3)

This is kind of a funky case. Normally, I think it might be OK to just follow the NEP 50 changes (and change the tests), since I think users will also expect Index/Series arithmetic to follow NumPy semantics. As Sebastian mentions above, you already get float64 if you multiply by something like np.array([3], dtype="int64").

The only issue here, though, is that the Arrow-backed Indexes will still follow the old behavior of trying to safe-cast, so we'd be diverging in behavior there, and I'd really like to keep everything consistent.

Here's a little demonstration of the differences:

>>> import numpy as np
>>> import pandas as pd
>>> pd.Index([1,2,3], dtype="uint64") * np.int64(3)
Index([3.0, 6.0, 9.0], dtype='float64') # This makes the tests fail
>>> pd.Index([1,2,3], dtype="uint64").astype("uint64[pyarrow]") * np.int64(3)
Index([3, 6, 9], dtype='int64[pyarrow]') # Note how pyarrow preserves the int dtype
>>> pd.Index([1,2,3], dtype="uint64") * np.array([3], dtype="int64")
Index([3.0, 6.0, 9.0], dtype='float64') # This happens even in older numpys
>>> pd.Index([1,2,3], dtype="uint64").astype("uint64[pyarrow]") * np.array([3], dtype="int64")
# This doesn't work in pyarrow

@pandas-dev/pandas-core thoughts?

@WillAyd (Member) commented Nov 8, 2023:

There are a lot of options here all with different merits and I'm not sure there will be a clear solution.

Are those pyarrow results documented behavior? Even uint64 * int64 resulting in int64 is surprising to me; I would have expected it to follow C type promotion rules and end up with uint64.

@lithomas1 mentioned this pull request on Dec 18, 2023
@lithomas1 (Member) commented:

OK, added more tests. It looks like the one case (that I've caught) where my hack doesn't work is 0, so I special-cased that out separately.

@lithomas1 requested a review from WillAyd on December 20, 2023
-with pytest.raises(ValueError, match=msg):
+if np_version_gt2:
+    msg = "The elements provided in the data cannot all be casted to the dtype"
+    err = OverflowError
Member: should we be catching and re-raising?

Member: Not sure if it's worth it. I like the new error from NumPy better anyway.

Do we need to preserve backwards compatibility here?

Member: I am OK with this - I am not sure I see the value in re-raising as a ValueError when the OverflowError is more descriptive of what went wrong to begin with.

Member: Agreed, OverflowError seems More Correct; my main opinion is that we should be consistent (no idea what the status quo is consistency-wise).

Member: Yeah, that's a good point. I wouldn't mind generally not replacing OverflowError with ValueError; they are both part of the stdlib, but the former is more descriptive.

I am unaware of the history behind why that pattern might have been implemented.

names=["CL0", "CL1"],
),
"left",
"multiindex_1",
),
(
MultiIndex.from_tuples(list(zip(range(4), np.mod(range(4), 2)))),
MultiIndex.from_arrays([np.arange(4), np.mod(range(4), 2)]),
Member: might need to pass dtype to arange, otherwise this will be different on 32-bit builds

Member: I guess it won't make a difference in this case.

# Special case 0, our trick will not work for it since
# np.min_scalar_type(-0) will still give something unsigned
right = left_dtype
elif right > 0 and not np.issubdtype(left_dtype, np.unsignedinteger):
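
For context, a small illustration of the min_scalar_type trick the comment above refers to, and why 0 needs special casing:

>>> import numpy as np
>>> np.min_scalar_type(3)   # positive scalars come back unsigned
dtype('uint8')
>>> np.min_scalar_type(-3)  # negative scalars come back signed
dtype('int8')
>>> np.min_scalar_type(-0)  # -0 is just 0, so it still comes back unsigned
dtype('uint8')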
Member: I think this would be simpler if you just did:

elif right > 0 and right <= 2 ** (8 * left_dtype.itemsize - 1) - 1:
    # use signed dtype

It would cut down on some of the branching.

Member: Hm, do you know a way to programmatically convert from the unsigned dtype to the signed one?

I think that was why I went with my hack in the first place.

Member: I think you could create the new type by doing something like f"i{left_dtype.itemsize}".
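
A minimal sketch of that suggestion (left_dtype is a stand-in name here):

>>> import numpy as np
>>> left_dtype = np.dtype("uint32")
>>> np.dtype(f"i{left_dtype.itemsize}")  # signed counterpart of the same width
dtype('int32')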

Member: Nice! Updated the code.

@lithomas1 requested a review from phofl on December 21, 2023
@lithomas1 linked an issue on Dec 21, 2023 that may be closed by this pull request
@phofl (Member) left a comment:

lgtm

@WillAyd (Member) left a comment:

lgtm - nice work!

@lithomas1 (Member) commented:

Will put this in for the RC. Happy to address review comments in a follow-up.

Thanks for the reviews everyone!

@lithomas1 merged commit e0e47e8 into pandas-dev:main on Dec 22, 2023
45 checks passed
@@ -92,7 +92,7 @@ jobs:
- name: "Numpy Dev"
env_file: actions-311-numpydev.yaml
pattern: "not slow and not network and not single_cpu"
test_args: "-W error::DeprecationWarning -W error::FutureWarning"
Member Author: I think we still need to check for a DeprecationWarning here.

@@ -30,9 +30,9 @@ authors = [
license = {file = 'LICENSE'}
requires-python = '>=3.9'
dependencies = [
"numpy>=1.22.4,<2; python_version<'3.11'",
Member Author: Shouldn't we also unpin numpy in our ci/deps/ files?

right = left_dtype
elif (
not np.issubdtype(left_dtype, np.unsignedinteger)
and 0 < right <= 2 ** (8 * right_dtype.itemsize - 1) - 1
Member Author: 2 ** (8 * right_dtype.itemsize - 1) - 1 can be replaced with np.iinfo(right_dtype).max, I think.
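
A quick check of that equivalence (right_dtype is a stand-in name here):

>>> import numpy as np
>>> right_dtype = np.dtype("int64")
>>> np.iinfo(right_dtype).max == 2 ** (8 * right_dtype.itemsize - 1) - 1
True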

@lithomas1 (Member) commented:

Follow-up PR created @ #56682

@mroeschke deleted the compat/nep50 branch on December 29, 2023
cbpygit pushed a commit to cbpygit/pandas that referenced this pull request Jan 2, 2024
* DEPS: Test NEP 50

* Use Python floats in test_maybe_promote_float_with_float

* Refactor test_to_html_multiindex to allow tests to collect

* Suppress DeprecationWarning for now

* Use old invocation

* Use Python ints in _range.py functions

* Address test_constructor

* Fix test_constructor_coercion_signed_to_unsigned

* Fix test_constructor_coercion_signed_to_unsigned

* Cast numpy scalars as python scalars before arith ops

* add xfail reason to TestCoercionFloat32

* only set promotion state for numpy > 2.0

* order was backwards

* Version promotion state call

* fix timedelta tests

* go for green

* fix non npdev too?

* fixes

* adjust xfail condition

* go for green

* add tests

* add negative numbers test

* updates

* fix accidental changes

* more

* simplify

* linter

---------

Co-authored-by: Thomas Li <[email protected]>
Comment on lines -34 to +35

-    "numpy>=1.23.2,<2; python_version=='3.11'",
-    "numpy>=1.26.0,<2; python_version>='3.12'",
+    "numpy>=1.22.4; python_version<'3.11'",
+    "numpy>=1.23.2; python_version=='3.11'",
+    "numpy>=1.26.0; python_version>='3.12'",
Member: This PR removed the numpy<2 pin again, which is good for the development branch (and nightly wheels), but shouldn't we still keep the pin for the release branch?

Unless we are planning to build the pandas 2.2.0 wheels with nightly numpy (which isn't the case right now), the upcoming pandas 2.2.0 release is still not ABI compatible with numpy 2.0.

Contributor: I think this is right by my understanding of packaging around the NumPy 2 release. We should keep the pin until wheels can be built against NumPy 2.0.0rc1; then we can build against NumPy>=2.0.0rc1 and have a runtime dependency of NumPy 1.2x+.

Member: Slightly off topic, but do we know what prevents us from using build isolation with meson? That would seemingly make this issue easier to manage as well.

Member:

> This PR removed the numpy<2 pin again, which is good for the development branch (and nightly wheels), but shouldn't we still keep the pin for the release branch?
>
> Unless we are planning to build the pandas 2.2.0 wheels with nightly numpy (which isn't the case right now), the upcoming pandas 2.2.0 release is still not ABI compatible with numpy 2.0.

Yeah, I just misjudged the timing for the numpy 2.0 release. (Originally, I think their tracker issue said that the RC would happen in January, which we could then build against to have compat with both numpy 1 and 2.)

I'll put up a PR to put back the pin.

Member:

> Slightly off topic, but do we know what prevents us from using build isolation with meson? That would seemingly make this issue easier to manage as well.

Sorry, not quite following here. IIUC, build isolation means building the package in a separate environment from the one where it's installed.

We currently just disable it for editable builds (otherwise the auto-rebuilding wouldn't work), and also on regular CI (since there's no point in installing the build tools again when they're already installed in the conda environment).

I'm pretty sure cibuildwheel uses it, though.

Member:

> since there's no point in installing the build tools again when they're already installed in the conda environment

The benefit depends on whether you consider the conda environment a build environment or a runtime environment. The advantage of an isolated build environment is that you can build your project against a particular version, like oldest-supported-numpy for binary compatibility, but choose a different version for the runtime.

I haven't closely tracked the particular case of NumPy 2.0, but it sounds like that could be useful based on @bashtage's comment #55739 (comment)

Labels: Blocker for rc (Blocking issue or pull request for release candidate), Compat (pandas objects compatibility with Numpy or Python functions)

Successfully merging this pull request may close these issues.

BUG: pandas nightly wheel requires numpy<2
8 participants