
BUG: setting 1.02 into float32 Series upcasts to float64 #55679

Open · 2 of 3 tasks
batterseapower opened this issue Oct 25, 2023 · 19 comments

Labels: Bug, Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

@batterseapower (Contributor)

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

x = pd.Series(index=['a', 'b'], dtype='f4')
x.iloc[0] = 1.02
print(x.dtype)

Issue Description

In 2.1.1, this code prints the following warning:

FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '1.02' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.

It then shows that the post-assignment dtype is float64, which is presumably why the warning is being raised: Pandas is converting the backing array from f4 to f8 behind the scenes for some reason.

Do you really intend this to raise an exception in the future? It seems like it would be better to just assign the literal to the existing float32 array. I checked what earlier versions of Pandas did: Pandas 1.4.4 kept the dtype as float32 after the assignment, while Pandas 1.5.2 changed it to float64. I'm not sure whether that was intentional - it doesn't seem to be explicitly called out in the release notes for e.g. 1.5.0 (though it's possible I missed something).

The warning can of course be suppressed by writing np.float32(1.02) instead of the simple literal, but it doesn't seem great to require users to write this kind of stuff for every literal assignment.
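For reference, the explicit-cast workaround looks like this (same repro as above):

import numpy as np
import pandas as pd

x = pd.Series(index=['a', 'b'], dtype='f4')
x.iloc[0] = np.float32(1.02)  # explicit float32 scalar: no warning, dtype unchanged
print(x.dtype)  # float32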

Expected Behavior

Just assign the literal to the float32 array with no warning and no exception about dtype conversion.

Installed Versions

INSTALLED VERSIONS
------------------
commit : e86ed37
python : 3.10.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-348.20.1.el8_5.x86_64
Version : #1 SMP Thu Mar 10 20:59:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.1
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.0
pytest : 7.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.6.0
scipy : 1.10.1
sqlalchemy : None
tables : None
tabulate : None
xarray : 2023.4.2
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

@batterseapower batterseapower added the Bug and Needs Triage labels Oct 25, 2023
@hvsesha

hvsesha commented Oct 25, 2023

Hi @batterseapower

Yes, you are right. The code below will remove the warning:
x = pd.Series(index=['a', 'b'], dtype='float32')

print(x.dtype)

@batterseapower (Contributor, Author)

@hvsesha I think maybe your comment got mangled. I'm not sure exactly what you are suggesting, but I don't see that changing the dtype from "f4" to "float32" makes any difference.

I am suppressing the warning in my own code by adding a np.float32() cast around my literals, but it doesn't seem like good behaviour to warn about this or break this kind of code in a future Pandas version.

@csotto

csotto commented Oct 25, 2023

I am having the same problem as @batterseapower.
In my case I am updating a DataFrame row (index dtype int64, all columns dtype float32) from a list.
Given:
result=[1698262980000, 0.861, 0.861, 0.857, 0.858, 19336.7, 0.0, 0.0, 8295.444, 0.0, 8295.444, 0.0]
df.loc[result[0], "open" : "closed"] = result[1:] # 'open' <=> first column, 'closed' <=> last column to be updated

This will match the index row and print a deprecation warning message for each numeric value except 0.0 (!?), as follows:
FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.861' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.

I am using pandas v2.1.1. My expected behaviour is exactly the same as stated by @batterseapower.
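(One way to avoid the warning in this pattern is to cast the right-hand side up front, following the explicit-cast workaround above; df and result here are from the snippet above:)

import numpy as np

# cast the list to float32 before assignment, so every value already matches
# the column dtype and no per-value upcast check is triggered
df.loc[result[0], "open":"closed"] = np.asarray(result[1:], dtype="float32")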

@rhshadrach (Member)

cc @MarcoGorelli

@rhshadrach rhshadrach added the Dtype Conversions label Oct 26, 2023
@MarcoGorelli MarcoGorelli changed the title BUG: "Setting an item of incompatiable dtype" warning for float32 literal assignment BUG: setting 1.02 into float32 Series upcasts to float64 Oct 26, 2023
@MarcoGorelli (Member)

thanks for the report

I think this is a case where the warning is correct (the value being set is indeed changing the dtype), but it has uncovered a related bug (the dtype shouldn't be changing in this case):

In [4]: x = pd.Series(index=['a', 'b'], dtype='float32')

In [5]: x
Out[5]:
a   NaN
b   NaN
dtype: float32

In [6]: x.iloc[0] = .02
<ipython-input-6-5b8507309c35>:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.02' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  x.iloc[0] = .02

In [7]: x.dtype
Out[7]: dtype('float64')

Looks like, for nullable dtypes, the upcasting doesn't happen:

In [8]: x = pd.Series(index=['a', 'b'], dtype='Float32')

In [9]: x.iloc[0] = .02

In [10]: x
Out[10]:
a    0.02
b    <NA>
dtype: Float32

@jorisvandenbossche jorisvandenbossche removed the Needs Triage label Oct 26, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.1.3 milestone Oct 26, 2023
@rhshadrach (Member)

We end up with float64 because of np_can_hold_element:

if lib.is_integer(element) or lib.is_float(element):
    casted = dtype.type(element)
    if np.isnan(casted) or casted == element:
        return casted
    # otherwise e.g. overflow see TestCoercionFloat32
    raise LossySetitemError

The check casted == element fails there for anything that doesn't have a clean binary representation. E.g.

x = pd.Series(index=['a', 'b'], dtype='float32')
x.iloc[0] = 0.5
print(x.dtype)  # float32
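The comparison driving the decision can be seen directly (illustrative, plain numpy):

import numpy as np

# 0.5 has an exact binary representation, so the float32 round-trip is lossless
print(np.float32(0.5) == 0.5)    # True  -> value is set, dtype stays float32
# 1.02 does not; its float32 approximation differs from the float64 one
print(np.float32(1.02) == 1.02)  # False -> pandas upcasts to float64 (and warns)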

@mroeschke (Member)

I think in np_can_hold_element we should be able to check if the value is in the np.finfo bounds instead of using casted == element
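A minimal sketch of that idea (hypothetical, not pandas' actual code):

import numpy as np

def fits_float32_bounds(value: float) -> bool:
    # bounds-only check: is the value within float32's representable range?
    info = np.finfo(np.float32)
    return bool(info.min <= value <= info.max)

print(fits_float32_bounds(1.02))   # True  -> 1.02 would be kept as float32
print(fits_float32_bounds(1e300))  # False -> would still require an upcast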

@MarcoGorelli (Member)

I think in np_can_hold_element we should be able to check if the value is in the np.finfo bounds instead of using casted == element

The issue is that if we only check the bounds then we get these kinds of results:

In [9]: ser = Series([1.1, 2.2, 3.3, 4.4], dtype=np.float32)

In [10]: ser[0] = np.uint32(np.iinfo(np.uint32).max)

In [11]: ser[0]
Out[11]: 4294967300.0

In [12]: np.uint32(np.iinfo(np.uint32).max)
Out[12]: 4294967295

To be honest I think the warning (and future error) is correct here, as float32 can't hold 0.2:

In [28]: np.float32(.2).item()
Out[28]: 0.20000000298023224

In [29]: np.float64(.2).item()
Out[29]: 0.2

@MarcoGorelli MarcoGorelli added the Closing Candidate label Nov 1, 2023
@MarcoGorelli MarcoGorelli removed this from the 2.1.3 milestone Nov 1, 2023
@rhshadrach (Member)

To be honest I think the warning (and future error) is correct here, as float32 can't hold 0.2:

float64 also cannot hold 0.2. A number can only be represented exactly if it can be expressed as M * 2**e for integers M and e (with certain bounds on M and e), but for 0.2 that would require M = (1/5) * 2**-e, which is never an integer.
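You can inspect the binary fraction that is actually stored using only the standard library:

from fractions import Fraction

# the exact value the float64 literal 0.2 stores: M * 2**-54 for integer M, not 1/5
print(Fraction(0.2))               # 3602879701896397/18014398509481984
print(18014398509481984 == 2**54)  # True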

@batterseapower (Contributor, Author)

I really think the correct behaviour would be for pandas to stop changing the dtype from f4 to f8, and simply assign to the existing array. This is:

  • Less surprising for Numpy users: I think we can agree it would be really weird if np.zeros(1, dtype='f4')[0] = 1.02 threw an error
  • It's how Pandas itself used to work pre-1.5, when the behaviour seems to have been changed by accident

And it's not normal for Pandas to be fussy about small losses of precision, e.g. this prints -1:

# 9007199254740993 == 2**53 + 1, the first integer a float64 cannot represent;
# concat upcasts i8 to f8, silently rounding it down to 9007199254740992
np.int64(pd.concat([
    pd.Series([9007199254740993], dtype='i8'),
    pd.Series([1.5], dtype='f8'),
]).iloc[0]) - 9007199254740993

@MarcoGorelli (Member)

thanks for your inputs - going to cc @jbrockmendel then, as he introduced the casted == element check

@jbrockmendel (Member)

Discussed on yesterday's dev call. The conclusion seemed to be to keep the current behavior: the policy is that we coerce the value when that can be done losslessly, and otherwise we upcast the target (which is deprecated and will raise in the future). There was very little support for the idea of special-casing this policy for Python floats.

@batterseapower (Contributor, Author)

Up to you I guess! It just seems strange that Pandas is going to be fussy about this small loss of precision, which is completely uncontroversial in numpy, while other parts of the codebase are happy to make much more dangerous conversions. I gave an example above with concat, but this even affects primitives like addition:

# int32 addition overflows and wraps around to -2**31, so the sum compares
# as smaller than either operand:
assert (
    pd.Series([2**31 - 1], dtype='i4') + pd.Series([1], dtype='i4')
    <
    pd.Series([2**31 - 1], dtype='i4')
).all()

@jbrockmendel (Member)

w/r/t the numpy behavior referenced in #55679 (comment), the pandas behavior in __setitem__ was designed (I think, it was before my time) in response to complaints about setting a float into an integer array silently coercing.

but this even affects primitives like addition

We catch this in dt64/td64 operations but not in integer (or float) ops. I think with pyarrow dtypes some of these raise rather than silently overflow. I'd be on board with changing our behavior to not silently overflow if the performance impact isn't too bad.

pd.concat([pd.Series([9007199254740993], dtype='i8'), pd.Series([1.5], dtype='f8')])

That's a genuinely hard problem. To avoid the lossy casting there we would need to check for too-big ints and choose some other dtype (object? float128? or just raise?). The non-raising options would constitute "value-dependent behavior" that we have generally tried to avoid (and are trying to move away from in places where it still exists). But there's a decent case to be made for being stricter about non-lossiness. IIRC Index setops use slightly different casting logic to avoid lossiness.

@jorisvandenbossche (Member)

While consistency is a good reason not to special-case floats, I think there are also very good reasons to deviate in this case (and floating-point data are in themselves already one big special case, with float-specific semantics ...).

To be honest I think the warning (and future error) is correct here, as float32 can't hold 0.2:

float64 also cannot hold 0.2.

Indeed. Both float64 and float32 can hold 0.2, or both cannot hold it, depending on what you mean by "hold". Both can represent a value that is "relatively close to" 0.2, and neither can represent the decimal 0.2 exactly. It's just that for float64 this "close to" is even closer than for float32.

The reason that you see something as confusing as:

In [28]: np.float32(.2).item()
Out[28]: 0.20000000298023224

In [29]: np.float64(.2).item()
Out[29]: 0.2

seemingly indicating that float64 can hold 0.2 and float32 cannot, is just specific to the repr, and to the fact that a Python float (which is what item() returns) is a float64 under the hood (a C double).
When "printing" a float, there are many ways to do this. You can print the most exact decimal representation with many digits, or you can print something shorter for human display. For example for float64, the decimal 0.2 is approximated by a float64 value that is also the approximation of any decimal value between 0.19999999999999998 and 0.20000000000000004. And so the Python repr decides the use the "shortest" possible repr of values in that range, and thus shows "0.2" (Python actually didn't always do that in the past, and https://docs.python.org/3/tutorial/floatingpoint.html explains this for the value of "0.1", quoting: Just remember, even though the printed result looks like the exact value of 1/10, the actual stored value is the nearest representable binary fraction.)

When converting the float32 to a Python float (float64) as in the example above, Python will now show more digits, because the nearest approximation of 0.2 in float32 is farther away from the decimal 0.2, and once converted to float64 it falls outside the +/- epsilon range around 0.2, and thus doesn't show as "0.2".
But when printing a float32 the same way as a float64, using the same rule of showing the shortest repr given the precision of the specific type, as numpy does, you see that numpy prints both as "0.2":

In [67]: np.float32(0.2)
Out[67]: 0.2

In [68]: np.float64(0.2)
Out[68]: 0.2

A long wall of text to say that "0.2" is neither a float32 nor a float64 value, so should we then allow setting it for float64 and not for float32?
A reason to make this distinction could be that the user is setting a Python float, which is actually a float64. So based on that knowledge, we could assume the user is setting a float64. And then the common type for float32 and float64 is float64.

However, when a user executes something like ser[0] = 0.2, I don't think they typically mean "set a float64 into my series", but rather "set the value that represents 0.2 into my series".
And in that sense, I fully agree with what @batterseapower wrote in the top post:

The warning can of course be suppressed by writing np.float32(1.02) instead of the simple literal, but it doesn't seem great to require users to write this kind of stuff for every literal assignment.

I think the user intention when they write the characters 0.2 is typically to set that value in the series, regardless of float32 vs float64 concerns, and so requiring the user to be more specific by setting np.float32(0.2) is cumbersome and unnecessary.


Note that the current behaviour also means you can't even assign the value as it is shown in the repr for float32:

In [2]: ser = Series([0.1, 0.2, 0.3], dtype="float32")

In [3]: ser
Out[3]: 
0    0.1
1    0.2
2    0.3
dtype: float32

In [4]: ser[0] = 0.1   # <-- warns, although "0.1" was copied out of the repr
<ipython-input-5-7f9596758898>:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.1' has dtype incompatible with float32, please explicitly cast to a compatible dty

@jorisvandenbossche (Member)

jorisvandenbossche commented Nov 13, 2023

Apologies in advance for the long wall of text that is coming, but some additional related points, summarized as:

  1. For setitem operations, we should not use exact equality to determine if float32 and float64 values are equivalent (i.e. don't need an upcast, and thus not warn/error in the future), because exact equality for floats generally doesn't really work.
  2. Numpy will start treating Python scalars as "weakly" typed, and if we follow that logic, in an operation like ser[0] = 0.2 we can also consider this "0.2" as a weak float-like value instead of a float64 value. And then, when setting it into a float32 Series, we can simply coerce that Python scalar to the target dtype float32, instead of asking ourselves whether the float64 value of 0.2 can be set or not.

Full version:

1. How do we decide when a setitem should upcast or not?
In hindsight, I think the PDEP is not super clear on this (https://pandas.pydata.org/pdeps/0006-ban-upcasting.html). It essentially ties the decision to the current behaviour ("If a setitem-like operation would previously have changed a Series' dtype, it would now raise."), but the current behaviour is not always correct (or can upcast too liberally for a world where we ban upcasting in setitem).

I think in general the idea should be that if the RHS can be cast "safely" to the dtype of the values that we are setting into, then we do this cast and don't upcast and warn (error in the future). This is related to the discussion on "safe" casting for astype/constructors: #45588.
(although that is not exactly the rule we are using right now, because casting a string to float can probably be considered "safe" if the string values actually represent floats, but that's a case where we currently upcast to object, and thus warn that this will error in the future).

(EDIT: opened an issue for this more general discussion: #55935)

But if we assume for a moment that this casting safety is the criterion for numeric data, then the question is: how do we determine this in the case of floats?

For casting floats to integers, there is a nice logic to determine whether the cast can be done safely, i.e. if the roundtrip still evaluates as equal:

>>> val = np.float64(1.0)
>>> val_casted = np.int64(val)
>>> val == np.float64(val_casted)
True

This is more or less what we use in np_can_hold_element for float element in integer dtype (the snippet linked above), and that is for example also what Arrow uses for checking whether the float->int cast was safe.

However, while this works for ints, equality is a bit more tricky for different float dtypes. In general, the "same" decimal number represented as float32 and float64 does not compare equal:

>>> np.float32(0.2) == np.float64(0.2)
False

Only for some very specific values, where all the extra bits of the float64 are 0 relative to the float32 representation, can a float32 and a float64 represent the exact same approximation and compare equal. Example (from https://stackoverflow.com/a/44922312/653364):

>>> np.float32(58682.7578125) == np.float64(58682.7578125)
True
# changing the last decimal from 5 to 6 for both left/right value -> no longer "equal"
>>> np.float32(58682.7578126) == np.float64(58682.7578126)
False

So in general, equality doesn't really work between float32 and float64 values (the float32 gets cast to float64 for the comparison, and its value differs from the float64 approximation). Thus, if we use the same logic as for a "float -> int" cast to determine if we can safely cast a float64 value to float32, that would almost always be considered unsafe.
Numpy also considers casts between float32 and float64 as not "safe" (astype casting keyword allows "same_kind" (one level stricter than "unsafe") for such casts, but not "safe"). As another example, Arrow generally has safe casting by default (and there options to turn off some safety checks), but doesn't actually check anything for casts between float types (it only checks truncation for int<->float casts).
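For example (plain numpy, illustrating the casting levels just mentioned):

import numpy as np

arr = np.array([0.2])                        # float64
arr.astype(np.float32, casting="same_kind")  # allowed: same kind, lower precision
try:
    arr.astype(np.float32, casting="safe")   # float64 -> float32 is not "safe"
except TypeError as exc:
    print(exc)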

When comparing floats robustly, you generally need to use something like np.isclose with a given tolerance.
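For instance (nothing pandas-specific here, just numpy):

import numpy as np

print(np.float32(0.2) == np.float64(0.2))            # False: exact equality fails
print(np.isclose(np.float32(0.2), np.float64(0.2)))  # True with the default rtol=1e-05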

Also, for array equality, we do consider float32/float64 values as equal. Using the example from the previous post where setting 0.2 would raise a warning, comparing equality with the same value is fine:

>>> ser = Series([0.1, 0.2, 0.3], dtype="float32")
>>> ser == 0.2
0    False
1     True
2    False
dtype: bool

So ser == 0.2 is fine, but ser[0] = 0.2 is not fine?

2. NumPy will start considering Python scalars as "weakly" typed with NEP 50
One of the reasons we now raise a warning is because in the operation of ser[0] = 0.2, we consider "0.2" to be a float64 (that's how numpy infers the type of 0.2, if there is no other context).

However, in the future numpy will change this a bit with NEP 50 (Promotion rules for Python scalars): Python scalars like float will no longer be directly coerced to float64, but will be interpreted depending on the context (i.e. the dtype of the other operand in the operation, if that is a numpy-typed object, takes precedence).

Taking the example of equality ser == 0.2: in current numpy, this 0.2 is coerced to a float64 scalar, but then the == handles the scalar differently and gives precedence to the dtype of the array, i.e. float32. So essentially what happens is:

>>> arr = np.array([0.1, 0.2, 0.3], dtype="float32")
>>> arr == 0.2
array([False,  True, False])
# this translates more or less to:
>>> arr == np.float64(0.2).astype("float32")
array([False,  True, False])

With the new rules, the float64 scalar will no longer be special-cased, and arr[float32] == scalar[float64] will "correctly" cast the operands to the higher precision of float64, resulting in non-equality:

>>> np._set_promotion_state("weak")
>>> arr == np.float64(0.2)
array([False, False, False])

but comparing with the Python scalar will still work as expected, because that one, being weakly typed, is now cast to float32:

>>> arr == 0.2
array([False,  True, False])
# this now translates more or less to:
>>> arr == np.float32(0.2)
array([False,  True, False])

While numpy does not use this for setitem, I think the fact that numpy will start considering Python scalars as weakly typed when used as operands in ufuncs is a good reason for us to also consider them "weakly" typed in setitem operations.

In my post just above, I wrote:

... the user is setting a Python float, and this is actually a float64. So based on that knowledge, we could assume the user is setting a float64. And then the common type for float32 and float64 is float64.

When using the rule of "weakly" typed Python built-in scalars, the above argument no longer applies (apart from the other reasons I mentioned in that post for why I think we should not follow it, e.g. because that's typically not the intention of the user).
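A minimal sketch of what that setitem rule could look like (hypothetical, not pandas internals):

import numpy as np

def set_item_weak(arr: np.ndarray, i: int, value) -> None:
    # Python float scalars are "weak": coerce to the array's own dtype,
    # so no upcast question ever arises
    if isinstance(value, float) and arr.dtype.kind == "f":
        arr[i] = arr.dtype.type(value)
    else:
        # typed scalars (e.g. np.float64(0.2)) would still go through the strict check
        arr[i] = value

arr = np.zeros(2, dtype="float32")
set_item_weak(arr, 0, 0.2)
print(arr.dtype, arr[0])  # float32 0.2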

@MarcoGorelli (Member)

And then when setting it into a float32 Series, we can simply coerce that Python scalar to the target dtype float32, instead of asking ourselves if the float64 value of 0.2 can be set or not.

sounds good!

@MarcoGorelli MarcoGorelli removed the Closing Candidate label Nov 23, 2023
@lithomas1 (Member)

Looks like with numpy 2.0, the issue fixes itself

(I think they might have relaxed their tolerances a little).

>>> import pandas as pd
>>> import numpy as np
>>> from pandas.core.dtypes.cast import np_can_hold_element
>>> np.__version__
'2.0.0.dev0+git20231124.94bc564'
>>> np_can_hold_element(np.dtype("float32"), 1.02)
np.float32(1.02)

Do we still want to patch this, or are we fine letting this resolve naturally?

@lithomas1 (Member)

I guess if we wanted to patch this, we'd have to figure out what tolerances numpy is using.

@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 10, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.3, 2.3 Sep 21, 2024