BUG: setting 1.02 into float32 Series upcasts to float64 #55679
Yes, you are right. The code below will remove the warning.
@hvsesha I think maybe your comment got mangled. I'm not sure exactly what you are suggesting, but I don't see that changing the dtype from "f4" to "float32" makes any difference. I am suppressing the warning in my own code by adding a
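(The comment above is cut off; for reference, here is a sketch of one common way to avoid the warning, an explicit cast to the Series' own dtype. This may or may not be what the commenter meant.)

import numpy as np
import pandas as pd

ser = pd.Series(index=["a", "b"], dtype="float32")
# Casting the literal to the Series' dtype first avoids the
# incompatible-dtype warning: the value then matches float32 exactly.
ser.iloc[0] = np.float32(1.02)
print(ser.dtype)  # float32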
I am having the same problem as @batterseapower. It will match the index column and print a deprecation warning message for each numeric value except 0.0 (!?), as follows:
Thanks for the report. I think this is a case where the warning is correct (the value being set is indeed changing the dtype), but it has uncovered a related bug (the dtype shouldn't be changing in this case):

In [4]: x = pd.Series(index=['a', 'b'], dtype='float32')
In [5]: x
Out[5]:
a NaN
b NaN
dtype: float32
In [6]: x.iloc[0] = .02
<ipython-input-6-5b8507309c35>:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.02' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
x.iloc[0] = .02
In [7]: x.dtype
Out[7]: dtype('float64')

Looks like, for nullable dtypes, the upcasting doesn't happen:

In [8]: x = pd.Series(index=['a', 'b'], dtype='Float32')
In [9]: x.iloc[0] = .02
In [10]: x
Out[10]:
a 0.02
b <NA>
dtype: Float32
We end up with float64 because of pandas/core/dtypes/cast.py, lines 1829 to 1834 (at commit 0cbe41a).

The check
I think in np_can_hold_element we should be able to check if the value is in the
The issue is that if we only check the bounds, then we get these kinds of results:

In [9]: ser = Series([1.1, 2.2, 3.3, 4.4], dtype=np.float32)
In [10]: ser[0] = np.uint32(np.iinfo(np.uint32).max)
In [11]: ser[0]
Out[11]: 4294967300.0
In [12]: np.uint32(np.iinfo(np.uint32).max)
Out[12]: 4294967295

To be honest, I think the warning (and future error) is correct here, as float32 can't hold 0.2:

In [28]: np.float32(.2).item()
Out[28]: 0.20000000298023224
In [29]: np.float64(.2).item()
Out[29]: 0.2
float64 also cannot hold 0.2. A number can only be represented exactly if it can be expressed as m * 2^e for integers m and e (with m fitting in the mantissa).
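(To make that concrete, a small sketch of mine, not from the thread: the exact fractions that float32 and float64 actually store for the literal 0.2, versus 0.25, which is exactly of the form m * 2^e.)

from fractions import Fraction
import numpy as np

# Exact values stored for the literal 0.2 at each precision -- both are
# approximations, and neither equals the decimal 0.2:
print(Fraction(np.float32(0.2).item()))  # 13421773/67108864 = 13421773 * 2**-26
print(Fraction(np.float64(0.2)))         # 3602879701896397/18014398509481984
# 0.25 is exactly representable (1 * 2**-2), so every float dtype holds it:
print(Fraction(0.25))                    # 1/4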
I really think the correct behaviour would be for pandas to stop changing the dtype from f4 to f8, and simply assign to the existing array. This is:
And it's not normal for Pandas to be fussy about small losses of precision, e.g. this prints -1:
Thanks for your inputs - going to cc @jbrockmendel then, as he introduced the
Discussed on yesterday's dev call. The conclusion seemed to be keeping the current behavior: the policy is that we coerce the value when doing so can be done losslessly; otherwise we cast the target (which is deprecated and will raise in the future). There was very little support for the idea of special-casing this policy for Python floats.
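(To illustrate the stated policy with integers -- my sketch; the exact resulting dtype may vary by pandas version:)

import pandas as pd

ser = pd.Series([1, 2, 3], dtype="int8")
ser.iloc[0] = 5      # 5 fits in int8 -> coerced losslessly, dtype stays int8
print(ser.dtype)     # int8
ser.iloc[0] = 1000   # does not fit in int8 -> target is upcast, with a FutureWarning
print(ser.dtype)     # a wider integer dtype (e.g. int16)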
Up to you I guess! It just seems strange that Pandas is going to be fussy about this small loss of precision, which is completely uncontroversial in numpy, while other parts of the codebase are happy to make much more dangerous conversions. I gave an example above with
w/r/t the numpy behavior referenced in #55679 (comment), the pandas behavior in
We catch this in dt64/td64 operations but not in integer (or float) ops. I think with pyarrow dtypes some of these raise rather than silently overflow. I'd be on board with changing our behavior to not silently overflow if the performance impact isn't too bad.
That's a genuinely hard problem. To avoid the lossy casting there we would need to check for too-big ints and choose some other dtype (object? float128? or just raise?). This (at least the non-raising options) would constitute "value dependent behavior" that we have generally tried to avoid (and are trying to move away from in places where it still exists). But there's a decent case to be made for being stricter about non-lossiness. IIRC Index setops use slightly different casting logic to avoid lossiness.
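(For example, the silent lossiness mentioned for large integers -- a sketch:)

import numpy as np

big = np.int64(2**63 - 1)
as_float = np.float64(big)        # rounds to 9223372036854775808.0
print(int(as_float) == int(big))  # False: the int64 -> float64 cast lost precision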
While consistency is a good reason not to want to special-case floats, I think there are also very good reasons to deviate in this case (and floating-point data are already one big special case in themselves, with their own floating-point-specific semantics...).
Indeed. Both float64 and float32 can hold 0.2, or both cannot hold it, depending on what you mean by "hold". Both can represent a value that is "relatively close to" 0.2, and neither can represent the decimal 0.2 exactly; it's just that for float64 this "close to" is even closer than for float32. The reason that you see something confusing like:

In [28]: np.float32(.2).item()
Out[28]: 0.20000000298023224
In [29]: np.float64(.2).item()
Out[29]: 0.2

seemingly indicating that float64 can hold 0.2 and float32 cannot, is just specific to the repr. When converting the float32 to a Python float (float64) as in the example above, Python will now show more digits, because the nearest approximation of 0.2 in float32 is farther away from the decimal 0.2, and once converted to float64 it falls outside the range of +/- epsilon around 0.2 for which the repr would show 0.2. Within each dtype's own repr, both display as 0.2:

In [67]: np.float32(0.2)
Out[67]: 0.2
In [68]: np.float64(0.2)
Out[68]: 0.2

A long wall of text to say that "0.2" is neither a float32 nor a float64 value, so should we then allow setting it for float64 and not for float32? However, when a user executes something like
I think the user intention when they write the characters

Note that the current behaviour also means you can't even assign the value as it is shown in the repr for float32:

In [2]: ser = Series([0.1, 0.2, 0.3], dtype="float32")
In [3]: ser
Out[3]:
0 0.1
1 0.2
2 0.3
dtype: float32
In [4]: ser[0] = 0.1 # <-- warns, although "0.1" was copied out of the repr
<ipython-input-5-7f9596758898>:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.1' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
Apologies in advance for the long wall of text that is coming, but some additional related points, summarized as:
Full version:

1. How to decide when a setitem would upcast or not?

I think in general the idea should be that if the RHS can be cast "safely" to the dtype of the values that we are setting into, then we do this cast and don't upcast and warn (error in the future). This is related to the discussion on "safe" casting for astype/constructors: #45588. (EDIT: opened an issue for this more general discussion: #55935)

But if we assume for a moment that this casting safety is the criterion for numeric data, then the question is: how do we determine this in the case of floats? For casting floats to integers, there is a nice logic to determine whether the cast can be done safely, i.e. whether the roundtrip still evaluates as equal:

>>> val = np.float64(1.0)
>>> val_casted = np.int64(val)
>>> val == np.float64(val_casted)
True

This is more or less what we use in

However, while this works for ints, equality is a bit more tricky for different float dtypes. In general, the "same" decimal number represented as float32 or float64 almost never compares equal:

>>> np.float32(0.2) == np.float64(0.2)
False

Only for some very specific values, where all the extra bits of the float64 are 0 compared to the float32 representation, can a float32 and a float64 represent the exact same approximation and compare equal. Example (from https://stackoverflow.com/a/44922312/653364):

>>> np.float32(58682.7578125) == np.float64(58682.7578125)
True
# changing the last decimal from 5 to 6 for both left/right value -> no longer "equal"
>>> np.float32(58682.7578126) == np.float64(58682.7578126)
False

So in general, equality doesn't really work between float32 and float64 values (the float32 gets cast to float64, and that changes the precision). Thus, if we use the same logic as for a "float -> int" cast to determine whether we can safely cast a float64 value to float32, that would almost always be considered unsafe. When comparing floats robustly, you generally need to use something like

Also, for array equality, we do consider float32/float64 values as equal. Using the example from the previous post where setting 0.2 would raise a warning, comparing equality with the same value is fine:
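(The snippets referenced above are missing from this copy; here is a sketch of the kind of tolerance-based comparison meant, with tolerances of my own choosing:)

import math
import numpy as np

f32 = np.float32(0.2).item()                         # 0.20000000298023224 as a Python float
print(f32 == 0.2)                                    # False: exact equality across precisions fails
print(math.isclose(f32, 0.2, rel_tol=1.2e-7))        # True: within float32's epsilon (~1.19e-7)
print(np.isclose(np.float32(0.2), np.float64(0.2)))  # True: default rtol/atol absorb the gap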
So

2. NumPy will start considering Python scalars as "weakly" typed with NEP 50

However, in the future, numpy will change this a bit with NEP 50 (Promotion rules for Python scalars), and Python scalars like

Taking the example of equality:

>>> arr = np.array([0.1, 0.2, 0.3], dtype="float32")
>>> arr == 0.2
array([False, True, False])
# this translates more or less to:
>>> arr == np.float64(0.2).astype("float32")
array([False, True, False])

With the new rules, the float64 scalar will no longer be special-cased, and

>>> np._set_promotion_state("weak")
>>> arr == np.float64(0.2)
array([False, False, False]) but comparing with the Python scalar will still work as expected, because that one is now, as weakly typed, cast to float32: >>>arr == 0.2
array([False, True, False])
# this now translates more or less to:
>>> arr == np.float32(0.2)
array([False, True, False])

While numpy does not use this for setitem, I think the fact that numpy will start considering Python scalars as weakly typed when used as operands in ufuncs is a good reason for us to also consider them as "weakly" typed in setitem operations.

In my post just above, I wrote:
When using the rule of "weakly" typed Python built-in scalars, the above argument doesn't apply anymore (apart from the other reasons I mentioned in that post for why I think we should not follow it, e.g. because that's typically not the intention of the user).
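(A minimal sketch of what "weakly typed in setitem" could look like as a decision rule -- a hypothetical helper of mine, not pandas code:)

import numpy as np

def should_upcast(dtype: np.dtype, value) -> bool:
    # Under the "weak" rule, a Python float literal never forces a float
    # column to upcast: it is simply interpreted at the column's precision.
    if type(value) is float and dtype.kind == "f":
        return False
    # For explicitly typed scalars, keep the lossless round-trip criterion.
    original = np.asarray(value)
    return bool(original.astype(dtype) != original)

print(should_upcast(np.dtype("float32"), 0.2))              # False: weak literal
print(should_upcast(np.dtype("float32"), np.float64(0.2)))  # True: explicit float64 is lossy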
Sounds good!
Looks like with numpy 2.0, the issue fixes itself (I think they might have relaxed their tolerances a little).
Do we still want to patch this, or are we fine letting this resolve naturally?
I guess if we wanted to patch this, we'd have to figure out what tolerances numpy is using.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
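(The snippet itself is missing from this copy; going by the title and description, it was presumably along these lines:)

import pandas as pd

ser = pd.Series([1.0], dtype="float32")
ser.iloc[0] = 1.02   # FutureWarning: Setting an item of incompatible dtype ...
print(ser.dtype)     # float64 -- upcast from float32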
Issue Description
In 2.1.1, this code prints the following warning:
It then shows that the post-assignment dtype is float64, which is presumably why the warning is being raised: Pandas is converting the backing array from f4 to f8 behind the scenes for some reason.

Do you really intend this to raise an exception in the future? It seems like it would be better to just assign the literal to the existing float32 array. I checked what earlier versions of Pandas did, and it looks like Pandas 1.4.4 kept the dtype as float32 after the assignment, while Pandas 1.5.2 changed the dtype to float64. I'm not sure whether that was intentional or not - it doesn't seem to be explicitly called out in the release notes for e.g. 1.5.0 (though it's possible I missed something).
The warning can of course be suppressed by writing np.float32(1.02) instead of the simple literal, but it doesn't seem great to require users to write this kind of stuff for every literal assignment.

Expected Behavior
Just assign the literal to the float32 array with no warning and no exception about dtype conversion.
Installed Versions
pandas : 2.1.1
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.0
pytest : 7.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.13.2
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.6.0
scipy : 1.10.1
sqlalchemy : None
tables : None
tabulate : None
xarray : 2023.4.2
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None