
BUG: Float64Dtype and isnull() functions unexpected behavior #53887

Open · 3 tasks done
jayantsahewal opened this issue Jun 27, 2023 · 7 comments
Labels
Bug · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · NA - MaskedArrays (Related to pd.NA and nullable extension arrays) · PDEP missing values (Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint)

Comments


jayantsahewal commented Jun 27, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

def weird_behavior(df, cols):
    print(df.dtypes.tolist())
    for col in cols:
        print(df[col].isna().sum(), df[col].apply(pd.isna).sum())
        print(df[col].isnull().sum(), df[col].apply(pd.isnull).sum())
    print()

test_1_df = pd.DataFrame(data=[[1.0, 0.0], [1.0, 1], [1.0, 1], [4.0, 0], [0.0, 0]], columns=['a', 'b'], dtype=pd.Float64Dtype())
test_1_df['c'] = test_1_df['a'] / test_1_df['b']
test_1_df['d'] = float('inf')
test_1_df['e'] = np.nan
test_2_df = test_1_df.astype(pd.Float64Dtype())
test_3_df = test_1_df.astype(np.float64)
cols = ['c', 'd', 'e']
weird_behavior(test_1_df, cols)
weird_behavior(test_2_df, cols)
weird_behavior(test_3_df, cols)

test_4_df = pd.DataFrame(data=[[1.0, 0.0], [1.0, 1], [1.0, 1], [4.0, 0], [0.0, 0]], columns=['a', 'b'])
test_4_df['c'] = test_4_df['a'] / test_4_df['b']
test_4_df['d'] = float('inf')
test_4_df['e'] = np.nan
test_5_df = test_4_df.astype(pd.Float64Dtype())
test_6_df = test_4_df.astype(np.float64)
cols = ['c', 'd', 'e']
weird_behavior(test_4_df, cols)
weird_behavior(test_5_df, cols)
weird_behavior(test_6_df, cols)

[Float64Dtype(), Float64Dtype(), Float64Dtype(), dtype('float64'), dtype('float64')]
0 1
0 1
0 0
0 0
5 5
5 5

[Float64Dtype(), Float64Dtype(), Float64Dtype(), Float64Dtype(), Float64Dtype()]
0 1
0 1
0 0
0 0
5 5
5 5

[dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64')]
1 1
1 1
0 0
0 0
5 5
5 5

[dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64')]
1 1
1 1
0 0
0 0
5 5
5 5

[Float64Dtype(), Float64Dtype(), Float64Dtype(), Float64Dtype(), Float64Dtype()]
1 1
1 1
0 0
0 0
5 5
5 5

[dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64')]
1 1
1 1
0 0
0 0
5 5
5 5


### Issue Description

Specifying `dtype=pd.Float64Dtype()` when creating a pandas DataFrame leads to inconsistent results between `series.isnull()` and `series.apply(pd.isnull)`.

If I don't specify the dtype when creating the DataFrame, `series.isnull()` and `series.apply(pd.isnull)` agree as expected.
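A minimal sketch of the same pattern, assuming scalar division by zero behaves like the column division in the reproduction above (the counts follow the output reported in this issue):

import pandas as pd

s = pd.Series([1.0, 0.0], dtype="Float64")
r = s / 0  # 1/0 -> inf, 0/0 -> a float NaN stored in the data; the NA mask stays False

print(r.isna().sum())          # 0 -- the masked Float64 dtype only consults its NA mask
print(r.apply(pd.isna).sum())  # 1 -- element-wise pd.isna sees the float NaN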

### Expected Behavior

1 1
1 1
0 0
0 0
5 5
5 5


### Installed Versions

<details>

commit           : 965ceca9fd796940050d6fc817707bba1c4f9bff
python           : 3.10.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.10.157-139.675.amzn2.x86_64
Version          : #1 SMP Thu Dec 8 01:29:11 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.2
numpy            : 1.23.4
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 65.6.3
pip              : 23.1
Cython           : 0.29.33
pytest           : 7.2.0
hypothesis       : None
sphinx           : 6.1.2
blosc            : None
feather          : None
xlsxwriter       : 3.0.6
lxml.etree       : 4.9.2
html5lib         : None
pymysql          : 1.0.3
psycopg2         : 2.9.3
jinja2           : 3.1.2
IPython          : 7.34.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : 1.3.5
brotli           : 
fastparquet      : None
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : 3.6.2
numba            : 0.56.4
numexpr          : 2.7.3
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 8.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.10.0
snappy           : None
sqlalchemy       : 1.4.46
tables           : 3.7.0
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : 2.3.0
pyqt5            : None

</details>
jayantsahewal added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Jun 27, 2023
@jbrockmendel
Member

xref #32265

@cmillani

I just ran into a related issue while using isna on a DataFrame.

With the float64 dtype, NaN is reported as missing (True):

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [5, 0], 'b': [np.nan, 12]})
df = df.astype('float64')
df['c'] = df['a'] * np.inf
print(df.isna())

Results:

       a      b      c
0  False   True  False
1  False  False   True

Simply changing types:

df = df.astype('Float64')
df['c'] = df['a'] * np.inf
print(df.isna())

changes the result for column c, row 1:

       a      b      c
0  False   True  False
1  False  False  False

We solved this by using numpy.isnan, but now that pandas supports Arrow as a backend this approach no longer seems like a good fit, and I believe consistency between Float64 and float64 would be the better choice.
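For illustration, a rough sketch of that numpy.isnan fallback, assuming the masked-dtype behavior shown above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 0], 'b': [np.nan, 12]}).astype('Float64')
df['c'] = df['a'] * np.inf            # 0 * inf -> a float NaN in the data, NA mask stays False

print(df['c'].isna().tolist())        # [False, False] -- the mask-based check misses the NaN
print(np.isnan(df['c']).tolist())     # [False, True]  -- the ufunc sees the float NaN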

I know little about the internals, but I would be happy to help if this is not the desired behavior! If it is desired, it would be nice to document it and guide users :)

@rachtsingh

@cmillani

You can fix this in your own codebase at the moment by making the following change:

import numpy as np
from pandas.core.arrays.floating import FloatingArray

# keep a reference to the original implementation
FloatingArray.oldisna = FloatingArray.isna

def newisna(self: FloatingArray) -> np.ndarray:
    # treat both the NA mask and any float NaN stored in the data as missing
    return np.isnan(self._data) | self._mask.copy()

FloatingArray.isna = newisna

For us this is part of our default pandas monkey patch module that is loaded in all code.
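A quick, hedged check of the patched behavior (the expected values follow the counts reported earlier in this thread):

import pandas as pd

s = pd.Series([1.0, 0.0], dtype="Float64") / 0   # inf and a float NaN; the NA mask stays False
print(s.isna().tolist())   # unpatched: [False, False]; with the patch applied: [False, True]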

@rachtsingh

For anyone else seeing this, it's also worth patching fillna:

import numpy as np
from pandas.core.arrays.floating import FloatingArray  # continues the patch from the previous comment
from pandas.core.arrays.masked import BaseMaskedArrayT, validate_fillna_kwargs, is_array_like, missing

FloatingArray.oldfillna = FloatingArray.fillna
def newfillna(
    self: BaseMaskedArrayT, value=None, method=None, limit=None
) -> BaseMaskedArrayT:
    value, method = validate_fillna_kwargs(value, method)

    # treat float NaN stored in the data as missing, in addition to the NA mask
    mask = (self._mask | np.isnan(self._data)).copy()

    if is_array_like(value):
        if len(value) != len(self):
            raise ValueError(
                f"Length of 'value' does not match. Got ({len(value)}) "
                f" expected {len(self)}"
            )
        value = value[mask]

    if mask.any():
        if method is not None:
            func = missing.get_fill_func(method, ndim=self.ndim)
            npvalues = self._data.copy().T
            new_mask = mask.T
            func(npvalues, limit=limit, mask=new_mask)
            return type(self)(npvalues.T, new_mask.T)
        else:
            # fill with value
            new_values = self.copy()
            new_values[mask] = value
    else:
        new_values = self.copy()
    return new_values
FloatingArray.fillna = newfillna

lithomas1 added the Missing-data and NA - MaskedArrays labels and removed the Needs Triage label on Aug 9, 2023
jbrockmendel added the PDEP missing values label on Oct 26, 2023
@DeGiorgiMarcello

Hi, I have the same problem when filling NaNs generated by a 0/0 division:

In [2]: df = pd.DataFrame({"A":[1.,0,0,pd.NA],"B":[0,0,2,1]}, dtype="Float32")

In [3]: df
Out[3]: 
      A    B
0   1.0  0.0
1   0.0  0.0
2   0.0  2.0
3  <NA>  1.0

In [4]: z = df.A /df.B

In [5]: z
Out[5]: 
0    inf
1    NaN
2    0.0
3   <NA>
dtype: Float32

In [6]: z.fillna(0.)
Out[6]: 
0    inf
1    NaN
2    0.0
3    0.0
dtype: Float32

When calling fillna, the NaN produced by 0/0 is not filled.

Casting to float32 and then back to Float32 works around the problem:

In [14]: z.astype("float32").astype("Float32").fillna(0.)
Out[14]: 
0    inf
1    0.0
2    0.0
3    0.0
dtype: Float32

@glaucouri

glaucouri commented Jan 8, 2024

Some more info on this behavior: #32265

The v2.0 changelog mentions this: https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#

Division by zero with ArrowDtype dtypes returns -inf, nan, or inf depending on the numerator, instead of raising (GH 51541)
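A small sketch of that changelog entry, assuming pyarrow is installed:

import pandas as pd

s = pd.Series([-1.0, 0.0, 1.0], dtype="float64[pyarrow]")
print((s / 0).tolist())   # roughly [-inf, nan, inf], depending on the sign of the numerator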

To unify the NaN values I've found another solution:

z.where(z == z, pd.NA)

Since NaN != NaN, this replaces every float np.nan with pd.NA.
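For instance, applied to a series like the Float32 example above (a sketch; exact reprs may differ across versions):

import pandas as pd

num = pd.Series([1.0, 0.0, 0.0, pd.NA], dtype="Float32")
den = pd.Series([0, 0, 2, 1], dtype="Float32")
z = num / den                  # inf, NaN, 0.0, <NA>
z = z.where(z == z, pd.NA)     # NaN != NaN, so the float NaN becomes <NA>
print(z.fillna(0.0))           # now the 0/0 position is filled as well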

@arobrien

This has tripped me up too, and I consider it currently broken with the pyarrow backend.

My current workarounds are to use df.astype(float).isna() or np.isnan(df), which give the expected behaviour.

Simple reproduction:

import numpy as np
import pandas as pd

a = pd.DataFrame({'a': [1.0, 0.0, np.nan]})
b = a.convert_dtypes(dtype_backend='pyarrow')

# inconsistent behaviour for 0/0 but consistent for nan/0
display((a / 0).isna())
display((b / 0).isna())

# consistent behaviour 0/0 and nan/0 both yield True.
display(np.isnan(a / 0))
display(np.isnan(b / 0))
