BUG: Index.get_indexer will change behaviour for nulls with arrow strings #55833

phofl · 2023-11-05T14:43:19Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

idx = pd.Index(["a", "b", None], dtype="string[pyarrow_numpy]")
idx.get_indexer([None])

Issue Description

This returns -1 for the new arrow string dtype because we cast None to np.nan when creating the array, whereas object dtype keeps the None and finds it. We wanted to be as close to object dtype as possible, but patching this does not seem ideal. Thoughts?
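For comparison, the object-dtype side of this can be checked with a short script. This is only a sketch: it uses a plain object-dtype Index (no pyarrow needed), and the variable names are illustrative.

```python
import pandas as pd

# Object dtype stores None as-is, so looking up None finds its position.
obj_idx = pd.Index(["a", "b", None], dtype=object)
obj_result = obj_idx.get_indexer([None])
print(obj_result)
```

The position reported here is what code migrating from object dtype would be relying on.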

cc @jorisvandenbossche

Expected Behavior

see above

Installed Versions

Replace this line with the output of pd.show_versions()

jorisvandenbossche · 2023-11-06T09:38:58Z

In general (for non-object dtypes), it seems we are strict about which NA-like values we accept here:

In [7]: idx = pd.Index([0.1, 0.2, None], dtype="float64")

In [8]: idx.get_indexer([np.nan])
Out[8]: array([2])

In [9]: idx.get_indexer([None])
Out[9]: array([-1])

But I agree that for the migration from object dtype, we should maybe be more flexible here, and also find None values.

Even then, it will not exactly preserve behaviour, because right now an object dtype Index can hold both None and NaN (and get_indexer finds them separately), and those will both become NaN, letting get_indexer find both. No way around that.
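The object-dtype behaviour described above can be illustrated with a small sketch. It assumes the lookup targets are passed as object-dtype Index objects so that neither sentinel is converted before the lookup; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# An object-dtype Index can store None and NaN as two distinct entries.
obj_idx = pd.Index(["a", None, np.nan], dtype=object)

# Object-dtype targets keep each sentinel as-is for the lookup.
none_pos = obj_idx.get_indexer(pd.Index([None], dtype=object))
nan_pos = obj_idx.get_indexer(pd.Index([np.nan], dtype=object))
print(none_pos, nan_pos)

# Both entries count as missing, which is why a NaN-based dtype
# cannot keep them apart once the values are converted.
missing_mask = obj_idx.isna()
print(missing_mask)
```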

Dhayanidhi-M · 2023-11-07T07:41:18Z

Hi,

If None values in the data are converted to np.nan, then get_indexer should also convert None in the target to np.nan before searching.
In base.py we have the get_indexer function; adding the following lines of code might help:

for i, v in enumerate(target):
    # Identity check: "is None" is the idiomatic test for None
    if v is None:
        target[i] = np.nan
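The same idea can be tried from user code without touching base.py or mutating the caller's list. The sketch below is a user-side workaround only, not a pandas change; `normalize_na_target` is a hypothetical helper name, and the float64 Index is borrowed from the earlier comment so the example does not depend on pyarrow.

```python
import numpy as np
import pandas as pd


def normalize_na_target(target):
    """Return a new list with None replaced by np.nan (hypothetical helper)."""
    return [np.nan if value is None else value for value in target]


idx = pd.Index([0.1, 0.2, None], dtype="float64")

# Without normalization the None lookup misses; with it, the NaN slot is found.
raw_result = idx.get_indexer([None])
normalized_result = idx.get_indexer(normalize_na_target([None]))
print(raw_result, normalized_result)
```

Building a new list keeps the input untouched, which matters if the target is reused elsewhere.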

hvsesha · 2023-11-07T07:43:33Z

@dhayanidhi
It works after I made this change to the get_indexer function in base.py.

hvsesha · 2023-11-07T07:44:19Z

@phofl Kindly check from your side

phofl · 2023-11-07T13:16:42Z

That needs a fix at a different place. This is probably not a good issue to get started with; I would recommend that you look for issues that are labeled Good first issue.

phofl added the Bug, Needs Triage, Strings, Arrow, and API Design labels and removed the Needs Triage and Bug labels Nov 5, 2023

jorisvandenbossche mentioned this issue Jul 29, 2024

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

41 tasks

jorisvandenbossche added this to the 2.3 milestone Nov 7, 2024
