BUG: Index.get_indexer will change behaviour for nulls with arrow strings #55833

Open
3 tasks done
Tracked by #54792
phofl opened this issue Nov 5, 2023 · 5 comments
Labels
API Design · Arrow (pyarrow functionality) · Strings (String extension data type and string data)
Comments

@phofl
Member

phofl commented Nov 5, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

idx = pd.Index(["a", "b", None], dtype="string[pyarrow_numpy]")
idx.get_indexer([None])

Issue Description

This returns -1 for the new arrow string dtype because we cast None to np.nan when creating the array. We wanted to be as close to object dtype as possible, but patching this does not seem ideal. Thoughts?

cc @jorisvandenbossche

Expected Behavior

see above

Installed Versions

Replace this line with the output of pd.show_versions()

phofl added the Bug, Needs Triage, Strings, Arrow, and API Design labels and removed the Needs Triage and Bug labels on Nov 5, 2023
@jorisvandenbossche
Member

In general (for non-object dtypes), it seems we are strict about which NA-like values we accept here:

In [7]: idx = pd.Index([0.1, 0.2, None], dtype="float64")

In [8]: idx.get_indexer([np.nan])
Out[8]: array([2])

In [9]: idx.get_indexer([None])
Out[9]: array([-1])

But I agree that for the migration from object dtype, we should maybe be more flexible here, and also find None values.

Even then, it will not exactly preserve behaviour, because right now an object dtype Index can hold both None and NaN (and get_indexer finds them separately). Those will both become NaN, so get_indexer will match either one. There is no way around that.
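To make the object-dtype point concrete, here is a minimal sketch that assumes a recent pandas and NumPy are installed. The variable names are illustrative only, and the exact positions returned may depend on the pandas version. It builds an object-dtype Index holding both None and NaN and looks each one up on its own, then shows a float64 Index, where None is already coerced to NaN at construction.

```python
import numpy as np
import pandas as pd

# Object dtype keeps None and NaN as two distinct missing-value markers.
obj_idx = pd.Index(["a", None, np.nan], dtype=object)
none_pos = obj_idx.get_indexer([None])    # position of the None entry
nan_pos = obj_idx.get_indexer([np.nan])   # position of the NaN entry
print(none_pos, nan_pos)

# A non-object dtype stores only one marker: None is coerced to NaN
# when the Index is built, so the distinction is lost at that point.
float_idx = pd.Index([0.1, None], dtype="float64")
print(float_idx.isna())
```

Once both markers collapse into NaN, a lookup cannot tell which one the caller originally meant, which is why the old behaviour cannot be preserved exactly.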

@Dhayanidhi-M

Hi,

If None values are converted to np.nan when the data is stored, then get_indexer should also convert None to np.nan in the target before searching.
Adding the following lines to the get_indexer function in base.py might help:

# Replace None in the lookup target with np.nan before searching
for i, v in enumerate(target):
    if v is None:
        target[i] = np.nan
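The loop above modifies the caller's target in place. A non-mutating variant could look like the following sketch. Note that `normalize_missing` is a hypothetical helper written for illustration, not a pandas function, and it is not presented as the fix for this issue.

```python
import math


def normalize_missing(target):
    """Return a new list with every None replaced by NaN.

    Hypothetical helper: the input sequence is left untouched.
    """
    return [math.nan if v is None else v for v in target]


lookup = ["a", None]
normalized = normalize_missing(lookup)
print(normalized)  # None is replaced by NaN in the copy
print(lookup)      # the original list still contains None
```

Using `is None` rather than `== None` avoids calling a custom `__eq__` on the elements, and returning a new list means the caller's data is never changed as a side effect.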

@hvsesha

hvsesha commented Nov 7, 2023

@dhayanidhi
It works after I made that change to the get_indexer function in base.py.

@hvsesha

hvsesha commented Nov 7, 2023

@phofl Kindly check from your side

@phofl
Member Author

phofl commented Nov 7, 2023

That needs a fix in a different place. This is probably not a good issue to get started with; I would recommend that you look for issues labeled Good first issue.

4 participants