Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Arrow String Array that is compatible with NumPy semantics #54533
Implement Arrow String Array that is compatible with NumPy semantics #54533
Changes from 12 commits
b24afc9
b306c6f
2dbcfb0
d9e61e5
3188c25
df231f0
cd19bfb
c73c6b0
6b26309
da6d67c
4be0ee8
6cf2639
0c260fb
8df070a
1e732c9
5b0d24c
3a913e1
a95d2ee
ec56cef
31a59c4
ed77967
d05c51d
aad3d2e
5383eeb
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only
if not isna(na)
? I assume we will generally always have to fill nulls? (otherwise theto_numpy
with nan gives object dtype)For example (using this branch):
If we want that the second example has a proper numpy bool dtype that can be used for boolean indexing, we probably should convert the NaN into False? (either with the
fill_null
call or withto_numpy(na_value=False)
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about that. This would break behaviour compared to object dtype, which is something I tried to avoid (and will break a huge number of tests). I think it makes more sense to keep as is for consistency sake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, so also the current object-dtype string methods propagate the NaN, wasn't aware of that:
Personally, I think that's something we should change, but that's for another issue then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah happy to discuss that, but maybe better as a breaking change for 3.0? Don't know though. Let's open an issue about that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #54805 for the question whether string predicate methods like startswith should propagate NaN or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the user to allow setting this for the new option as well? Because we also have the new
pd.options.future.infer_strings
option to enable the future string dtype?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say yes, you might want to astype columns after operations for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably something to further discuss in a follow up issue, but I would expect that if you opt-in for the future string dtype with
pd.options.future.infer_strings
, that this would also automatically set the "pyarrow_numpy" string storage as default for operations that depend on that (like doingastype("string")
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agree, let's do this as a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #54793 for this interaction between
pd.options.future.infer_strings
and the defaultstring_storage