-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Arrow String Array that is compatible with NumPy semantics #54533
Implement Arrow String Array that is compatible with NumPy semantics #54533
Conversation
pandas/core/arrays/string_arrow.py
Outdated
def __getattribute__(self, item): | ||
if item in ArrowStringArrayMixin.__dict__: | ||
return partial(getattr(ArrowStringArrayMixin, item), self) | ||
return super().__getattribute__(item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for doing it this way instead of actually adding the mixin class inheritance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrowStringArray inherits from ArrowExtensionArray as well, that will raise an error because the order can't be figured out. Not suer if there is a workaround that I don't know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, giving a "diamond" inheritance. Yeah, that doesn't work. I think you could fix that by making a BaseArrowExtensionArray that doesn't yet inherit from the mixin, and then ArrowExtensionArray adds in the mixin. And then here ArrowStringArray could inherit from that base class, or something like that? Not sure that is worth it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a brief comment in the code with that explanation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
You will also need to update the na_value use for the dtype: --- a/pandas/core/arrays/string_.py
+++ b/pandas/core/arrays/string_.py
@@ -101,7 +101,10 @@ class StringDtype(StorageExtensionDtype):
#: StringDtype().na_value uses pandas.NA
@property
def na_value(self) -> libmissing.NAType:
- return libmissing.NA
+ if self.storage == "pyarrow_numpy":
+ return np.nan
+ else:
+ return libmissing.NA That eg ensures that getitem returns a NaN instead of pd.NA |
Flying today, will look tomorrow/thursday |
@jorisvandenbossche would you be ok with doing this as a follow up? I tested this on my train ride but it broke a bunch of tests (obviously), which would make review even more complicated |
See #54585 for the na value |
pandas/core/arrays/string_arrow.py
Outdated
def _result_converter(values, na=None): | ||
if not isna(na): | ||
values = values.fill_null(bool(na)) | ||
return ArrowExtensionArray(values).to_numpy(na_value=np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only if not isna(na)
? I assume we will generally always have to fill nulls? (otherwise the to_numpy
with nan gives object dtype)
For example (using this branch):
n [19]: pd.Series(["a", "b", None], dtype="string[pyarrow]").str.startswith("a")
Out[19]:
0 True
1 False
2 <NA>
dtype: boolean
In [20]: pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]").str.startswith("a")
Out[20]:
0 True
1 False
2 NaN
dtype: object
If we want that the second example has a proper numpy bool dtype that can be used for boolean indexing, we probably should convert the NaN into False? (either with the fill_null
call or with to_numpy(na_value=False)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about that. This would break behaviour compared to object dtype, which is something I tried to avoid (and will break a huge number of tests). I think it makes more sense to keep as is for consistency sake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, so also the current object-dtype string methods propagate the NaN, wasn't aware of that:
In [5]: pd.Series(["a", "b", None], dtype="object").str.startswith("a")
Out[5]:
0 True
1 False
2 None
dtype: object
Personally, I think that's something we should change, but that's for another issue then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah happy to discuss that, but maybe better as a breaking change for 3.0? Don't know though. Let's open an issue about that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah happy to discuss that, but maybe better as a breaking change for 3.0? Don't know though. Let's open an issue about that
Opened #54805 for the question whether string predicate methods like startswith should propagate NaN or not
pandas/core/arrays/string_arrow.py
Outdated
def __getattribute__(self, item): | ||
if item in ArrowStringArrayMixin.__dict__: | ||
return partial(getattr(ArrowStringArrayMixin, item), self) | ||
return super().__getattribute__(item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a brief comment in the code with that explanation?
@@ -500,7 +500,7 @@ def use_inf_as_na_cb(key) -> None: | |||
"string_storage", | |||
"python", | |||
string_storage_doc, | |||
validator=is_one_of_factory(["python", "pyarrow"]), | |||
validator=is_one_of_factory(["python", "pyarrow", "pyarrow_numpy"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the user to allow setting this for the new option as well? Because we also have the new pd.options.future.infer_strings
option to enable the future string dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say yes, you might want to astype columns after operations for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably something to further discuss in a follow up issue, but I would expect that if you opt-in for the future string dtype with pd.options.future.infer_strings
, that this would also automatically set the "pyarrow_numpy" string storage as default for operations that depend on that (like doing astype("string")
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agree, let's do this as a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #54793 for this interaction between pd.options.future.infer_strings
and the default string_storage
tm.assert_extension_array_equal(result, expected) | ||
|
||
if dtype.storage == "pyarrow_numpy": | ||
expected = np.array([False, False, False], dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be bool instead of object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's fair, fixed that.
""" | ||
Parametrized fixture for pd.options.mode.string_storage. | ||
|
||
* 'python' | ||
* 'pyarrow' | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is added to override the version in the top-level conftest.py to keep only testing "python" and "pyarrow", and that is because the IO methods don't yet properly support "pyarrow_numpy"?
If so, can you add a comment about that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does work, but tests are tiresome and don't really add value.
The previous tests just checked that string_storage isn't ignored, having 2 or 3 options doesn't make a difference
if any_string_dtype == "string[pyarrow_numpy]": | ||
pytest.skip("Arrow logic is different") | ||
s = Series(["a", "bb", "cccc", "ddddd", "eeeeee"], dtype=any_string_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this case the "string[pyarrow]"
version (ArrowStringArray) was still falling back to the python object-dtype implementation? (since we are not skipping that one)
This might be a case where we could add an option to the pyarrow compute kernel that determines if it's aligned left or right for uneven centering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be the cleanest solution, yes "string[pyarrow]" fell back to python objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but I think we should consider for the default string dtype for 3.0 to preserve the current behaviour (and so if necessary for now fallback to the python objects), if our preferred option would be to keep this behaviour long term through a keyword in pyarrow.
(otherwise that would be a change in behaviour again later, when pyarrow would support that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the above center
behaviour, opened #54807 to discuss changing this to fall back for now (to preserve the behaviour)
Co-authored-by: Joris Van den Bossche <[email protected]>
…semantics # Conflicts: # pandas/tests/arrays/string_/test_string.py # pandas/tests/extension/test_string.py
…semantics # Conflicts: # pandas/tests/strings/__init__.py
Something that also still needs to be done (for this topic, not necessarily in this PR) is to ensure that the constructors and IO methods that follow |
cc @jbrockmendel this should be ready behaviour wise now |
@jbrockmendel @jorisvandenbossche are we good to merge here? |
Co-authored-by: Joris Van den Bossche <[email protected]>
Yes, let's merge this. You have some follow-up PRs in the line anyway, so any further comment can be addressed in those, I think. (and I'll open some issues for some of the comments I made) |
@meeseeksdev backport to 2.1.x |
…mpatible with NumPy semantics
I'll put up a pr for the infer option and the whatsnew later today |
… is compatible with NumPy semantics) (#54713) Backport PR #54533: Implement Arrow String Array that is compatible with NumPy semantics Co-authored-by: Patrick Hoefler <[email protected]>
Opened #54792 as a general follow-up issue tracking the new string dtype topic. And will open some other issues for a few of the remaining discussion items above. |
cc @jbrockmendel @jorisvandenbossche
I'll do a couple of pre-cursors to reduce the diff
Context: #54792