Implement Arrow String Array that is compatible with NumPy semantics #54533

phofl · 2023-08-13T19:59:00Z

I'll do a couple of precursor PRs first to reduce the diff

Context: #54792

jorisvandenbossche · 2023-08-13T20:44:04Z

pandas/core/arrays/string_arrow.py

+    def __getattribute__(self, item):
+        if item in ArrowStringArrayMixin.__dict__:
+            return partial(getattr(ArrowStringArrayMixin, item), self)
+        return super().__getattribute__(item)


What's the reason for doing it this way instead of actually adding the mixin class inheritance?

ArrowStringArray inherits from ArrowExtensionArray as well, so that will raise an error because the method resolution order can't be figured out. Not sure if there is a workaround that I don't know of

Ah, yes, giving a "diamond" inheritance. Yeah, that doesn't work. I think you could fix that by making a BaseArrowExtensionArray that doesn't yet inherit from the mixin, and then ArrowExtensionArray adds in the mixin. And then here ArrowStringArray could inherit from that base class, or something like that? Not sure that is worth it

Can you add a brief comment in the code with that explanation?
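To make the inheritance problem in this thread concrete, here is a minimal sketch using toy classes. All class names (`ArrowMixin`, `ObjectMixin`, `ArrowExt`, `BaseArrowExt`, and so on) are hypothetical stand-ins, not the real pandas classes, and the dunder guard in `__getattribute__` is an extra precaution added only for this toy example. It shows the MRO error, the `__getattribute__` delegation from the diff, and the base-class split suggested above.

```python
from functools import partial


class ArrowMixin:
    def method(self):
        return "arrow"


class ObjectMixin:
    def method(self):
        return "object"


# Toy version of the current layout: the Arrow base already has the mixin.
class ArrowExt(ArrowMixin):
    pass


class ArrowString(ObjectMixin, ArrowExt):
    pass


# Listing the mixin first again cannot be linearized: the bases ask for
# ArrowMixin before ArrowString, but ArrowString's own MRO puts it after.
try:
    class Broken(ArrowMixin, ArrowString):
        pass
except TypeError as exc:
    mro_error = exc
else:
    mro_error = None


# Approach from the diff: delegate attribute lookup to the mixin by hand.
class Delegating(ArrowString):
    def __getattribute__(self, item):
        # The dunder guard is an assumption added for this toy example.
        if item in ArrowMixin.__dict__ and not item.startswith("__"):
            return partial(getattr(ArrowMixin, item), self)
        return super().__getattribute__(item)


# Suggested alternative: a base class that does not inherit the mixin yet.
class BaseArrowExt:
    pass


class ArrowExt2(ArrowMixin, BaseArrowExt):
    pass


class ArrowString2(ObjectMixin, BaseArrowExt):
    pass


class NumpySemantics(ArrowMixin, ArrowString2):
    pass
```

In this sketch both `Delegating().method()` and `NumpySemantics().method()` resolve to the mixin implementation, while `ArrowString().method()` still uses the object-based one.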

jorisvandenbossche · 2023-08-13T21:01:10Z

You will also need to update the na_value use for the dtype:

--- a/pandas/core/arrays/string_.py
+++ b/pandas/core/arrays/string_.py
@@ -101,7 +101,10 @@ class StringDtype(StorageExtensionDtype):
     #: StringDtype().na_value uses pandas.NA
     @property
     def na_value(self) -> libmissing.NAType:
-        return libmissing.NA
+        if self.storage == "pyarrow_numpy":
+            return np.nan
+        else:
+            return libmissing.NA

That ensures, e.g., that getitem returns NaN instead of pd.NA
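The effect of the suggested change can be sketched without pandas. The class `ToyStringDtype` and the sentinel `NA` below are hypothetical stand-ins for `StringDtype` and `pd.NA`; only the branching on `storage` is taken from the diff above.

```python
import math


class _NAType:
    """Hypothetical stand-in for the pd.NA singleton."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


class ToyStringDtype:
    """Toy dtype whose missing-value sentinel depends on the storage."""

    def __init__(self, storage):
        self.storage = storage

    @property
    def na_value(self):
        # Same branching as the suggested diff: NaN for the NumPy-semantics
        # storage, the NA singleton for everything else.
        if self.storage == "pyarrow_numpy":
            return float("nan")
        return NA


nan_sentinel = ToyStringDtype("pyarrow_numpy").na_value
na_sentinel = ToyStringDtype("pyarrow").na_value
```

Anything that looks up the sentinel through the dtype, such as scalar access on a missing element, would then see NaN for the `"pyarrow_numpy"` storage.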

jbrockmendel · 2023-08-14T11:11:22Z

Flying today, will look tomorrow/thursday

phofl · 2023-08-14T15:01:42Z

@jorisvandenbossche would you be ok with doing this as a follow up? I tested this on my train ride but it broke a bunch of tests (obviously), which would make review even more complicated

phofl · 2023-08-16T21:11:56Z

See #54585 for the na value

jorisvandenbossche · 2023-08-21T07:56:56Z

pandas/core/arrays/string_arrow.py

+    def _result_converter(values, na=None):
+        if not isna(na):
+            values = values.fill_null(bool(na))
+        return ArrowExtensionArray(values).to_numpy(na_value=np.nan)


Why only if not isna(na)? I assume we will generally always have to fill nulls? (otherwise the to_numpy with nan gives object dtype)

For example (using this branch):

In [19]: pd.Series(["a", "b", None], dtype="string[pyarrow]").str.startswith("a")
Out[19]:
0     True
1    False
2     <NA>
dtype: boolean

In [20]: pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]").str.startswith("a")
Out[20]:
0     True
1    False
2      NaN
dtype: object

If we want the second example to return a proper numpy bool dtype that can be used for boolean indexing, we should probably convert the NaN into False (either with the fill_null call or with to_numpy(na_value=False)).

Not sure about that. This would break behaviour compared to object dtype, which is something I tried to avoid (and it would break a huge number of tests). I think it makes more sense to keep it as is for consistency's sake

Ah, OK, so also the current object-dtype string methods propagate the NaN, wasn't aware of that:

In [5]: pd.Series(["a", "b", None], dtype="object").str.startswith("a")
Out[5]:
0     True
1    False
2     None
dtype: object

Personally, I think that's something we should change, but that's for another issue then.

Yeah happy to discuss that, but maybe better as a breaking change for 3.0? Don't know though. Let's open an issue about that

Opened #54805 for the question whether string predicate methods like startswith should propagate NaN or not
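The two policies debated in this thread can be sketched in plain Python. The helpers `startswith_sketch` and `usable_as_mask` are hypothetical and only emulate the semantics discussed here; they are not the pandas or pyarrow implementation.

```python
import math


def _is_missing(x):
    """True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def startswith_sketch(values, prefix, na=None):
    """Emulate a predicate that propagates NaN unless `na` is given."""
    result = []
    for v in values:
        if _is_missing(v):
            # Propagate NaN by default; fill with bool(na) when requested.
            result.append(float("nan") if _is_missing(na) else bool(na))
        else:
            result.append(v.startswith(prefix))
    return result


def usable_as_mask(result):
    """A boolean mask needs real booleans in every position."""
    return all(isinstance(x, bool) for x in result)


propagated = startswith_sketch(["a", "b", None], "a")
filled = startswith_sketch(["a", "b", None], "a", na=False)
```

Here `propagated` keeps a NaN in the last position, which is why the result cannot be a plain bool array, while `filled` contains only booleans and could be used directly as a mask.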

jorisvandenbossche · 2023-08-21T07:58:52Z

pandas/core/arrays/string_arrow.py

+    def __getattribute__(self, item):
+        if item in ArrowStringArrayMixin.__dict__:
+            return partial(getattr(ArrowStringArrayMixin, item), self)
+        return super().__getattribute__(item)


Can you add a brief comment in the code with that explanation?

jorisvandenbossche · 2023-08-21T08:03:03Z

pandas/core/config_init.py

@@ -500,7 +500,7 @@ def use_inf_as_na_cb(key) -> None:
        "string_storage",
        "python",
        string_storage_doc,
-        validator=is_one_of_factory(["python", "pyarrow"]),
+        validator=is_one_of_factory(["python", "pyarrow", "pyarrow_numpy"]),


Do we want to allow the user to set this to the new option as well? We also have the new pd.options.future.infer_string option to enable the future string dtype.

I'd say yes, you might want to astype columns after operations for example

Probably something to discuss further in a follow-up issue, but I would expect that if you opt in to the future string dtype with pd.options.future.infer_string, this would also automatically set "pyarrow_numpy" as the default string storage for operations that depend on it (like doing astype("string"))

Yeah agree, let's do this as a follow up

Opened #54793 for this interaction between pd.options.future.infer_string and the default string_storage
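For readers unfamiliar with the validator in the diff above, here is a rough sketch of the idea. The function `is_one_of` is a hypothetical toy stand-in written for illustration; it is not pandas' `is_one_of_factory`, and its error message is an assumption.

```python
def is_one_of(legal_values):
    """Return a validator that rejects values outside `legal_values`."""

    def validator(value):
        if value not in legal_values:
            raise ValueError(
                f"Value must be one of {legal_values}, got {value!r}"
            )

    return validator


# After the change, the new storage name is accepted alongside the old ones.
validate_storage = is_one_of(["python", "pyarrow", "pyarrow_numpy"])
validate_storage("pyarrow_numpy")  # passes silently

try:
    validate_storage("arrow")
except ValueError as exc:
    rejected = str(exc)
else:
    rejected = None
```

The design question raised in the thread is separate from validation: whether opting in to the future string dtype should also change the default that this option reports.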

jorisvandenbossche · 2023-08-21T08:22:15Z

pandas/tests/arrays/string_/test_string.py

-    tm.assert_extension_array_equal(result, expected)
+
+    if dtype.storage == "pyarrow_numpy":
+        expected = np.array([False, False, False], dtype=object)


Should this be bool instead of object?

Yeah that's fair, fixed that.

jorisvandenbossche · 2023-08-21T08:35:44Z

pandas/tests/io/conftest.py

+    """
+    Parametrized fixture for pd.options.mode.string_storage.
+
+    * 'python'
+    * 'pyarrow'
+    """


I assume this is added to override the version in the top-level conftest.py to keep only testing "python" and "pyarrow", and that is because the IO methods don't yet properly support "pyarrow_numpy"?
If so, can you add a comment about that here?

It does work, but tests are tiresome and don't really add value.

The previous tests just checked that string_storage isn't ignored, having 2 or 3 options doesn't make a difference

jorisvandenbossche · 2023-08-21T08:44:47Z

pandas/tests/strings/test_case_justify.py

+    if any_string_dtype == "string[pyarrow_numpy]":
+        pytest.skip("Arrow logic is different")
    s = Series(["a", "bb", "cccc", "ddddd", "eeeeee"], dtype=any_string_dtype)


So in this case the "string[pyarrow]" version (ArrowStringArray) was still falling back to the python object-dtype implementation? (since we are not skipping that one)

This might be a case where we could add an option to the pyarrow compute kernel that determines if it's aligned left or right for uneven centering?

That would be the cleanest solution. And yes, "string[pyarrow]" fell back to python objects

Not for this PR, but for the default string dtype in 3.0 I think we should consider preserving the current behaviour (falling back to the python objects for now, if necessary), provided our preferred long-term option is to keep this behaviour through a keyword in pyarrow.
Otherwise the behaviour would change again later, once pyarrow supports that.

Regarding the above center behaviour, opened #54807 to discuss changing this to fall back for now (to preserve the behaviour)
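The "uneven centering" question can be illustrated with plain Python strings. `str.center` is the built-in that the object-dtype fallback relies on; `center_extra_right` is a hypothetical alternative written only for comparison, and the thread does not state which rule the pyarrow kernel applies.

```python
def center_extra_right(text, width, fillchar=" "):
    """Center `text`, always placing the odd leftover pad on the right."""
    pad = width - len(text)
    if pad <= 0:
        return text
    left = pad // 2
    return fillchar * left + text + fillchar * (pad - left)


values = ["a", "bb", "cccc", "ddddd", "eeeeee"]

# Built-in behaviour: where the odd pad character goes depends on the
# parity of the target width.
python_centered = [v.center(5) for v in values]

# Alternative rule: the odd pad character always goes on the right.
right_biased = [center_extra_right(v, 5) for v in values]
```

With a width of 5 the two rules agree for `"a"` but differ for `"bb"` and `"cccc"`, which is the kind of mismatch that makes a shared expected result impossible in the test above.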

jorisvandenbossche · 2023-08-21T11:17:47Z

Something that also still needs to be done (for this topic, not necessarily in this PR) is to ensure that the constructors and IO methods that follow pd.options.future.infer_string now use this pyarrow_numpy dtype instead of the plain pyarrow one.

phofl · 2023-08-21T17:57:09Z

cc @jbrockmendel this should be ready behaviour-wise now

phofl · 2023-08-22T18:40:40Z

@jbrockmendel @jorisvandenbossche are we good to merge here?

jorisvandenbossche · 2023-08-23T16:43:04Z

Yes, let's merge this. You have some follow-up PRs lined up anyway, so any further comments can be addressed in those, I think.

(and I'll open some issues for some of the comments I made)

jorisvandenbossche · 2023-08-23T16:45:54Z

@meeseeksdev backport to 2.1.x

phofl · 2023-08-23T17:00:07Z

I'll put up a pr for the infer option and the whatsnew later today

jorisvandenbossche · 2023-08-28T10:18:05Z

Opened #54792 as a general follow-up issue tracking the new string dtype topic. And will open some other issues for a few of the remaining discussion items above.

phofl added 4 commits August 13, 2023 14:25

Start new string array

b24afc9

Add missing methods

b306c6f

Implement Arrow String Array that is compatible with NumPy semantics

2dbcfb0

Move methods

d9e61e5

jorisvandenbossche mentioned this pull request Aug 13, 2023

REF: Move methods that can be shared with new string dtype #54534

Merged

jorisvandenbossche reviewed Aug 13, 2023

Refactor

3188c25

jorisvandenbossche mentioned this pull request Aug 13, 2023

REF: Move checks to object into a variable #54536

Merged

phofl added 3 commits August 14, 2023 12:14

Merge remote-tracking branch 'upstream/main' into string_array_numpy_…

df231f0

…semantics # Conflicts: # pandas/core/arrays/_arrow_string_mixins.py # pandas/tests/strings/__init__.py # pandas/tests/strings/test_strings.py

Refactor

cd19bfb

Remove

c73c6b0

Fix

6b26309

phofl and others added 3 commits August 14, 2023 17:21

Update

da6d67c

Merge remote-tracking branch 'upstream/main' into string_array_numpy_…

4be0ee8

…semantics_2 # Conflicts: # pandas/tests/strings/test_find_replace.py

Fix

6cf2639

jorisvandenbossche reviewed Aug 21, 2023

jorisvandenbossche requested a review from mroeschke August 21, 2023 08:50

phofl and others added 7 commits August 21, 2023 10:51

Update pandas/core/arrays/string_arrow.py

0c260fb

Co-authored-by: Joris Van den Bossche

Add comment

8df070a

Use bool instead of object

1e732c9

Merge remote-tracking branch 'upstream/main' into string_array_numpy_…

5b0d24c

…semantics # Conflicts: # pandas/tests/arrays/string_/test_string.py # pandas/tests/extension/test_string.py

Update docstring

3a913e1

Merge remote-tracking branch 'upstream/main' into string_array_numpy_…

a95d2ee

…semantics # Conflicts: # pandas/tests/strings/__init__.py

Fix

ec56cef

lithomas1 mentioned this pull request Aug 21, 2023

Backport PR #54586 on branch 2.1.x (REF: Refactor conversion of na value) #54658

Merged

jbrockmendel reviewed Aug 22, 2023

pandas/tests/strings/test_case_justify.py

Update test_case_justify.py

d05c51d

jbrockmendel reviewed Aug 22, 2023

pandas/core/arrays/string_arrow.py

jbrockmendel reviewed Aug 22, 2023

pandas/core/arrays/string_arrow.py

jbrockmendel reviewed Aug 22, 2023

pandas/core/arrays/string_arrow.py

Update

aad3d2e

jorisvandenbossche reviewed Aug 22, 2023

pandas/tests/strings/test_case_justify.py

Update pandas/tests/strings/test_case_justify.py

5383eeb

Co-authored-by: Joris Van den Bossche

jorisvandenbossche approved these changes Aug 23, 2023

jorisvandenbossche merged commit 00f79a3 into pandas-dev:main Aug 23, 2023

jorisvandenbossche added this to the 2.1 milestone Aug 23, 2023

meeseeksmachine mentioned this pull request Aug 23, 2023

Backport PR #54533 on branch 2.1.x (Implement Arrow String Array that is compatible with NumPy semantics) #54713

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 23, 2023

Backport PR pandas-dev#54533: Implement Arrow String Array that is co…

a822b06

…mpatible with NumPy semantics

phofl deleted the string_array_numpy_semantics branch August 23, 2023 16:59

This was referenced Aug 24, 2023

Infer strings as pyarrow_numpy backed strings #54720

Merged

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

This was referenced Apr 27, 2024

String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451

Merged

DISC: Consider not requiring PyArrow in 3.0 #57073

Open

jorisvandenbossche mentioned this pull request May 7, 2024

PDEP-14: Dedicated string data type for pandas 3.0 #58551

Merged
