Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Ensure "string[pyarrow]" type is preserved when calling extractall #55534

Conversation

ABizzinotto
Copy link
Contributor

When calling extractall on an arrow string series, although the operation itself works, the returned dataframe column is of dtype object instead of preserving the arrow string type. This happens since the resulting dataframe was generated with the original array dtype only if it was an instance of StringDType. Expanded the conditional to also preserve the dtype on ArrowDType.

@ABizzinotto ABizzinotto force-pushed the BUG-53846-extractall-with-arrow-returns-object-dtype branch from 74d398c to e5ef14b Compare October 15, 2023 17:41
@@ -30,6 +30,7 @@ Bug fixes
- Fixed bug in :meth:`Index.insert` raising when inserting ``None`` into :class:`Index` with ``dtype="string[pyarrow_numpy]"`` (:issue:`55365`)
- Fixed bug in :meth:`Series.all` and :meth:`Series.any` not treating missing values correctly for ``dtype="string[pyarrow_numpy]"`` (:issue:`55367`)
- Fixed bug in :meth:`Series.rank` for ``string[pyarrow_numpy]`` dtype (:issue:`55362`)
- Fixed bug in :meth:`Series.str.extractall` for ``string[pyarrow]`` dtype being converted to object (:issue:`53846`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Fixed bug in :meth:`Series.str.extractall` for ``string[pyarrow]`` dtype being converted to object (:issue:`53846`)
- Fixed bug in :meth:`Series.str.extractall` for :class:`ArrowDtype` being converted to object (:issue:`53846`)

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks could you add a test for this?

@mroeschke mroeschke added Strings String extension data type and string data Arrow pyarrow functionality labels Oct 16, 2023
@ABizzinotto
Copy link
Contributor Author

Thank you @mroeschke, added

@ABizzinotto ABizzinotto requested a review from mroeschke October 18, 2023 05:39
@simonjayhawkins simonjayhawkins added Bug pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result labels Oct 18, 2023
@@ -2,6 +2,7 @@
import re

import numpy as np
import pyarrow as pa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to import this in test_extractall_preserves_dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke thanks, moved the import. I also added the requirement to the pypy and numpydev actions yaml files but some tests are still failing, so before making any additional changes, I thought I'd ask what else might need to be changed so ci can run. Looks like the meta.yaml file needs it added as a requirement as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding pyarrow to the dependency files you can structure the test like

def test_whatever():
    pa = pytest.importorskip("pyarrow")
    series = ...

So the test will be skipped if pyarrow is not installed or you will have pyarrow accessible as pa

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair enough 👍 Thank you. There's still one spec failing but it's also failing in main, it's due to a Numpy deprecation warning as far as I can tell.

@pytest.mark.parametrize(
"data, expected_dtype",
[
(Series(["abc", "ab"], dtype=ArrowDtype(pa.string())), "string[pyarrow]"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just test this case since I think the other cases are tested

@ABizzinotto ABizzinotto force-pushed the BUG-53846-extractall-with-arrow-returns-object-dtype branch from e2a0852 to f035363 Compare October 18, 2023 19:53
@ABizzinotto ABizzinotto force-pushed the BUG-53846-extractall-with-arrow-returns-object-dtype branch from f035363 to 03ed4a1 Compare October 18, 2023 19:53
@ABizzinotto ABizzinotto requested a review from mroeschke October 19, 2023 11:42
@mroeschke mroeschke added this to the 2.1.2 milestone Oct 19, 2023
@mroeschke mroeschke merged commit c9dc91d into pandas-dev:main Oct 19, 2023
32 of 33 checks passed
@mroeschke
Copy link
Member

Thanks @ABizzinotto

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 19, 2023
mroeschke pushed a commit that referenced this pull request Oct 19, 2023
…e is preserved when calling extractall) (#55597)

Backport PR #55534: BUG: Ensure "string[pyarrow]" type is preserved when calling extractall

Co-authored-by: Amanda Bizzinotto <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Bug pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.str.extractall with arrow string returns object dtype
3 participants