Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow dictionaries to be passed to pandas.Series.str.replace #56175

Merged
merged 22 commits into from
Feb 12, 2024
Merged

ENH: Allow dictionaries to be passed to pandas.Series.str.replace #56175

merged 22 commits into from
Feb 12, 2024

Conversation

rmhowe425
Copy link
Contributor

@rmhowe425
Copy link
Contributor Author

@rhshadrach pinging on green

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

n: int = -1,
case: bool | None = None,
flags: int = 0,
regex: bool = False,
repl_kwargs: dict | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of adding a new keyword, we want to allow pat to be a dictionary (in which case repl must be None).

@rmhowe425
Copy link
Contributor Author

@rhshadrach pinging on green

@rmhowe425 rmhowe425 requested a review from rhshadrach November 30, 2023 05:12
@rmhowe425 rmhowe425 requested a review from rhshadrach December 3, 2023 20:51
@rmhowe425
Copy link
Contributor Author

@rhshadrach pinging on green

pandas/core/strings/accessor.py Outdated Show resolved Hide resolved
Copy link
Contributor

github-actions bot commented Jan 9, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 9, 2024
@rmhowe425
Copy link
Contributor Author

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Still interested in working on this PR. Will update the PR and address the above concerns tomorrow afternoon EST

@rmhowe425 rmhowe425 requested a review from rhshadrach January 20, 2024 03:35
@rmhowe425
Copy link
Contributor Author

@rhshadrach Pinging on green

result = res_output.array._str_replace(
key, value, n=n, case=case, flags=flags, regex=regex
)
res_output = self._wrap_result(result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you can call _wrap_result just once at the end, rather than inside the for loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach wouldn't you need the for loop in the case that pat contained multiple key : value pairs of strings to be replaced?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - I'm not suggesting to remove the for loop entirely. Just to call self._wrap_result once after the for loop is done rather than every iteration. If you think this is incorrect, let me know and I can take a closer look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So after doing a bit of debugging it looks like we need to call _wrap_result after each iteration so we can save the output of our string replace and update res_output.

self._data is a Series and _str_replace() returns an NDArray. Since we can't update self._data.array, we need a container to save the output of our string replace, so we're converting it to a Series using _wrap_result() and then updating our container.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - thanks for checking!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than doing the loop here, is there any immediate advantage to passing the dict onto the arrays _str_replace method? like avoiding the _wrap_result

IIUC the accessors should only be validating the passed parameters, defining the "pandas string API", providing the documentation and wrapping the array result into a Series.

IMO the implementation should be at array level and then can be overridden if the array types can be optimized or use native methods.

For example, maybe using "._str_map" could be faster for object type and maybe pyarrow.compute.replace_substring_regex could be used for arrow backed strings?

The array level optimizations need not be in this PR though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The array level optimizations need not be in this PR though.

I think this is a good idea, but agreed it need not be here (this is perf neutral compared to the status quo). If not tackled here, we can throw up an issue noting the performance improvement. @rmhowe425 - thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! I think this idea is deserving of a separate issue. Happy to work that issue as well!

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach added Enhancement Strings String extension data type and string data and removed Stale labels Feb 8, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone Feb 8, 2024
doc/source/whatsnew/v2.2.0.rst Outdated Show resolved Hide resolved
@rmhowe425
Copy link
Contributor Author

pre-commit.ci autofix

@rmhowe425
Copy link
Contributor Author

@rhshadrach Pinging on green

Comment on lines 1418 to 1419
pat: str | re.Pattern | dict | None = None,
repl: str | Callable | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC if a dict is passed then repl is not needed. When would pat be None and why is it the default?

@rmhowe425 rmhowe425 requested a review from rhshadrach February 10, 2024 01:19
@rhshadrach rhshadrach merged commit 8871b1c into pandas-dev:main Feb 12, 2024
47 checks passed
@rhshadrach
Copy link
Member

Thanks @rmhowe425!

@rmhowe425 rmhowe425 deleted the dev/Series/str-replace branch February 17, 2024 17:19
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
@kuraga
Copy link

kuraga commented May 12, 2024

Is order behavior undefined?
Is it worthy to mention?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Allow dictionaries to be passed to pandas.Series.str.replace
4 participants