ENH: Allow dictionaries to be passed to pandas.Series.str.replace #56175
@rhshadrach pinging on green
Thanks for the PR!
pandas/core/strings/accessor.py
Outdated
n: int = -1,
case: bool | None = None,
flags: int = 0,
regex: bool = False,
repl_kwargs: dict | None = None,
I think instead of adding a new keyword, we want to allow pat to be a dictionary (in which case repl must be None).
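A minimal sketch of the calling convention suggested here, written in plain Python over a list of strings rather than against pandas internals. The function name replace_strings and its error messages are hypothetical; only the rule that repl must be None when pat is a dictionary comes from the comment above.

```python
from __future__ import annotations


def replace_strings(
    values: list[str],
    pat: str | dict[str, str],
    repl: str | None = None,
) -> list[str]:
    """Literal replacement over a list of strings (hypothetical helper, not pandas API)."""
    if isinstance(pat, dict):
        # Dictionary form: each key is a pattern, each value its replacement.
        if repl is not None:
            raise ValueError("repl must be None when pat is a dictionary")
        pairs = pat
    else:
        # Classic form: a single pattern together with its replacement.
        if repl is None:
            raise ValueError("repl is required when pat is not a dictionary")
        pairs = {pat: repl}

    result = list(values)
    for old, new in pairs.items():
        result = [s.replace(old, new) for s in result]
    return result


print(replace_strings(["foo", "bar"], {"f": "g", "r": "z"}))  # ['goo', 'baz']
print(replace_strings(["foo"], "o", "0"))  # ['f00']
```

Reusing pat for the dictionary keeps the signature unchanged and avoids an extra keyword such as repl_kwargs, at the cost of one additional validation branch.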
@rhshadrach pinging on green
@rhshadrach pinging on green
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Still interested in working on this PR. Will update the PR and address the above concerns tomorrow afternoon EST.
@rhshadrach Pinging on green
result = res_output.array._str_replace(
    key, value, n=n, case=case, flags=flags, regex=regex
)
res_output = self._wrap_result(result)
I believe you can call _wrap_result just once at the end, rather than inside the for loop.
@rhshadrach wouldn't you need the for loop in the case that pat contained multiple key : value pairs of strings to be replaced?
Correct - I'm not suggesting to remove the for loop entirely. Just to call self._wrap_result once after the for loop is done rather than every iteration. If you think this is incorrect, let me know and I can take a closer look.
So after doing a bit of debugging, it looks like we need to call _wrap_result after each iteration so we can save the output of our string replace and update res_output.
self._data is a Series and _str_replace() returns an NDArray. Since we can't update self._data.array, we need a container to save the output of our string replace, so we're converting it to a Series using _wrap_result() and then updating our container.
I see - thanks for checking!
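The reasoning in this exchange can be modelled with a toy example. The classes ToyArray and ToySeries below are hypothetical stand-ins, not pandas classes: ToyArray._str_replace returns a plain list (playing the role of the raw array), so the result has to be wrapped again before the next replacement can reach it through .array.

```python
class ToyArray:
    """Hypothetical stand-in for the array that backs a Series."""

    def __init__(self, values):
        self.values = list(values)

    def _str_replace(self, pat, repl):
        # Returns a plain list, not a ToyArray, mirroring the observation
        # that _str_replace() hands back a raw array rather than a Series.
        return [v.replace(pat, repl) for v in self.values]


class ToySeries:
    """Hypothetical stand-in for a Series that owns a ToyArray."""

    def __init__(self, values):
        self.array = ToyArray(values)


def wrap_result(raw):
    # Plays the role of _wrap_result(): rebuild a container exposing .array.
    return ToySeries(raw)


def replace_with_dict(series, pat):
    res_output = series
    for key, value in pat.items():
        result = res_output.array._str_replace(key, value)
        # Re-wrap so the next iteration can call .array._str_replace again.
        res_output = wrap_result(result)
    return res_output


out = replace_with_dict(ToySeries(["foo", "bar"]), {"f": "g", "r": "z"})
print(out.array.values)  # ['goo', 'baz']
```

In this model the per-iteration wrap is needed only because the raw result lacks the replacement method; whether the same holds for every pandas array type is not established by the sketch.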
Rather than doing the loop here, is there any immediate advantage to passing the dict on to the array's _str_replace method, such as avoiding the _wrap_result call?
IIUC the accessors should only be validating the passed parameters, defining the "pandas string API", providing the documentation and wrapping the array result into a Series.
IMO the implementation should be at array level and then can be overridden if the array types can be optimized or use native methods.
For example, maybe using "._str_map" could be faster for object type and maybe pyarrow.compute.replace_substring_regex could be used for arrow backed strings?
The array level optimizations need not be in this PR though.
> The array level optimizations need not be in this PR though.
I think this is a good idea, but agreed it need not be here (this is perf neutral compared to the status quo). If not tackled here, we can throw up an issue noting the performance improvement. @rmhowe425 - thoughts?
Agreed! I think this idea is deserving of a separate issue. Happy to work that issue as well!
lgtm
pre-commit.ci autofix
for more information, see https://pre-commit.ci
@rhshadrach Pinging on green
pandas/core/strings/accessor.py
Outdated
pat: str | re.Pattern | dict | None = None,
repl: str | Callable | None = None,
IIUC if a dict is passed then repl is not needed. When would pat be None and why is it the default?
Thanks @rmhowe425!
Is order behavior undefined?
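One way to see why this question matters: if the dictionary is applied one pair at a time, as in the for loop reviewed earlier in this thread, the result depends on the order of the pairs whenever one replacement produces text that a later pattern matches. Python dictionaries preserve insertion order (guaranteed since Python 3.7), so a sequential loop is deterministic but order-sensitive. The helper apply_in_order is hypothetical and uses plain str.replace, not pandas.

```python
def apply_in_order(text: str, pat: dict) -> str:
    """Apply each key -> value replacement in dictionary insertion order."""
    for old, new in pat.items():
        text = text.replace(old, new)
    return text


# "a" becomes "b" first, then every "b" (including the new one) becomes "c".
print(apply_in_order("ab", {"a": "b", "b": "c"}))  # cc

# Same pairs, opposite order: "b" becomes "c" first, then "a" becomes "b".
print(apply_in_order("ab", {"b": "c", "a": "b"}))  # bc
```

Under this sequential model the behavior is well defined but callers need to order overlapping pairs deliberately.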
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.