-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Allow dictionaries to be passed to pandas.Series.str.replace #56175
Changes from 3 commits
b1ea8c2
573e6e4
f9fe71e
77a579e
5de632a
d472b01
eceb234
5702ea9
ab28c9e
3be3c6b
f01728f
9ca546b
591e380
bae43ed
0359039
f0dcd55
f626cfb
1f50035
a3014de
95947a5
470a648
c6f2d3a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1394,12 +1394,13 @@ def fullmatch(self, pat, case: bool = True, flags: int = 0, na=None): | |
@forbid_nonstring_types(["bytes"]) | ||
def replace( | ||
self, | ||
pat: str | re.Pattern, | ||
repl: str | Callable, | ||
pat: str | re.Pattern | None = None, | ||
repl: str | Callable | None = None, | ||
n: int = -1, | ||
case: bool | None = None, | ||
flags: int = 0, | ||
regex: bool = False, | ||
repl_kwargs: dict | None = None, | ||
): | ||
r""" | ||
Replace each occurrence of pattern/regex in the Series/Index. | ||
|
@@ -1434,6 +1435,9 @@ def replace( | |
- If False, treats the pattern as a literal string | ||
- Cannot be set to False if `pat` is a compiled regex or `repl` is | ||
a callable. | ||
repl_kwargs : dict, default None | ||
<key : value> pairs representing strings to be replaced, and their | ||
updated values. | ||
|
||
Returns | ||
------- | ||
|
@@ -1447,6 +1451,8 @@ def replace( | |
* if `regex` is False and `repl` is a callable or `pat` is a compiled | ||
regex | ||
* if `pat` is a compiled regex and `case` or `flags` is set | ||
* if `pat` and `repl_kwargs` both equal None | ||
* if `pat` and `repl_kwargs` are both specified | ||
|
||
Notes | ||
----- | ||
|
@@ -1456,6 +1462,15 @@ def replace( | |
|
||
Examples | ||
-------- | ||
When `repl_kwargs` is a dictionary, every key in `repl_kwargs` is replaced | ||
with its corresponding value: | ||
|
||
>>> pd.Series(['A', 'B', np.nan]).str.replace(repl_kwargs={'A': 'a', 'B': 'b'}) | ||
0 a | ||
1 b | ||
2 NaN | ||
dtype: object | ||
|
||
When `pat` is a string and `regex` is True, the given `pat` | ||
is compiled as a regex. When `repl` is a string, it replaces matching | ||
regex patterns as with :meth:`re.sub`. NaN value(s) in the Series are | ||
|
@@ -1518,8 +1533,19 @@ def replace( | |
2 NaN | ||
dtype: object | ||
""" | ||
if pat is None and repl_kwargs is None: | ||
raise ValueError( | ||
"Cannot replace a string without specifying a string to be modified." | ||
) | ||
|
||
if pat is not None and repl_kwargs is not None: | ||
raise ValueError( | ||
"Cannot replace a string using both a pattern and <key : value> " | ||
"combination." | ||
) | ||
|
||
# Check whether repl is valid (GH 13438, GH 15055) | ||
if not (isinstance(repl, str) or callable(repl)): | ||
if pat and not (isinstance(repl, str) or callable(repl)): | ||
raise TypeError("repl must be a string or callable") | ||
|
||
is_compiled_re = is_re(pat) | ||
|
@@ -1539,10 +1565,21 @@ def replace( | |
if case is None: | ||
case = True | ||
|
||
result = self._data.array._str_replace( | ||
pat, repl, n=n, case=case, flags=flags, regex=regex | ||
) | ||
return self._wrap_result(result) | ||
if repl_kwargs: | ||
res_output = self._data | ||
for key, value in repl_kwargs.items(): | ||
result = res_output.array._str_replace( | ||
key, str(value), n=n, case=case, flags=flags, regex=regex | ||
rhshadrach marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
res_output = self._wrap_result(result) | ||
|
||
else: | ||
result = self._data.array._str_replace( | ||
pat, repl, n=n, case=case, flags=flags, regex=regex | ||
) | ||
res_output = self._wrap_result(result) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe you can call _wrap_result just once at the end, rather than inside the for loop. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rhshadrach wouldn't you need the for loop in the case that pat contained multiple key : value pairs of strings to be replaced? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Correct - I'm not suggesting to remove the for loop entirely. Just to call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So after doing a bit of debugging it looks like we need to call _wrap_result after each iteration so we can save the output of our string replace and update self._data is a Series and There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see - thanks for checking! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. rather than doing the loop here, is there any immediate advantage to passing the dict onto the arrays IIUC the accessors should only be validating the passed parameters, defining the "pandas string API", providing the documentation and wrapping the array result into a Series. IMO the implementation should be at array level and then can be overridden if the array types can be optimized or use native methods. For example, maybe using "._str_map" could be faster for object type and maybe The array level optimizations need not be in this PR though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea, but agreed it need not be here (this is perf neutral compared to the status quo). If not tackled here, we can throw up an issue noting the performance improvement. @rmhowe425 - thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed! I think this idea is deserving of a separate issue. Happy to work that issue as well! |
||
|
||
return res_output | ||
|
||
@forbid_nonstring_types(["bytes"]) | ||
def repeat(self, repeats): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of adding a new keyword, we want to allow
pat
to be a dictionary (in which caserepl
must beNone
).