You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Please open issues to facilitate a discussion before you open a PR
Sorry, my bad, here is an issue to discuss the warning I proposed to remove.
PRE-TL;DR:
Named groups aid regex understanding and maintenance
The warning logic also slows down pandas regex perf
result of running reproducible example:
This warning is annoying and serves to discourage use of named capturing groups (which are more maintainable) because users must either switch to extracting the groups (not always necessary) or replace their named groups with "(?:" (noncapturing groups are harder to maintain because it's less clear what is their objective within a greater regex pattern).
If users need to specialize their regex patterns to each command, then they need to maintain multiple copies, some with non-captured groups, some without, just to silence some warning, also, if they remove the groups, then later on when they want to use them, they might have to figure out how to replace groups they removed just to silence a warning, and be frustrated.
The logical condition for the warning also compiles the pattern but then the compiled pattern is discarded, so this warning slows down every "contains" in pandas just to check if we should annoy people who probably know what they're doing.
If we remove this unnecessary warning, then we no longer discourage users who use named capturing groups, thus facilitating readability of the patterns, and portability to other contexts, such as debuggers or the "extract" method mentioned in the removed warning.
TL;DR: This warning doesn't need to exist, discourages best practices, and slows performance of every string contains query in pandas (a super "hot path!"), so I suggest we remove it.
dtype: bool
"""
- if regex and re.compile(pat).groups: # compiled and discarded the pattern- warnings.warn(- "This pattern is interpreted as a regular expression, and has "- "match groups. To actually get the groups, use str.extract.",- UserWarning,- stacklevel=find_stack_level(),- )
result = self._data.array._str_contains(pat, case, flags, na, regex)
return self._wrap_result(result, fill_value=na, returns_string=False)
just compile the pattern and use the compiled pattern:
dtype: bool
"""
result = self._data.array._str_contains(pat, case, flags, na, regex)
return self._wrap_result(result, fill_value=na, returns_string=False)
``
optionally (may cause issues depending on _str_contains logic:
```diff
- result = self._data.array._str_contains(pat, case, flags, na, regex)+ compiled_pattern = re.compile(pat, flags)+ result = self._data.array._str_contains(compiled_pattern, case, flags, na, regex)
Expected Behavior
no warning for containment queries using groups
Installed Versions
python
Python 3.10.9 (main, Mar 1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
import pandas as pd
pd.show_versions()
/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
Traceback (most recent call last):
File "", line 1, in
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 141, in show_versions
deps = _get_dependency_info()
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 98, in _get_dependency_info
mod = import_optional_dependency(modname, errors="ignore")
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/compat/_optional.py", line 132, in import_optional_dependency
module = importlib.import_module(name)
File "/home/bion/miniconda3/envs/py310/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/init.py", line 42, in
from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/init.py", line 3, in
from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/decorators.py", line 3, in
from numba.np.ufunc import _internal
SystemError: initialization of _internal failed without raising an exception
```
> pip show pandas
Name: pandas
Version: 2.1.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author:
Author-email: The Pandas Development Team
License: BSD 3-Clause License
Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.
Copyright (c) 2011-2023, Open source contributors.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
follow up: for easy wins in pandas performance, a review of all the logical conditions for warning in pandas might be warranted, if we're compiling stuff just to check if we ought to warn people, then we could remove these warnings where possible to speed everything up and simplify the codebase
bionicles
changed the title
BUG: Match Groups UserWarning
BUG: "str.contains" match groups UserWarning discourages best practices & slows performance
Jan 9, 2024
Thanks for the report. There can be a performance hit on the actual operation when using capturing vs non-capturing groups however. I think this might be another good case for #55385.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
@phofl wrote re: PR 56763 to remove this warning
Sorry, my bad, here is an issue to discuss the warning I proposed to remove.
PRE-TL;DR:
result of running reproducible example:
This warning is annoying and serves to discourage use of named capturing groups (which are more maintainable) because users must either switch to extracting the groups (not always necessary) or replace their named groups with "(?:" (noncapturing groups are harder to maintain because it's less clear what is their objective within a greater regex pattern).
If users need to specialize their regex patterns to each command, then they need to maintain multiple copies, some with non-captured groups, some without, just to silence some warning, also, if they remove the groups, then later on when they want to use them, they might have to figure out how to replace groups they removed just to silence a warning, and be frustrated.
The logical condition for the warning also compiles the pattern but then the compiled pattern is discarded, so this warning slows down every "contains" in pandas just to check if we should annoy people who probably know what they're doing.
If we remove this unnecessary warning, then we no longer discourage users who use named capturing groups, thus facilitating readability of the patterns, and portability to other contexts, such as debuggers or the "extract" method mentioned in the removed warning.
TL;DR: This warning doesn't need to exist, discourages best practices, and slows performance of every string contains query in pandas (a super "hot path!"), so I suggest we remove it.
here is a permalink to the line of code to check for the warning condition
just compile the pattern and use the compiled pattern:
Expected Behavior
no warning for containment queries using groups
Installed Versions
Location: /home/bion/miniconda3/envs/py310/lib/python3.10/site-packages
Requires: numpy, python-dateutil, pytz, tzdata
The text was updated successfully, but these errors were encountered: