Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: "str.contains" match groups UserWarning discourages best practices & slows performance #56798

Open
3 tasks done
bionicles opened this issue Jan 9, 2024 · 2 comments
Open
3 tasks done
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@bionicles
Copy link

bionicles commented Jan 9, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

test_regex_groups_warning.py

import pandas as pd


fruits = pd.Series(["apple", "banana", "apple and banana"])


def test_user_warning_in_series():
    print("fruits:\n", fruits)
    print("UNNAMED REGEX")  # less accessible
    nand_query = r"^(?!(?=.*apple)(?=.*banana)).*$"
    print(f"'{nand_query}'")
    filtered = fruits[fruits.str.contains(nand_query)]
    print("filtered:\n", filtered)
    nand_condition = ~(
        filtered.str.contains("head", case=False)
        & filtered.str.contains("neck", case=False)
    )
    print("nand_condition:\n", nand_condition)
    if not nand_condition.all():
        failures = filtered[~nand_condition]
        print("failures:\n", failures)
    assert nand_condition.all()
    assert 0


def test_user_warning_with_names():
    print("fruits:\n", fruits)
    print("NAMED REGEX")  # more clear about what is matched and why:
    named_nand_query = r"(?P<nand>^(?P<not_>(?!(?P<and_>(?=.*apple)(?=.*banana)))).*$)"
    print(f"'{named_nand_query}'")
    filtered_named = fruits[fruits.str.contains(named_nand_query)]
    print("filtered_named:\n", filtered_named)
    named_nand_condition = ~(
        filtered_named.str.contains("head", case=False)
        & filtered_named.str.contains("neck", case=False)
    )
    print("named_nand_condition:\n", named_nand_condition)
    if not named_nand_condition.all():
        named_failures = filtered_named[~named_nand_condition]
        print("named_failures:\n", named_failures)
    assert named_nand_condition.all()
    assert 0

Issue Description

image

@phofl wrote re: PR 56763 to remove this warning

Please open issues to facilitate a discussion before you open a PR

Sorry, my bad, here is an issue to discuss the warning I proposed to remove.

PRE-TL;DR:

  • Named groups aid regex understanding and maintenance
  • The warning logic also slows down pandas regex perf

result of running reproducible example:

image

This warning is annoying and serves to discourage use of named capturing groups (which are more maintainable) because users must either switch to extracting the groups (not always necessary) or replace their named groups with "(?:" (noncapturing groups are harder to maintain because it's less clear what is their objective within a greater regex pattern).

If users need to specialize their regex patterns to each command, then they need to maintain multiple copies, some with non-captured groups, some without, just to silence some warning, also, if they remove the groups, then later on when they want to use them, they might have to figure out how to replace groups they removed just to silence a warning, and be frustrated.

The logical condition for the warning also compiles the pattern but then the compiled pattern is discarded, so this warning slows down every "contains" in pandas just to check if we should annoy people who probably know what they're doing.

If we remove this unnecessary warning, then we no longer discourage users who use named capturing groups, thus facilitating readability of the patterns, and portability to other contexts, such as debuggers or the "extract" method mentioned in the removed warning.

TL;DR: This warning doesn't need to exist, discourages best practices, and slows performance of every string contains query in pandas (a super "hot path!"), so I suggest we remove it.

image

here is a permalink to the line of code to check for the warning condition

        dtype: bool
        """
-        if regex and re.compile(pat).groups: # compiled and discarded the pattern
-            warnings.warn(
-                "This pattern is interpreted as a regular expression, and has "
-                "match groups. To actually get the groups, use str.extract.",
-                UserWarning,
-                stacklevel=find_stack_level(),
-            )

        result = self._data.array._str_contains(pat, case, flags, na, regex)
        return self._wrap_result(result, fill_value=na, returns_string=False)

just compile the pattern and use the compiled pattern:

        dtype: bool
        """
        result = self._data.array._str_contains(pat, case, flags, na, regex)
        return self._wrap_result(result, fill_value=na, returns_string=False)
``
optionally (may cause issues depending on _str_contains logic:
```diff
- result = self._data.array._str_contains(pat, case, flags, na, regex)
+ compiled_pattern = re.compile(pat, flags)
+ result = self._data.array._str_contains(compiled_pattern, case, flags, na, regex)

Expected Behavior

no warning for containment queries using groups

Installed Versions

python
Python 3.10.9 (main, Mar 1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import pandas as pd
pd.show_versions()
/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
Traceback (most recent call last):
File "", line 1, in
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 141, in show_versions
deps = _get_dependency_info()
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 98, in _get_dependency_info
mod = import_optional_dependency(modname, errors="ignore")
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/compat/_optional.py", line 132, in import_optional_dependency
module = importlib.import_module(name)
File "/home/bion/miniconda3/envs/py310/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/init.py", line 42, in
from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/init.py", line 3, in
from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/decorators.py", line 3, in
from numba.np.ufunc import _internal
SystemError: initialization of _internal failed without raising an exception

``` > pip show pandas Name: pandas Version: 2.1.4 Summary: Powerful data structures for data analysis, time series, and statistics Home-page: https://pandas.pydata.org Author: Author-email: The Pandas Development Team License: BSD 3-Clause License
    Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
    All rights reserved.

    Copyright (c) 2011-2023, Open source contributors.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice, this
      list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice,
      this list of conditions and the following disclaimer in the documentation
      and/or other materials provided with the distribution.

    * Neither the name of the copyright holder nor the names of its
      contributors may be used to endorse or promote products derived from
      this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
    DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
    FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
    SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
    CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
    OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Location: /home/bion/miniconda3/envs/py310/lib/python3.10/site-packages
Requires: numpy, python-dateutil, pytz, tzdata

@bionicles bionicles added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 9, 2024
@bionicles
Copy link
Author

follow up: for easy wins in pandas performance, a review of all the logical conditions for warning in pandas might be warranted, if we're compiling stuff just to check if we ought to warn people, then we could remove these warnings where possible to speed everything up and simplify the codebase

@bionicles bionicles changed the title BUG: Match Groups UserWarning BUG: "str.contains" match groups UserWarning discourages best practices & slows performance Jan 9, 2024
@rhshadrach
Copy link
Member

Thanks for the report. There can be a performance hit on the actual operation when using capturing vs non-capturing groups however. I think this might be another good case for #55385.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

2 participants