mixing named and numbered capture groups inside branch-reset causes wrong matches inside groups #425

Open
mrabarnett opened this issue Sep 28, 2021 · 2 comments
Labels
bug Something isn't working major

Comments

@mrabarnett
Owner

Original report by Anonymous.


Attempting to match BUG! with the following regex causes the named group <bug> to store the value of the numbered group \1.

(?|(?P<bug>xxx)(!)
  |(?P<bug>BUG)(!)
)

=> <bug> will contain ! instead of BUG
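
A minimal reproduction sketch, assuming the third-party regex module is installed. The pattern is the one above written on a single line so that no verbose flag is needed; the helper name `reproduce` and the dictionary keys are made up for illustration. No output is shown here because it depends on the installed version.

```python
try:
    import regex  # third-party module this issue is about (pip install regex)
except ImportError:
    regex = None  # lets the sketch load even where regex is not installed

# The reported pattern, collapsed onto one line.
PATTERN = r"(?|(?P<bug>xxx)(!)|(?P<bug>BUG)(!))"


def reproduce(text="BUG!"):
    """Return the captures for `text`, or None if regex is missing or nothing matched."""
    if regex is None:
        return None
    compiled = regex.compile(PATTERN)
    match = compiled.search(text)
    if match is None:
        return None
    return {
        "named bug": match.group("bug"),          # reported as '!' instead of 'BUG'
        "numbered": match.groups(),               # all numbered captures
        "groupindex": dict(compiled.groupindex),  # name -> group number
    }


if __name__ == "__main__":
    print(reproduce())
```

Comparing "named bug" against "numbered" shows which group number the name has been bound to.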

A second bug can be seen when we change the regex to not redefine <bug>; then, the numbered group \1 gets dropped altogether.

A file with test cases is attached.

Mixing named and numbered groups is probably not something most people do intentionally (although I guess there could be applications for it); it usually happens when the non-capturing (?:) boilerplate is accidentally omitted.

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


That's an interesting issue, but is it a bug?

The rule is that groups are numbered consecutively 1, 2, 3, etc, but named groups with the same name share the same group number.

The problem here is that the branch reset is restarting the numbering and it's not skipping over group numbers that have already been used in that branch. The question is whether it should.

To give another example, if the second branch was (BUG)(?<bug>!), then, according to the rule, (BUG) would be group 1 because it's the first group in the branch and (?<bug>!) would also be group 1 because it's a named group that’s already defined as group 1.

I'll need to see how other implementations handle the question.

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


re: Compiled patterns have a groupindex attribute that maps group names to group numbers, so a group name can be linked to only 1 group number. This must also be true of regex in order to maintain compatibility. Branch reset is not supported.

Perl: Both branches of a branch reset number consecutively, and a group name can be linked to more than 1 group number. This behaviour is incompatible with re, and therefore regex.

PCRE2: Both branches of a branch reset will number consecutively, but a group number can be linked to only 1 group name, so a pattern such as (?|(?<AA>aa)|(?<BB>bb)) will cause an error.

C#: Unnamed groups are numbered first, followed by named groups. This behaviour differs from re, and therefore regex. Branch reset is not supported.
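
The re constraint described above can be checked with the standard library alone. This is a small sketch; the group names `first` and `third` and the variable names are arbitrary choices for the example.

```python
import re

# groupindex maps each group name to exactly one group number.
pattern = re.compile(r"(?P<first>a)(b)(?P<third>c)")
index = dict(pattern.groupindex)  # {'first': 1, 'third': 3}

# re refuses a second definition of the same name, even in another alternative,
# so one name can never point at two group numbers.
try:
    re.compile(r"(?P<bug>xxx)|(?P<bug>BUG)")
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False

print(index, duplicate_allowed)
```

Because regex has to expose the same groupindex mapping, whatever numbering rule is chosen must leave every name tied to a single number.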

So, what should the rule be? How should group numbers be assigned in the examples below?

(?|(?P<bug>xxx)(!)|(?P<bug>BUG)(!))  (currently 1, 2, 1, 1)
(?|(?P<bug>xxx)(!)|(BUG)(?P<bug>!))  (currently 1, 2, 1, 1)
(?|(xxx)(?P<bug>!)|(?P<bug>BUG)(!))  (currently 1, 2, 2, 1)
(?|(xxx)(?P<bug>!)|(BUG)(?P<bug>!))  (currently 1, 2, 1, 2)

Some options are:

  1. Number consecutively (current behaviour).
  2. Number consecutively, but skip group numbers that have been used up to that point in the branch.
  3. Number consecutively, but skip group numbers that have been used anywhere in that branch.
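
To compare the three options on the four example patterns, here is a toy model in Python. It is not the regex module's implementation: it only assumes that numbering restarts in each branch and that a name, once defined, keeps its number. The function `number_groups` and the list-of-names encoding of a branch are hypothetical, and the reading of options 2 and 3 is one interpretation of the wording above.

```python
from typing import Dict, List, Optional, Set

Branch = List[Optional[str]]  # one entry per group; None means unnamed


def number_groups(branches: List[Branch], option: int) -> List[int]:
    """Assign group numbers across the branches of one branch reset."""
    names: Dict[str, int] = {}  # group name -> group number, like groupindex
    result: List[int] = []
    for branch in branches:
        counter = 0  # branch reset: numbering restarts here
        used: Set[int] = set()
        if option == 3:
            # Reserve the numbers of known names used anywhere in this branch.
            used = {names[n] for n in branch if n in names}
        for name in branch:
            if name is not None and name in names:
                number = names[name]  # a known name keeps its number
            else:
                counter += 1
                while option in (2, 3) and counter in used:
                    counter += 1  # skip numbers that are already taken
                number = counter
                if name is not None:
                    names[name] = number
            if option == 2:
                used.add(number)  # only numbers seen so far in this branch
            result.append(number)
    return result


examples = [
    [["bug", None], ["bug", None]],  # (?|(?P<bug>xxx)(!)|(?P<bug>BUG)(!))
    [["bug", None], [None, "bug"]],  # (?|(?P<bug>xxx)(!)|(BUG)(?P<bug>!))
    [[None, "bug"], ["bug", None]],  # (?|(xxx)(?P<bug>!)|(?P<bug>BUG)(!))
    [[None, "bug"], [None, "bug"]],  # (?|(xxx)(?P<bug>!)|(BUG)(?P<bug>!))
]
for example in examples:
    print([number_groups(example, option) for option in (1, 2, 3)])
```

Under this model, option 1 reproduces the "currently" numbers listed above. Option 2 changes only the first example (to 1, 2, 1, 2) and still leaves (BUG) and <bug> sharing group 1 in the second, while option 3 also separates them in the second example (1, 2, 2, 1).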
