Regex doesn't match subgroups as expected

I am writing a C program that makes use of regex to detect a string like:

hl # # #

where each # stands for an integer. Up to 11 integer values should be captured.

I have written the following regular expression for this, with a sample value below it:

#define HARDLINK_MATCH_EXPRESSION "^hl( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*( ([1-9][0-9]*))*"
char *sampleValue = "hl 1 2 3 4 5 6 7 8 9 10 11" ;

However, in both the C program and at https://regex101.com/ I find that the string matches, but only one integer (the last one) is "captured." My thinking is that this defines a series of eleven greedy "outer" groups, and I expected that each group would be satisfied by this example.

I have pored through uncounted solutions to this sort of problem, and though I’m certain my question is probably not unique, I couldn’t find a match in the multitude of answers.

Can someone explain why the expression captures only the final integer rather than all of them (up to the 11 captures the expression allows)?

EDIT

While I still would like the first question answered, my real question is whether there is a way to capture all of the groups in the sample.

Thanks.

Solution:

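The hl( ([1-9][0-9]*))* portion of your pattern matches the entire string, because that first starred group repeats once per integer. A repeated capturing group keeps only the text of its last iteration, which is why only the final integer shows up. The remaining ten starred groups each match zero times and capture nothing. So only the first pair of captures (outer and inner) holds anything.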
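A minimal sketch of the capture behavior in question, assuming the C program uses POSIX regcomp/regexec from <regex.h> (the question does not name the regex library). The helper name last_capture is hypothetical. It matches a single starred group and reports what the inner capture holds afterwards.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: match `input` against one starred group and copy the
 * text of the inner capture (group 2) into `out`.
 * Returns 0 on success, -1 if there is no match, group 2 is unset,
 * or `out` is too small. */
int last_capture(const char *input, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[3];   /* m[0] whole match, m[1] outer group, m[2] inner group */
    int rc = -1;

    if (regcomp(&re, "^hl( ([1-9][0-9]*))*", REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, input, 3, m, 0) == 0 && m[2].rm_so != -1) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
        if (len < outsz) {
            memcpy(out, input + m[2].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Called with "hl 1 2 3", this copies "3" into out: the group matched three times, but only the last iteration is reported.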

Changing each ( ([1-9][0-9]*))* to ( ([1-9][0-9]*))? would solve the problem. However, I would write the pattern as follows:

^
hl
(?:
   [ ] (0|[1-9][0-9]*)
   (?:
      [ ] (0|[1-9][0-9]*)
      (?:
         [ ] (0|[1-9][0-9]*)
         (?:
            [ ] (0|[1-9][0-9]*)
            (?:
               [ ] (0|[1-9][0-9]*)
               (?:
                  [ ] (0|[1-9][0-9]*)
                  (?:
                     [ ] (0|[1-9][0-9]*)
                     (?:
                        [ ] (0|[1-9][0-9]*)
                        (?:
                           [ ] (0|[1-9][0-9]*)
                           (?:
                              [ ] (0|[1-9][0-9]*)
                              (?:
                                 [ ] (0|[1-9][0-9]*)
                              )?
                           )?
                        )?
                     )?
                  )?
               )?
            )?
         )?
      )?
   )?
)?

I wrote it as if whitespace is allowed (as per the x flag in some engines) for readability. If you don’t have that luxury, it compresses into the following:

^hl(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*)(?: (0|[1-9][0-9]*))?)?)?)?)?)?)?)?)?)?)?

Notes:

  • I removed the extra captures. You had twice as many as desired.
  • Using (?: [ ] (0|[1-9][0-9]*) (?: ... )? )? instead of (?: [ ] (0|[1-9][0-9]*) )? (?: ... )? makes more sense, and it can greatly reduce backtracking on a failed match.
  • I changed the pattern to allow a 0. Replace 0|[1-9][0-9]* with [1-9][0-9]* to continue disallowing it.
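For a C program, here is a rough sketch of the nested approach, assuming POSIX regcomp/regexec from <regex.h>. POSIX extended regular expressions do not define the (?: ... ) syntax, so this version uses ordinary capturing groups throughout, which puts integer k in group 2k. The names parse_hardlink and MAX_INTS are hypothetical, and the pattern is built in a loop only to avoid typing it out eleven times.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define MAX_INTS 11

/* Hypothetical helper: parse "hl # # ..." and store up to MAX_INTS values
 * in `values`. Returns the number of integers captured, or -1 if the input
 * does not match or the pattern fails to compile. */
int parse_hardlink(const char *input, long values[MAX_INTS])
{
    char pattern[256] = "^hl";
    regex_t re;
    regmatch_t m[1 + 2 * MAX_INTS];
    int count = 0;

    /* Build the nested pattern: open MAX_INTS groups, then close each with ")?". */
    for (int i = 0; i < MAX_INTS; i++)
        strcat(pattern, "( (0|[1-9][0-9]*)");
    for (int i = 0; i < MAX_INTS; i++)
        strcat(pattern, ")?");

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, input, 1 + 2 * MAX_INTS, m, 0) != 0) {
        regfree(&re);
        return -1;
    }

    /* Integer k (1-based) sits in group 2k; stop at the first unset group. */
    for (int k = 1; k <= MAX_INTS; k++) {
        if (m[2 * k].rm_so == -1)
            break;
        values[count++] = strtol(input + m[2 * k].rm_so, NULL, 10);
    }
    regfree(&re);
    return count;
}
```

Like the pattern above, this does not anchor the end of the string, so trailing text after the last captured integer is ignored rather than rejected.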