How does + or * affect the preceding token in a regular expression?

My question is not just about + or * in regex. I’m interested in how they affect the preceding token. In my example, the quantifier on \w changes the result of \W, and I want to understand why. Please don’t close the question without answering.

I hope you won’t close the question: I tried to find an answer, but could not find a similar case explained anywhere.
I have read the documentation and watched a lot of videos about regex, but I still can’t understand one simple issue.

Why do the two lines below return different output?
With +, which means one or more word characters, the match stops at the last letter of abcdef, but with *, which means zero or more, the match includes "=" too.
But why does the choice of \w+ or \w* change what the preceding \W? matches? The "=" character should be matched by \W?, so why does that depend on the subsequent \w?
Thanks!

import re
print(re.search(r"\w+\W?\w+", "abcdef==ncabcd"))
print(re.search(r"\w+\W?\w*", "abcdef==ncabcd"))

<re.Match object; span=(0, 6), match='abcdef'>
<re.Match object; span=(0, 7), match='abcdef='>

>Solution :

This is a deceptively interesting case. The regex engine matches as much as possible for each token (quantifiers are greedy by default), but it backtracks if a subsequent token cannot match given what the earlier tokens consumed.

To explain, you will need to examine exactly what the engine does at each step:

(Hyphens indicate what the engine has matched up to that point.)

\w+\W?\w+

  • Step 1: The \w+ matches as many word characters as it can with a minimum of one.
abcdef==ncabcd
------
  • Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
  • Step 3: The \w+ matches as many word characters as it can with a minimum of one. However, no matching characters are found:
abcdef==ncabcd
-------X
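  • Step 4: The previous step didn’t find a valid match, so the engine backtracks. It first lets \W? match nothing and retries \w+ at the first "=", which also fails. It then makes the first \w+ give back one character (the "f"):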
abcdef==ncabcd
-----
  • Step 5: Re-apply the check for \W?. The next character is "f", which is a word character, so nothing is matched, but the "?" makes the token optional and the engine can safely continue:
abcdef==ncabcd
-----
  • Step 6: Re-apply the check for the final \w+, and this time it matches the "f":
abcdef==ncabcd
------
  • Step 7: The expression is satisfied, resulting in a match of abcdef.
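
One way to check the steps above is to wrap each token in a capture group and look at what each one ended up holding. This is only a quick sketch: the parentheses are added for illustration and are not part of the original pattern.

```python
import re

text = "abcdef==ncabcd"

# Same pattern as \w+\W?\w+, with each token in its own group.
m = re.search(r"(\w+)(\W?)(\w+)", text)

print(m.group(0))  # abcdef
print(m.groups())  # ('abcde', '', 'f')
```

The first \w+ holds only abcde, \W? holds nothing, and the final \w+ holds the f that the first one gave back.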

\w+\W?\w*

  • Step 1: The \w+ matches as many word characters as it can with a minimum of one.
abcdef==ncabcd
------
  • Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
  • Step 3: The \w* matches as many word characters as it can with no minimum. The next character is the second "=", so it matches zero characters, which is allowed:
abcdef==ncabcd
-------
  • Step 4: The expression is satisfied, resulting in a match of abcdef=.
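
The same capture-group check can be applied to this pattern. Again, the groups are added only for illustration.

```python
import re

text = "abcdef==ncabcd"

# Same pattern as \w+\W?\w*, with each token in its own group.
m = re.search(r"(\w+)(\W?)(\w*)", text)

print(m.group(0))  # abcdef=
print(m.groups())  # ('abcdef', '=', '')
```

Here no backtracking is needed: \w* is happy with an empty match, so \W? keeps the "=".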

To see this in action, you can go to https://regex101.com/r/d3ObCZ/1 and select the "Regex Debugger" on the left to see what the engine is doing step by step.
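
If you would rather trace this offline, here is a toy greedy matcher with backtracking that logs every attempt. It is a minimal sketch and not how the re module is actually implemented: match_from, the token tuples, and the log format are all made up for this illustration, and it only tries a match starting at position 0.

```python
import re

def match_from(tokens, text, pos, log):
    """Toy greedy matcher with backtracking (illustration only).

    tokens: list of (single-character class, min count, max count or None).
    Returns the end position of the match, or None if nothing matches at pos.
    """
    if not tokens:
        return pos
    (cls, lo, hi), rest = tokens[0], tokens[1:]
    # Greedy phase: consume as many characters as the class allows.
    n = 0
    while (pos + n < len(text)
           and (hi is None or n < hi)
           and re.fullmatch(cls, text[pos + n])):
        n += 1
    # Backtracking phase: try the longest run first, then give back
    # one character at a time until the rest of the pattern matches.
    while n >= lo:
        log.append((cls, text[pos:pos + n]))
        end = match_from(rest, text, pos + n, log)
        if end is not None:
            return end
        n -= 1
    return None

text = "abcdef==ncabcd"
plus_tokens = [(r"\w", 1, None), (r"\W", 0, 1), (r"\w", 1, None)]  # \w+\W?\w+
star_tokens = [(r"\w", 1, None), (r"\W", 0, 1), (r"\w", 0, None)]  # \w+\W?\w*

for tokens in (plus_tokens, star_tokens):
    log = []
    end = match_from(tokens, text, 0, log)
    print(text[:end])
    for cls, consumed in log:
        print("  tried", cls, "->", repr(consumed))
```

For the + version the log shows the first \w trying abcdef, failing further on, and then succeeding with abcde. For the * version every token succeeds on the first try.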
