My question is not just about + versus * in regex. I'm interested in how the quantifier affects the token that precedes it: in my example, the \w that follows changes what "\W?" matches, and I want to understand why. Please don't close the question without answering.
I tried to find an answer, but could not find a similar case explained.
I have read the documentation and watched a lot of videos about regex, but I still can't understand one simple issue.
Why do the two lines below return different output?
With +, which means one or more letters or digits, the match stops on the last letter of abcdef; with *, which means zero or more, it returns "=" too.
But why does the choice of \w+ or \w* affect what the preceding "\W?" matches? The character "=" should be returned because of "\W?", so why does it depend on the subsequent \w?
Thanks!
import re

print(re.search(r"\w+\W?\w+", "abcdef==ncabcd"))
print(re.search(r"\w+\W?\w*", "abcdef==ncabcd"))
<re.Match object; span=(0, 6), match='abcdef'>
<re.Match object; span=(0, 7), match='abcdef='>
>Solution :
This is a deceptively interesting case. Each quantified token is greedy, so the regex engine matches as much as possible for it, but the engine will backtrack if a subsequent token cannot match after that preliminary choice.
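As a small illustration of those two behaviors in isolation, here is a minimal sketch using Python's re module. The sample strings are chosen only for illustration; the second pattern is not from the question.

```python
import re

# Greedy: \w+ takes every word character it can reach.
greedy = re.match(r"\w+", "abcdef==ncabcd")
print(greedy.group())  # abcdef

# Backtracking: the first group initially takes all six letters, then
# gives one back so that the trailing \w has something to match.
backtracked = re.match(r"(\w+)(\w)", "abcdef")
print(backtracked.groups())  # ('abcde', 'f')
```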
To explain, you will need to examine exactly what the engine does at each step:
(Hyphens indicate what the engine has matched up to that point.)
\w+\W?\w+
- Step 1: The \w+ matches as many word characters as it can, with a minimum of one.
abcdef==ncabcd
------
- Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
- Step 3: The \w+ matches as many word characters as it can, with a minimum of one. However, the next character is the second "=", so no matching characters are found:
abcdef==ncabcd
-------X
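- Step 4: The previous step didn't find a valid match, so the engine backtracks. Dropping the optional "=" alone does not help, because \w+ would still be facing a "=", so the first \w+ gives back its last character, "f":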
abcdef==ncabcd
-----
- Step 5: Re-apply the check for \W?. The next character is "f", so no non-word character is found, but the "?" marks it as optional so we can safely continue:
abcdef==ncabcd
-----
- Step 6: Re-apply the check for \w+, and this time one is found (the "f" that was given back):
abcdef==ncabcd
------
- Step 7: The expression is satisfied, resulting in a match of abcdef.
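One way to check this walkthrough is to wrap each token in a capture group and look at what each one ended up holding. This is a sketch: the parentheses are added only for inspection and are not part of the original pattern.

```python
import re

m1 = re.search(r"(\w+)(\W?)(\w+)", "abcdef==ncabcd")

# The first \w+ gave back "f", \W? matched nothing, and the last \w+ took "f".
print(m1.groups())                         # ('abcde', '', 'f')
print(m1.span(1), m1.span(2), m1.span(3))  # (0, 5) (5, 5) (5, 6)
print(m1.group())                          # abcdef
```

So the "=" is not in the match because, on the path that finally succeeds, \W? matched the empty string.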
\w+\W?\w*
- Step 1: The \w+ matches as many word characters as it can, with a minimum of one.
abcdef==ncabcd
------
- Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
- Step 3: The \w* matches as many word characters as it can with no minimum. The next character is "=", so it matches zero characters, which still counts as success:
abcdef==ncabcd
-------
- Step 4: The expression is satisfied, resulting in a match of abcdef=.
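The same capture-group check can be applied to this pattern. Again, this is a sketch with the parentheses added only for inspection.

```python
import re

m2 = re.search(r"(\w+)(\W?)(\w*)", "abcdef==ncabcd")

# No backtracking was needed: \w+ kept all six letters, \W? kept "=",
# and \w* succeeded by matching zero characters.
print(m2.groups())                         # ('abcdef', '=', '')
print(m2.span(1), m2.span(2), m2.span(3))  # (0, 6) (6, 7) (7, 7)
print(m2.group())                          # abcdef=
```

Because \w* is allowed to match nothing, the engine never has a reason to undo the "=" that \W? consumed.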
To see this in action, you can go to https://regex101.com/r/d3ObCZ/1 and select the "Regex Debugger" on the left to see what the engine is doing step by step.
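If you prefer to stay in Python, the following toy matcher gives a rough idea of the same trace. It is only a sketch and not how the re module is implemented: the names match_here, toy_search, is_word and is_non_word are hypothetical helpers, and it supports nothing beyond a sequence of single-character classes with greedy quantifiers.

```python
from typing import Callable, List, Optional, Tuple

# (label, predicate, minimum repeats, maximum repeats or None for unbounded)
Token = Tuple[str, Callable[[str], bool], int, Optional[int]]


def is_word(ch: str) -> bool:
    # Rough stand-in for \w; adequate for the ASCII sample string.
    return ch.isalnum() or ch == "_"


def is_non_word(ch: str) -> bool:
    return not is_word(ch)


def match_here(tokens: List[Token], text: str, pos: int,
               log: List[str]) -> Optional[int]:
    """Return the end index of a match starting at pos, or None."""
    if not tokens:
        return pos  # every token is satisfied
    label, pred, lo, hi = tokens[0]
    # Greedy phase: count how many characters this token could take.
    count = 0
    while (pos + count < len(text) and pred(text[pos + count])
           and (hi is None or count < hi)):
        count += 1
    if count < lo:
        log.append(f"{label} fails at index {pos}")
        return None
    # Backtracking phase: try the longest option first, then give back
    # one character at a time until the rest of the pattern fits.
    for n in range(count, lo - 1, -1):
        log.append(f"{label} takes {text[pos:pos + n]!r}")
        end = match_here(tokens[1:], text, pos + n, log)
        if end is not None:
            return end
    return None


def toy_search(tokens: List[Token], text: str):
    """Try each start position in turn, like re.search does."""
    log: List[str] = []
    for start in range(len(text) + 1):
        end = match_here(tokens, text, start, log)
        if end is not None:
            return text[start:end], log
    return None, log


first = [(r"\w+", is_word, 1, None), (r"\W?", is_non_word, 0, 1),
         (r"\w+", is_word, 1, None)]
second = [(r"\w+", is_word, 1, None), (r"\W?", is_non_word, 0, 1),
          (r"\w*", is_word, 0, None)]

for tokens in (first, second):
    match, log = toy_search(tokens, "abcdef==ncabcd")
    print(match)
    for line in log:
        print("   ", line)
```

For the first pattern the log shows the engine taking "abcdef" and "=", failing twice, and then retrying with "abcde"; for the second it shows three steps with no retry.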