How does + or * affect the preceding token in a regular expression?

Guys, my question is not just about + or * in regex. I’m really interested in their effect on the preceding token. In my example, the quantifier on \w affects the result of \W?, and I want to understand why. Please don’t close the question without answering.

I hope you won’t close the question, because I tried to find an answer but found no similar case explained.
I have read the documentation and watched a lot of videos about regex, but I still can’t understand one simple issue.

Why do the two lines below return different output?
I mean, if we use +, which means one or more letters or digits, the match stops at the last letter of abcdef, but if we use *, which means zero or more, it returns "=" too.
But why does \w+ or \w* affect the output of the preceding \W?? I mean, the character "=" should be returned because of "\W?", so why does it depend on the subsequent \w?
Thanks!

import re

# Raw strings keep \w and \W from being read as string escapes.
print(re.search(r"\w+\W?\w+", "abcdef==ncabcd"))
print(re.search(r"\w+\W?\w*", "abcdef==ncabcd"))

<re.Match object; span=(0, 6), match='abcdef'>
<re.Match object; span=(0, 7), match='abcdef='>

Solution:

This is a deceptively interesting case. The regex engine matches as much as possible for each token, but it backtracks if a later token cannot match given what the earlier tokens consumed.

To explain, you will need to examine exactly what the engine does at each step:

(Hyphens indicate what the engine has matched up to that point.)

\w+\W?\w+

  • Step 1: The \w+ matches as many word characters as it can with a minimum of one.
abcdef==ncabcd
------
  • Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
  • Step 3: The \w+ matches as many word characters as it can with a minimum of one. However, the next character is the second "=", so nothing matches:
abcdef==ncabcd
-------X
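  • Step 4: The previous step didn’t find a valid match, so the engine backtracks. It first lets \W? match nothing, but the final \w+ still fails on the first "=". It then makes the first \w+ give back one character, leaving abcde: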
abcdef==ncabcd
-----
  • Step 5: Re-apply the check for \W?. The next character is f, a word character, so none is found, but the "?" marks it as optional so we can safely continue:
abcdef==ncabcd
-----
  • Step 6: Re-apply the check for \w+, and this time it matches the f:
abcdef==ncabcd
------
  • Step 7: The expression is satisfied, resulting in a match of abcdef.
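The steps above can be replayed by hand in a few lines of Python. This is only a simplified sketch of the order in which a backtracking engine tries things, not how the re module is actually implemented. The function name trace_plus_pattern is made up for illustration, and it only looks at a match starting at index 0, which is enough for this input.

```python
import re

def trace_plus_pattern(text):
    """Replay greedy-then-backtrack attempts for the pattern ending in +."""
    attempts = []
    start = re.match(r"\w+", text)
    if start is None:                      # no word character at index 0
        return attempts
    longest = start.end()                  # most the first \w+ can take
    for first_len in range(longest, 0, -1):    # give back one character per round
        rest = text[first_len:]
        # \W? is greedy: try to take one non-word character first, then none.
        sep_options = [1, 0] if re.match(r"\W", rest) else [0]
        for sep_len in sep_options:
            tail = re.match(r"\w+", rest[sep_len:])  # final \w+ needs at least one
            attempts.append(
                (text[:first_len], rest[:sep_len], tail.group() if tail else None)
            )
            if tail:
                return attempts            # first success wins
    return attempts

for first, sep, tail in trace_plus_pattern("abcdef==ncabcd"):
    print(repr(first), repr(sep), "FAIL" if tail is None else repr(tail))
# 'abcdef' '=' FAIL
# 'abcdef' '' FAIL
# 'abcde' '' 'f'
```

The last line is the split the engine settles on: the first \w+ keeps abcde, \W? matches nothing, and the final \w+ takes the f.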

\w+\W?\w*

  • Step 1: The \w+ matches as many word characters as it can with a minimum of one.
abcdef==ncabcd
------
  • Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
  • Step 3: The \w* matches as many word characters as it can with no minimum. The next character is the second "=", so it matches zero characters, which is allowed:
abcdef==ncabcd
-------
  • Step 4: The expression is satisfied, resulting in a match of abcdef=.
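One way to check both walkthroughs from Python is to wrap each token in a capture group and look at what each one ended up with. The parentheses are added here only for inspection and do not change what the patterns match. The helper name split_match is made up for this sketch.

```python
import re

text = "abcdef==ncabcd"

def split_match(pattern, text):
    """Return the whole match plus what each capture group consumed."""
    m = re.search(pattern, text)
    return m.group(0), m.groups()

plus_result = split_match(r"(\w+)(\W?)(\w+)", text)
star_result = split_match(r"(\w+)(\W?)(\w*)", text)

print(plus_result)  # ('abcdef', ('abcde', '', 'f'))
print(star_result)  # ('abcdef=', ('abcdef', '=', ''))
```

With the trailing +, the first group had to give up the f so the last group could have its required character, and \W? ended up empty. With the trailing *, nothing had to be given back, so \W? keeps the "=".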

To see this in action, you can go to https://regex101.com/r/d3ObCZ/1 and select the "Regex Debugger" on the left to see what the engine is doing step by step.
