How to understand a snippet of regex

I am attempting to understand what this snippet of code does:

passwd1=re.sub(r'^.*? --', ' -- ', line)
password=passwd1[4:]

I understand that the top line uses regex to remove the " -- ", and I think the bottom line removes something as well? I came back to this code after a while and need to improve it, but to do that I need to understand it again. I’ve been trying to read the regex docs to no avail. What is the r'^.*? at the beginning of the regex?

>Solution :

To break r'^.*? --' into pieces:

  • r in front of a string literal makes it a raw string: Python leaves backslashes as they are instead of treating them as escape sequences. It is not regex-specific, but it saves you a lot of confusing escaping in patterns. This particular pattern has no backslashes, so the r changes nothing here.
  • The ^ anchors the match at the beginning of the string.
  • .*? matches any characters (except a newline), as few as possible, up to…
  • " --", a literal space followed by two hyphens.
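As a small sketch of the pieces above (the sample line is a made-up input, not taken from the original code), the following checks what the r prefix does and which part of the string the pattern matches:

```python
import re

# A raw string keeps backslashes literally; a normal string interprets them.
raw = r'\n'      # two characters: a backslash and the letter n
cooked = '\n'    # one character: a newline
raw_len = len(raw)
cooked_len = len(cooked)

# Without backslashes in the pattern, the r prefix makes no difference.
same_pattern = (r'^.*? --' == '^.*? --')

# The pattern matches from the start of the string through the first " --".
match = re.match(r'^.*? --', 'Google -- MyPassword')  # hypothetical sample line
matched_text = match.group(0)

print(raw_len, cooked_len, same_pattern, repr(matched_text))
```

The matched text is 'Google --': everything from the start of the string through the first space and two hyphens.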

The sum of this is that it matches everything from the beginning of the string up to and including the first " --". Since it is re.sub(), the matched part of the string is replaced with ' -- ' (a space, two hyphens, and a space).

This is why something like 'Google -- MyPassword' becomes ' --  MyPassword': the four-character replacement ' -- ' followed by the untouched remainder ' MyPassword'.
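To see both lines of the original snippet in action, here is a minimal run-through. The value of line is an assumed example; the real input presumably comes from a file and may carry a trailing newline.

```python
import re

line = 'Google -- MyPassword'  # hypothetical input line

# Step 1: replace everything up to and including the first " --" with " -- ".
passwd1 = re.sub(r'^.*? --', ' -- ', line)

# Step 2: slice off the first four characters, i.e. the " -- " just inserted.
password = passwd1[4:]

print(repr(passwd1))
print(repr(password))
```

Note that the result still starts with the space that followed the -- in the input; calling password.strip() would remove it, along with any trailing newline.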

The second line is a simple string slice, dropping the first four characters of the string, which are exactly the ' -- ' that the first line just inserted. That makes the two steps roundabout: you could just substitute the match with an empty string like this:

passwd1 = re.sub(r'^.* --', '', line)

This achieves the same result for a line that contains a single --. Note I’ve dropped the ?, which makes no difference in that case. It only matters when the line contains more than one --, and I don’t think you need it for your stated purpose.
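A quick check of that equivalence, again on an assumed sample line with a single --:

```python
import re

line = 'Google -- MyPassword'  # hypothetical input line

# Original approach: substitute with " -- ", then slice it off again.
two_step = re.sub(r'^.*? --', ' -- ', line)[4:]

# Simplified approach: substitute the match with an empty string.
one_step = re.sub(r'^.* --', '', line)

# Optional extra, not part of the original code: trim surrounding whitespace.
stripped = one_step.strip()

print(repr(two_step), repr(one_step), repr(stripped))
```

Both approaches leave the space after the -- in place, so they agree character for character.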

On its own, ? matches zero or one of the preceding element, and * matches zero or more. Directly after another quantifier, though, ? has a different job: it makes that quantifier lazy. So .* is a greedy quantifier and .*? a lazy one. That is, the greedy quantifier will match as much as possible, and the lazy will match as little as possible. The difference between ^.*? -- and ^.* -- shows up in a case like this:

Something something -- mypassword -- yourpassword

In the greedy case, everything up to and including the last -- ('Something something -- mypassword --') is matched and deleted, leaving ' yourpassword'. In the lazy case, only 'Something something --' is deleted, leaving ' mypassword -- yourpassword'. Most passwords don’t include spaces, never mind ' -- ', so you probably want to use the greedy version.
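The same comparison as a short sketch, using the example line above:

```python
import re

line = 'Something something -- mypassword -- yourpassword'

# Lazy: stop at the first " --".
lazy = re.sub(r'^.*? --', '', line)

# Greedy: run on to the last " --".
greedy = re.sub(r'^.* --', '', line)

print(repr(lazy))
print(repr(greedy))
```

The lazy version keeps ' mypassword -- yourpassword', while the greedy version keeps only ' yourpassword'.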
