I am trying to parse an HTML page using regular expressions. I have to find the sum of all comments on this web page: https://py4e-data.dr-chuck.net/comments_42.html
Everything else is working fine, but `re.findall` is only picking up the second digit of each two-digit number. I cannot figure out why this is happening.
This is my code:

    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup
    import ssl
    import re

    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    code = list()
    html = urllib.request.urlopen("https://py4e-data.dr-chuck.net/comments_42.html", context=ctx)
    for line in html:
        line = line.decode()
        line = line.strip()
        numbers = re.findall("<span.+([0-9]+)", line)
        if len(numbers) != 1:
            continue
        print(numbers)
This is my output: (I am getting 7 instead of 97, and 0 instead of 90)
Regexes are greedy by default (not just in Python, but in basically every regex system I’m aware of), so each variable-length match (e.g. `.+`) in the regex takes as many characters as possible, from left to right, so long as the rest of the pattern can still match what remains. As such, the `.+` in `<span.+([0-9]+)` matches everything up to the very last digit on the line (which must be left for `[0-9]+` to match), so `[0-9]+` can never match more than one character.
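To see the greedy behaviour directly, here is a minimal sketch that adds a second capture group around `.+` so you can inspect what it swallowed. The sample line is an assumption modelled on the markup of the page in the question, not copied from it.

```python
import re

# Hypothetical sample line, assumed to resemble one row of the page.
line = '<tr><td>Romina</td><td><span class="comments">97</span></td></tr>'

# Same pattern as in the question, but with .+ wrapped in its own group
# so we can see how much of the line it consumed.
match = re.search("<span(.+)([0-9]+)", line)

swallowed = match.group(1)  # everything .+ took, including the first digit
leftover = match.group(2)   # the single digit left over for [0-9]+

print(repr(swallowed))
print(repr(leftover))
```

The first group ends with the `9`, and only the `7` is left for the digit group.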
You can solve this in various ways:
If the characters between `span` and the desired digits will never be digits themselves, only match non-digits instead of `.`, e.g. `r"<span[^0-9]+([0-9]+)"` (note: I used an `r` prefix to make that a raw string, which you should always do with Python regex literals to avoid issues with string escapes overlapping regex escapes; it would allow you to safely use `\D` and `\d` in place of `[^0-9]` and `[0-9]` respectively if you liked, and weren’t concerned with non-ASCII digits). The regex is still greedy, and should perform equally well, but it will stop at the first run of digits and capture all of them, rather than capturing only the final digit of the last run of digits.
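A minimal sketch of the negated-character-class fix, applied to a couple of sample lines and then summed as the question requires. The lines and names are hypothetical stand-ins for the real page; only the values 97 and 90 come from the question.

```python
import re

# Hypothetical sample lines, assumed to resemble rows of the page.
lines = [
    '<tr><td>Romina</td><td><span class="comments">97</span></td></tr>',
    '<tr><td>Bayli</td><td><span class="comments">90</span></td></tr>',
    '<h1>This line has no span and is skipped</h1>',
]

pattern = re.compile(r"<span[^0-9]+([0-9]+)")

total = 0
found = []
for line in lines:
    numbers = pattern.findall(line)
    if len(numbers) != 1:
        continue  # same filter as in the question
    found.append(numbers[0])
    total += int(numbers[0])

print(found)  # ['97', '90']
print(total)  # 187
```

Because `[^0-9]+` cannot consume a digit, it has to stop right before the `9`, which leaves the whole number for the capture group.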
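Alternatively, keep the `.` and make the `.+` non-greedy by changing the regex to `r"<span.+?([0-9]+)"`; the `?` after `+` means "match the fewest characters possible", rather than the default greedy "match as many as possible". Note that a non-greedy `.+?` stops at the first run of digits it reaches and captures all of that run, so this gives the number you want only when it is the first run of digits after `<span`. It will typically make the regex run a little slower, but not enough to matter in most cases.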
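A short sketch of the non-greedy variant. The first sample line is an assumption modelled on the page; the second is a made-up case with a digit inside the tag's attributes, included only to show where the lazy match stops.

```python
import re

lazy = re.compile(r"<span.+?([0-9]+)")

# Hypothetical row resembling the page: the only digits are the comment count.
line = '<tr><td>Romina</td><td><span class="comments">97</span></td></tr>'
typical = lazy.findall(line)
print(typical)  # ['97']

# Made-up edge case: a digit appears before the number we actually want.
tricky = '<span class="c1">97</span>'
edge = lazy.findall(tricky)
print(edge)  # ['1']
```

In the second case the lazy `.+?` stops as soon as a digit is available, so it captures the `1` from the class name rather than `97`. For input like the question's page, where the count is the only number on the line, both fixes give the same result.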