I am parsing some HTML files using a Python script.
As part of this parsing, I need to detect text which contains "<…#id=xxx>" or "<a id=xxx>", but I do not want to find text which contains "<…#bshid=xxx>" or "<a bshid=xxx>".
In other words, in the examples below, I want to find line1 and line2, but NOT line3 or line4:
line1='<p><a id=somegoodstuff>SomeGoodStuff</a></p>'
line2='<p><a href="../../../blah/blahblah/abc.htm#id=somegreatstuff">SomeGreatStuff</a></p>'
line3='<p><a bshid=somebadstuff">SomeBadStuff</a></p>'
line4='<p><a href="../../../blah/blahblah/abc.htm#bshid=someawfulstuff">SomeAwfulStuff</a></p>'
I want to do this with minimal changes to the code, which had been working fine until "bshid=xxxx" links recently started appearing in some new HTML files. Ideally, only the regex doing the check would change.
Here is the Python script:
import re
id_check = re.compile(r'(<\w+ ([^>]*)id=([^>]*)>)')
line1='<p><a id=somegoodstuff>SomeGoodStuff</a></p>'
line2='<p><a href="../../../blah/blahblah/abc.htm#id=somegreatstuff">SomeGreatStuff</a></p>'
line3='<p><a bshid=somebadstuff">SomeBadStuff</a></p>'
line4='<p><a href="../../../blah/blahblah/abc.htm#bshid=someawfulstuff">SomeAwfulStuff</a></p>'
if id_check.findall(line1):
    print(line1)
if id_check.findall(line2):
    print(line2)
if id_check.findall(line3):
    print(line3)
if id_check.findall(line4):
    print(line4)
I have tried to do the following, to exclude "bshid" but it will then exclude anything containing the characters ‘b’, ‘s’ or ‘h’ before "id", so that is not what I want to do:
id_check = re.compile(r'(<\w+ ([^>]*)^[bsh]id=([^>]*)>)')
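A quick check of that attempt may help explain why it does not behave as hoped. Outside square brackets, `^` is a start-of-string anchor rather than a negation, so the pattern below appears unable to match anything once `<\w+ ` has consumed characters. This is only a minimal sketch; the names `attempt`, `samples` and `attempt_results` are made up for illustration.

```python
import re

# The attempted pattern from the question, unchanged.
attempt = re.compile(r'(<\w+ ([^>]*)^[bsh]id=([^>]*)>)')

# Illustrative subset of the sample lines from the question.
samples = [
    '<p><a id=somegoodstuff>SomeGoodStuff</a></p>',
    '<p><a bshid=somebadstuff">SomeBadStuff</a></p>',
]

# Without re.MULTILINE, ^ only matches at position 0, which can never
# be reached after "<\w+ " has already consumed part of the string.
attempt_results = [bool(attempt.search(s)) for s in samples]
print(attempt_results)  # [False, False]
```

Even the "good" line is rejected, which suggests the attempt excludes more than just "bshid".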
>Solution :
You could try a negative lookahead. It rejects a tag only when "bshid=" appears somewhere inside it, so other text containing the characters ‘b’, ‘s’ or ‘h’ before "id" is still matched:
id_check = re.compile(r'<\w+ (?![^>]*bshid=)[^>]*id=([^>]*)>')
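As a sanity check, the lookahead pattern can be run against the four sample lines from the question. The sketch below also tries a negative-lookbehind variant, `(?<!bsh)id=`, which is an assumption of this example and not part of the answer; it differs only when a single tag carries both a "bshid=" and a plain "id=". The names `id_check_lookbehind`, `lines`, `mixed` and `matches` are hypothetical.

```python
import re

# Pattern from the answer: skip any tag that contains "bshid=".
id_check = re.compile(r'<\w+ (?![^>]*bshid=)[^>]*id=([^>]*)>')

# Hypothetical variant (not from the answer): only reject an "id="
# that is immediately preceded by "bsh".
id_check_lookbehind = re.compile(r'<\w+ [^>]*(?<!bsh)id=([^>]*)>')

lines = [
    '<p><a id=somegoodstuff>SomeGoodStuff</a></p>',
    '<p><a href="../../../blah/blahblah/abc.htm#id=somegreatstuff">SomeGreatStuff</a></p>',
    '<p><a bshid=somebadstuff">SomeBadStuff</a></p>',
    '<p><a href="../../../blah/blahblah/abc.htm#bshid=someawfulstuff">SomeAwfulStuff</a></p>',
]

def matches(pattern, texts):
    """Return one boolean per text: does the pattern match anywhere?"""
    return [bool(pattern.search(t)) for t in texts]

print(matches(id_check, lines))             # [True, True, False, False]
print(matches(id_check_lookbehind, lines))  # [True, True, False, False]

# Made-up tag with both attributes, to show where the two differ.
mixed = '<a bshid=x id=y>'
print(bool(id_check.search(mixed)))             # False
print(bool(id_check_lookbehind.search(mixed)))  # True
```

Which behaviour is preferable for a tag with both attributes depends on the files being parsed. Note also that both patterns, like the original, match any attribute name that merely ends in "id".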