How do I use a regex to find lines containing a specific string only when it is not prefixed by another string?

I am parsing some HTML files using a Python script.

As part of this parsing, I need to detect text which contains "<…#id=xxx>" or "<a id=xxx>", but do not want to find text which contains "<….#bshid=xxx>" or "<a bshid=xxx>"

In other words, in the below, I want to find line1 and line2, but NOT line3 or line4

line1='<p><a id=somegoodstuff>SomeGoodStuff</a></p>'
line2='<p><a href="../../../blah/blahblah/abc.htm#id=somegreatstuff">SomeGreatStuff</a></p>'
line3='<p><a bshid=somebadstuff">SomeBadStuff</a></p>'
line4='<p><a href="../../../blah/blahblah/abc.htm#bshid=someawfulstuff">SomeAwfulStuff</a></p>'

I want to do this with minimal changes to the code, which worked fine until "bshid=xxxx" links recently started appearing in some new HTML files. Ideally, the only change would be to the regex that does the check.

Here is the Python script:

import re

id_check = re.compile(r'(<\w+ ([^>]*)id=([^>]*)>)')

line1='<p><a id=somegoodstuff>SomeGoodStuff</a></p>'
line2='<p><a href="../../../blah/blahblah/abc.htm#id=somegreatstuff">SomeGreatStuff</a></p>'
line3='<p><a bshid=somebadstuff">SomeBadStuff</a></p>'
line4='<p><a href="../../../blah/blahblah/abc.htm#bshid=someawfulstuff">SomeAwfulStuff</a></p>'


if id_check.findall(line1):
    print(line1)
if id_check.findall(line2):
    print(line2)
if id_check.findall(line3):
    print(line3)
if id_check.findall(line4):
    print(line4)

I tried the following to exclude "bshid", but excluding the individual characters ‘b’, ‘s’ or ‘h’ before "id" would also reject anything that merely has one of those characters there, which is not what I want:

id_check = re.compile(r'(<\w+ ([^>]*)^[bsh]id=([^>]*)>)')
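
For reference, here is a minimal check of this attempt against the four sample lines. It is only a sketch, and the names `attempt`, `samples` and `matches` are made up for illustration. Outside square brackets, `^` is a start-of-string anchor rather than a negation, so this pattern appears to match none of the lines, including the two wanted ones.

```python
import re

# The attempted pattern, copied as-is. The "^" here sits outside a character
# class, so it anchors to the start of the string instead of negating [bsh].
attempt = re.compile(r'(<\w+ ([^>]*)^[bsh]id=([^>]*)>)')

# The four sample lines from the question (line3's stray quote is kept).
samples = [
    '<p><a id=somegoodstuff>SomeGoodStuff</a></p>',
    '<p><a href="../../../blah/blahblah/abc.htm#id=somegreatstuff">SomeGreatStuff</a></p>',
    '<p><a bshid=somebadstuff">SomeBadStuff</a></p>',
    '<p><a href="../../../blah/blahblah/abc.htm#bshid=someawfulstuff">SomeAwfulStuff</a></p>',
]

# "^" can only succeed at position 0, but "<\w+ " has already consumed
# characters by then, so no line can match.
matches = [bool(attempt.search(s)) for s in samples]
print(matches)  # [False, False, False, False]
```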

>Solution :

You could try the following, which uses a negative lookahead. It rejects a tag only when the tag contains the full string "bshid=", so text with the characters ‘b’, ‘s’ or ‘h’ before "id" is still matched.

id_check = re.compile(r'<\w+ (?![^>]*bshid=)[^>]*id=([^>]*)>')
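
To sanity-check the pattern above, here is a small sketch that runs it over the four sample lines from the question. It also tries a negative lookbehind as a possible alternative. The lookbehind variant and the names `lookahead_check`, `lookbehind_check`, `lines` and `mixed` are assumptions added for illustration, not part of the original script.

```python
import re

# The four sample lines from the question (line3's stray quote is kept).
lines = {
    "line1": '<p><a id=somegoodstuff>SomeGoodStuff</a></p>',
    "line2": '<p><a href="../../../blah/blahblah/abc.htm#id=somegreatstuff">SomeGreatStuff</a></p>',
    "line3": '<p><a bshid=somebadstuff">SomeBadStuff</a></p>',
    "line4": '<p><a href="../../../blah/blahblah/abc.htm#bshid=someawfulstuff">SomeAwfulStuff</a></p>',
}

# Pattern from the answer: skip the whole tag if "bshid=" occurs anywhere in it.
lookahead_check = re.compile(r'<\w+ (?![^>]*bshid=)[^>]*id=([^>]*)>')

# Hypothetical alternative: accept an "id=" only if it is not directly
# preceded by "bsh" (fixed-width negative lookbehind).
lookbehind_check = re.compile(r'<\w+ [^>]*(?<!bsh)id=([^>]*)>')

lookahead_hits = [name for name, text in lines.items() if lookahead_check.search(text)]
lookbehind_hits = [name for name, text in lines.items() if lookbehind_check.search(text)]
print(lookahead_hits)   # ['line1', 'line2']
print(lookbehind_hits)  # ['line1', 'line2']

# The two patterns differ on a tag that carries both attributes.
mixed = '<a bshid=x id=y>'
mixed_lookahead = bool(lookahead_check.search(mixed))    # False: tag contains "bshid="
mixed_lookbehind = bool(lookbehind_check.search(mixed))  # True: the plain "id=" qualifies
print(mixed_lookahead, mixed_lookbehind)
```

Both variants keep the existing `if id_check.findall(line):` checks working. Note that each pattern has a single capture group, so `findall` returns a list of strings rather than the tuples the original three-group pattern produced. Which variant fits better depends on whether a tag containing both "bshid=" and a plain "id=" should count as a match.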
