Find a string between two substrings, BUT the end of the first is the start of the next one

So I have a string that goes like this:

...<p><noop><fademusic:23,0><26:1><wait:30> <speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District  Court<b>Defendant Lobby No. 2<color:0><p><hidetextbox:1><5D:0> <speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7> <person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0> <name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0><p> <hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31> Wright!<p>...

What I need: everything between consecutive <p> tags. (Note that each closing <p> is also the opening <p> of the next segment.)

My code:

...
filetext = open(fn).read()
tag = '<p>'
result = re.findall(tag+"(.*?)"+tag,filetext,re.DOTALL)
print(result)
...

Expected output:

['<noop><fademusic:23,0><26:1><wait:30>\n<speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District \nCourt<b>Defendant Lobby No. 2<color:0>', '<hidetextbox:1><5D:0>\n<speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7>\n<person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0>\n<name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0>', '\n<hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31>\nWright!']

Resulting output:

['<noop><fademusic:23,0><26:1><wait:30>\n<speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District \nCourt<b>Defendant Lobby No. 2<color:0>', '\n<hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31>\nWright!']

Solution:

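The regex skips the middle segment because re.findall returns non-overlapping matches: each match consumes its closing <p>, so that tag is no longer available to open the next match. There is no need for the re module here; just use str.split('<p>'). If the string starts or ends with <p>, the result has an empty string at that end, which you may not want. This version removes them: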

s = '<p><noop><fademusic:23,0><26:1><wait:30> <speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District  Court<b>Defendant Lobby No. 2<color:0><p><hidetextbox:1><5D:0> <speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7> <person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0> <name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0><p> <hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31> Wright!<p>'
result = s.split('<p>')
# Drop the empty string that a leading or trailing '<p>' leaves at either end.
for n in (0, -1):
    if result and not result[n]:
        del result[n]
print(result)

Output:

['<noop><fademusic:23,0><26:1><wait:30> <speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District  Court<b>Defendant Lobby No. 2<color:0>', '<hidetextbox:1><5D:0> <speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7> <person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0> <name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0>', ' <hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31> Wright!']

Empty strings can also appear in the middle: 'abc<p><p>def'.split('<p>') returns ['abc', '', 'def']. If you don't want any empty strings at all, use:

result = [n for n in s.split('<p>') if n]
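
If you would rather keep the regex approach from the question, one option is to make the closing tag a lookahead so that it is checked but not consumed. The sketch below is an illustration only: the helper name between_tags and the sample string are made up for this example and are not part of the original code.

```python
import re

def between_tags(text, tag="<p>"):
    """Return every run of text that sits between two consecutive tags."""
    t = re.escape(tag)
    # (?=...) is a lookahead: it requires the closing tag to follow
    # without consuming it, so the same tag can open the next match.
    return re.findall(t + r"(.*?)(?=" + t + r")", text, re.DOTALL)

# Hypothetical sample, shortened from the string in the question.
sample = "<p>one<p>two\nlines<p>three<p>"
print(between_tags(sample))  # ['one', 'two\nlines', 'three']
```

Unlike str.split, this drops any text before the first <p> and after the last <p>, because only text enclosed by two tags can match.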