Capture all characters in single string between regex matches

I have a log file with the following format:

00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%
FOO BAR FOO FOO FOO BAR

00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%
BAR BAR BAR' BAR. FOO.BAR

00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%
BOO.BOO. FARFAR.FAR

What I am trying to do is capture all of the text beneath the log data for each entry, so ultimately, end up with a list looking like:

['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']

I have written the following regular expression and tested that it properly matches the log data:

"\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d side:start top:\d\d% bottom:\d\d% sound:\d\d%"

However, I want to capture the text between these matches, and I am not sure whether a regex is the best approach, compared with iterating through the 123,378-line text file and skipping both blank lines and lines that match the above expression.

What is the most efficient way to return a list of the text after each log entry?

>Solution :

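You can use re.findall with a pattern that matches each header line and then uses a negative lookahead to capture every following line that is not itself a header: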

^\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d).*)*)

import re

pattern = r"^\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d\.\d\d\d ;; \d\d:\d\d:\d\d\.\d\d\d).*)*)"

s = ("00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
            "FOO BAR FOO FOO FOO BAR\n\n"
            "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
            "BAR BAR BAR' BAR. FOO.BAR\n\n"
            "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
            "BOO.BOO. FARFAR.FAR")

res = [x.strip() for x in re.findall(pattern, s, re.M)]
print(res)

Output

['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']
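To make the pieces of the pattern easier to follow, here is a sketch of the same idea written in verbose mode (re.X) with comments. The names TIMESTAMP and sample are only illustrative, and the dot before the milliseconds is escaped so that it matches a literal period. Because re.X ignores unescaped whitespace in the pattern, literal spaces are written as "[ ]".

```python
import re

# Illustrative helper name: one hh:mm:ss.mmm timestamp.
TIMESTAMP = r"\d\d:\d\d:\d\d\.\d\d\d"

pattern = re.compile(
    rf"""
    ^{TIMESTAMP}[ ];;[ ]{TIMESTAMP}[ ]         # header: start ;; end, then a space
    .*                                         # rest of the header line
    (                                          # group 1: text up to the next header
      (?:
        \n                                     # step to the next line ...
        (?!{TIMESTAMP}[ ];;[ ]{TIMESTAMP})     # ... unless that line is a header
        .*                                     # consume the whole line
      )*
    )
    """,
    re.M | re.X,
)

sample = (
    "00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
    "FOO BAR FOO FOO FOO BAR\n\n"
    "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
    "BAR BAR BAR' BAR. FOO.BAR\n\n"
    "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
    "BOO.BOO. FARFAR.FAR"
)

# findall returns the contents of group 1 for every header it finds.
res = [text.strip() for text in pattern.findall(sample)]
print(res)
```

The negative lookahead is what stops each capture: the repeated group only advances onto a new line when that line does not start with a header, so the next header begins a fresh match.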

Or, if the data always follows this format, shorten the pattern to:

^\d\d:\d\d:\d\d\.\d{3} ;; .*((?:\n(?!\d\d:\d\d:\d\d\.\d{3} ;;).*)*)
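Since the question also asks about iterating through the file instead, here is a rough sketch of that alternative. It is not part of the original answer: it reuses only the short header test (with the dot escaped), and the names HEADER, iter_entries and sample are hypothetical. It reads one line at a time, so it should work on a file object without loading the whole file into memory.

```python
import re
from typing import Iterable, Iterator, List, Optional

# Short header test: a timestamp followed by " ;;" at the start of the line.
HEADER = re.compile(r"\d\d:\d\d:\d\d\.\d{3} ;;")


def iter_entries(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text that follows each header line, one string per entry."""
    current: Optional[List[str]] = None  # None until the first header is seen
    for raw in lines:
        line = raw.rstrip("\n")
        if HEADER.match(line):
            # A new header closes the previous entry, if there was one.
            if current is not None:
                yield "\n".join(current).strip()
            current = []
        elif current is not None:
            current.append(line)
    # Emit the last entry, which has no header after it.
    if current is not None:
        yield "\n".join(current).strip()


sample = (
    "00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
    "FOO BAR FOO FOO FOO BAR\n\n"
    "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
    "BAR BAR BAR' BAR. FOO.BAR\n\n"
    "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
    "BOO.BOO. FARFAR.FAR"
)

# For a real log, pass an open file object instead of the split sample.
entries = list(iter_entries(sample.splitlines(keepends=True)))
print(entries)
```

Which approach is faster on a 123,378-line file would need to be measured. The regex version needs the whole text in one string, while this one only keeps the current entry in memory.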
