How can I use regex to add line breaks and also preserve times?

I’m trying to clean up some data from web scraping.

This is an example of the information I’m working with:

Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)

And this is an example of what I’m trying to achieve:

Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)

I want two newlines after the closing parenthesis of each time, but since the times are all different, I don’t know how to search for and replace them. Numbers may also sometimes occur outside of the times.

The closest I’ve come is searching for digits inside parentheses separated by a colon, but that replaces the time itself and I don’t know how to keep the matched text in the replacement.

re.sub(r"\([0-9]+:[0-9]+\)", "\n\n", result)

Anyone know how I can achieve this?
TIA.

Solution:

Notice that the place where you need to insert two newlines comes between an end parenthesis and an alphabetic character. So, you can use:

re.sub(r"\)([A-Za-z])", r")\n\n\1", data)

For example:

import re
data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""

result = re.sub(r"\)([A-Za-z])", r")\n\n\1", data)
print(result)

outputs:

Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (a) (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)

Here’s an explanation for how it works:

For the expression we’re trying to match, we have r"\)([A-Za-z])":

  • \) matches a literal end parenthesis.
  • [A-Za-z] matches a single alphabetic character.
  • Enclosing [A-Za-z] in parentheses makes it a capture group that we refer to later.
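
To see what the pattern above actually picks up, here is a small sketch that lists each match on the sample data. The variable names `pattern` and `matches` are only illustrative and not part of the original answer.

```python
import re

data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""

pattern = re.compile(r"\)([A-Za-z])")

# For every match, record the full matched text (group 0)
# and the single captured letter (group 1).
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(data)]
print(matches)
```

Note that `(w/ help)` and `(a)` are not matched here, because each is followed by a space rather than a letter.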

For the replacement expression, we have r")\n\n\1":

  • )\n\n adds an end parenthesis plus two newlines.
  • \1 refers to the capture group from earlier. Intuitively, we capture the alphabetic character immediately after the end parenthesis, and then add that same character back into the replacement expression.
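
The same capture-group idea can also be applied to the attempt in the question, which matched the time itself. The following is a sketch of that variant, not the answer's method: it assumes every time has the form (digits:digits), captures the whole time as group 1, and uses a lookahead so that newlines are only inserted when more text follows directly. The name `time_pattern` is illustrative.

```python
import re

data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""

# Group 1 captures the whole "(m:ss)" time.
# (?=\S) is a lookahead: it requires a non-whitespace character right
# after the time but does not consume it, so nothing is lost.
time_pattern = re.compile(r"(\(\d+:\d+\))(?=\S)")

# \1 puts the captured time back, followed by two newlines.
result = time_pattern.sub(r"\1\n\n", data)
print(result)
```

Because this variant keys on the time rather than on any closing parenthesis, a parenthesis followed directly by a letter elsewhere in the text would be left alone. The trade-off is that it depends on the times always containing a colon between digits.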