I’m trying to clean up some data from web scraping.
This is an example of the information I’m working with:
Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)
And this is an example of what I’m trying to achieve:
Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)
I need two newlines after each closing parenthesis that ends a time, but as the times are all different I don’t know how to search for and replace them. Numbers may also sometimes occur outside of the times.
The closest I’ve come is searching for numbers inside parentheses separated by a colon, but I don’t know how to put the matched text back into the replacement.
re.sub(r"\([0-9]+:[0-9]+\)", "\n\n", result)
Anyone know how I can achieve this?
TTIA.
>Solution :
Notice that the place where you need to insert two newlines comes between an end parenthesis and an alphabetic character. So, you can use:
re.sub(r"\)([A-Za-z])", r")\n\n\1", data)
For example:
import re
data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""
result = re.sub(r"\)([A-Za-z])", r")\n\n\1", data)
print(result)
outputs:
Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (a) (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)
Here’s an explanation for how it works:
For the expression we’re trying to match, we have r"\)([A-Za-z])":
- \) matches a literal end parenthesis.
- [A-Za-z] matches a single alphabetic character.
- Enclosing [A-Za-z] in parentheses makes it a capture group that we refer to later.
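To see what this pattern picks up, a small sketch like the following lists each match together with its captured letter. The shortened `sample` string and the variable names are only for illustration.

```python
import re

# Shortened sample in the same shape as the scraped data (illustrative only).
sample = "Adam Jones (w/ help) (6:34)Best Time\nKenny Gobbin (a) (2:38)Personal Best"

pattern = re.compile(r"\)([A-Za-z])")

# Each match is a ")" followed directly by a letter;
# group(0) is the whole match and group(1) is just the captured letter.
matches = [(m.group(0), m.group(1)) for m in pattern.finditer(sample)]
print(matches)
```

The closing parentheses after "help" and after "a" are followed by a space rather than a letter, so they are not matched.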
For the replacement expression, we have r")\n\n\1":
- )\n\n adds an end parenthesis plus two new lines.
- \1 refers to the capture group from earlier.
Intuitively, we capture the alphabetic character immediately after the end parenthesis, and then add that same character back into the replacement expression.
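If a closing parenthesis could ever be followed directly by a letter somewhere other than after a time, a variant keyed on the time itself may be safer. The following is only a sketch that reuses the time pattern from the question. The function name `split_after_times` is made up for illustration, and it assumes times always look like (digits:digits).

```python
import re

def split_after_times(text):
    # Capture the whole "(m:ss)" time as group 1 and put it back with \1,
    # followed by two newlines. The lookahead (?=\S) only allows a match when
    # a non-whitespace character follows, so a time at the end of a line or
    # at the end of the text is left untouched.
    return re.sub(r"(\(\d+:\d+\))(?=\S)", r"\1\n\n", text)

data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""

print(split_after_times(data))
```

Because the pattern requires the colon-separated digits, a parenthesized note such as "(a)" immediately followed by text would not trigger a split.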