Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Extract all matching substrings between two characters in the dataframe column

I have a dataframe column containing html.

I’m trying to extract everything between the list tags, (<li> and </li>) and return in a new column named ‘output’.

I’d like to include the <li> and </li> tags too as part of the output.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Minimum Reproducible Example:

import pandas as pd

data = {
    'ID': ['1', '2'],
    'Description': ['blah blah <li>Point 1</li>blah blah<li>Point 2</li>blah blah blah', 'blah<li>Point1</li>blah<li>Point 2</li>']}

df = pd.DataFrame(data)
df['new'] = df.Description.apply(lambda st: st[st.find("<li>")+1:st.find("</li>")])
print(df)

Desired Output

  ID                                        Description         output
0  1  blah blah <li>Point 1</li>blah blah<li>Point 2...  <li>Point 1</l><li>Point 2</li>
1  2            blah<li>Point1</li>blah<li>Point 2</li>  <li>Point 1</l><li>Point 2</li>

What I’ve tried:
Although there appears to be a lot of solutions around (‘extracting substrings between two strings’) there isn’t anything that comes close.

For example, this (and many other result) only return the first instance
Extract substring between two characters in pandas but I need to extract all instances.

Also I haven’t seen any that will preserve the <li> tags.

>Solution :

You can use look-behind assertion and look-ahead assertion to get your desired result. You can look into this amazing answer to understand more.

In your case, (?=<li>) will match the string followed by <li> while (?<=</li>) will match the string preceded by </li>.

result = df.Description.str.findall(r'(?=<li>).*?(?<=</li>)')

0    [<li>Point 1</li>, <li>Point 2</li>]
1     [<li>Point1</li>, <li>Point 2</li>]
Name: Description, dtype: object

Eventually, you assign it back to the dataframe.

df['output'] = result
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading