Create Dataframe Extracting Words With Period After A Specific Word

I have the following text:

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."

I need to extract all the sport names (which come after sport:) and styles (which come after style:) and put them into new columns named sports and style. I’m using the following code to isolate the main sentence (the text is sometimes huge):

import re

# Split into sentences: after a period, before a capitalized word
m = re.split(r'(?<=\.)\s+(?=[A-Z]\w+)', text_main)
text = list(filter(lambda x: re.search(r'leagues were identified', x, flags=re.IGNORECASE), m))[0]
print(text)

The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d.

Then I’m extracting the sport and style names and putting them into a dataframe:


import pandas as pd

# findall returns an empty list when nothing matches, so no guard is needed
sport_list = re.findall(r'sport:\W*(\w+)', text)

df = pd.DataFrame({'sports': sport_list})
print(df)

    sports
0   basketball
1   soccer
2   football

However, I’m having trouble extracting the styles, because every style has a period (.) after the first letter (c) and some contain a > sign. Also, not all the sports have style info.
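To make the difficulty concrete, here is a minimal sketch that applies the same \w+ approach to the styles. The variable names naive_styles, sports and styles are only illustrative, and the text is the extracted sentence from above.

```python
import re

text = ("The following leagues were identified: sport: basketball league: N.B.A. "
        "style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. "
        "style: c.124>d.")

# \w+ stops at the first non-word character, so only the leading "c" is captured.
naive_styles = re.findall(r'style:\W*(\w+)', text)
print(naive_styles)

# Even with \S+, a separate findall yields two styles for three sports,
# so the two lists cannot simply be paired row by row.
sports = re.findall(r'sport:\W*(\w+)', text)
styles = re.findall(r'style:\s*(\S+)', text)
print(len(sports), len(styles))
```

This suggests the sport and its optional style need to be captured by one pattern, so each style stays attached to the right sport.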

Desired output:

    sports        style
0   basketball    c.123>d
1   soccer        NA
2   football      c.124>d

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

Solution:

You can use

\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?

Details:

  • \b – a word boundary
  • sport: – a fixed string
  • \s* – zero or more whitespaces
  • (\w+) – Group 1: one or more word chars
  • (?: – start of an optional non-capturing group:
    • (?:(?!\bsport:).)*? – any char other than line break chars, zero or more occurrences but as few as possible, where none of those chars starts a whole word sport: sequence (so the match never runs into the next sport)
    • \bstyle: – a whole word style and then :
    • \s* – zero or more whitespaces
    • (\S+) – Group 2: one or more non-whitespace chars
  • )? – end of the optional non-capturing group.
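
The breakdown above can be sketched as a commented pattern using re.VERBOSE. This is only an illustration: the names pattern and sample are hypothetical, and sample is a shortened piece of the original text.

```python
import re

pattern = re.compile(r"""
    \bsport:\s*(\w+)          # Group 1: the word after "sport:"
    (?:                       # optional part: a style that belongs to this sport
        (?:(?!\bsport:).)*?   # advance lazily, never crossing into the next "sport:"
        \bstyle:\s*(\S+)      # Group 2: the non-whitespace run after "style:"
    )?
""", re.VERBOSE)

sample = "sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d"

# With finditer, an optional group that did not take part in the match is None.
for m in pattern.finditer(sample):
    print(m.group(1), m.group(2))
```

Note that re.findall reports a non-participating group as an empty string rather than None, which is why a sport with no style shows up as a blank cell when the findall result is loaded into a dataframe.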

See the Python demo:

import re
import pandas as pd
text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
matches = re.findall(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main)
df = pd.DataFrame(matches, columns=['sports', 'style'])
print(df)

Output:

>>> df
       sports    style
0  basketball   c.123>d
1      soccer          
2    football  c.124>d.
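
The output above still differs from the desired output in two ways: soccer gets an empty string rather than a missing value, and the last style keeps the sentence-ending period because \S+ matches it. A possible clean-up step is sketched below. It assumes a trailing period is never part of a real style value, which may not hold for every dataset.

```python
import re
import pandas as pd

text_main = ("The following leagues were identified: sport: basketball league: N.B.A. "
             "style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. "
             "style: c.124>d. The other leagues aren't that important.")

matches = re.findall(
    r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main)
df = pd.DataFrame(matches, columns=['sports', 'style'])

# Assumption: a trailing period is sentence punctuation, not part of the style.
df['style'] = df['style'].str.rstrip('.')

# findall returns '' for the unmatched optional group; turn that into a missing value.
df['style'] = df['style'].mask(df['style'] == '')

print(df)
```

Stripping the period after the match keeps the regex unchanged. If styles can legitimately end with a period, the rstrip step should be dropped or replaced by a stricter pattern.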