Create Dataframe Extracting Words With Period After A Specific Word

May 18, 2022

I have the following text:

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."

I need to extract all the sport names (which come after sport:) and styles (which come after style:) and put them into new columns named sports and style. I’m using the following code to extract the main sentence (the text is sometimes huge):

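import re

# Split into sentences at whitespace that follows a period and precedes a capitalized word
m = re.split(r'(?<=\.)\s+(?=[A-Z]\w+)', text_main)
text = list(filter(lambda x: re.search(r'leagues were identified', x, flags=re.IGNORECASE), m))[0]
print(text)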

The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d.

Then I’m extracting the sport and style names and putting them into a dataframe:

import pandas as pd

# re.findall returns an empty list when nothing matches, so no guard is needed
sport_list = re.findall(r'sport:\W*(\w+)', text)

df = pd.DataFrame({'sports': sport_list})
print(df)

    sports
0   basketball
1   soccer
2   football

However, I’m having trouble extracting the styles, because every style has a period (.) after its first letter (c) and some contain a > sign. Also, not all sports have style info.
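To illustrate the problem, here is a minimal sketch that reuses the \w+ approach from the sports step for the styles. The variable names word_only and non_space are only for illustration, and text holds the sentence extracted above.

```python
import re

text = ("The following leagues were identified: sport: basketball league: N.B.A. "
        "style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. "
        "style: c.124>d.")

# \w+ stops at the first non-word character, so each match ends before the period
word_only = re.findall(r'style:\W*(\w+)', text)
print(word_only)  # ['c', 'c']

# \S+ captures up to the next whitespace, but the result no longer says
# which sport each style belongs to (soccer has none)
non_space = re.findall(r'style:\s*(\S+)', text)
print(non_space)  # ['c.123>d', 'c.124>d.']
```

Neither list lines up with the three sports, which is why the sport and its optional style need to be matched together.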

Desired output:

    sports        style
0   basketball    c.123>d
1   soccer        NA
2   football      c.124>d

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

Solution:

You can use

\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?

Details:

\b – a word boundary
sport: – a fixed string
\s* – zero or more whitespaces
(\w+) – Group 1: one or more word chars
(?: – start of an optional non-capturing group:
- (?:(?!\bsport:).)*? – any char other than line break chars, zero or more times but as few as possible, where no char starts a whole-word sport: sequence
- \bstyle: – a whole word style and then :
- \s* – zero or more whitespaces
- (\S+) – Group 2: one or more non-whitespace chars
)? – end of the optional non-capturing group.
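The breakdown above can also be written into the pattern itself. The following is a sketch using re.VERBOSE, which ignores whitespace in the pattern and allows comments. The names SPORT_STYLE, sample, and pairs are hypothetical, and the sample string is a shortened version of the text from the question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE
SPORT_STYLE = re.compile(r"""
    \bsport:\s*              # whole word 'sport:' followed by optional whitespace
    (\w+)                    # Group 1: the sport name
    (?:                      # start of the optional style part
        (?:(?!\bsport:).)*?  # skip lazily, never crossing into the next 'sport:'
        \bstyle:\s*          # whole word 'style:' followed by optional whitespace
        (\S+)                # Group 2: the style, up to the next whitespace
    )?                       # the style part may be absent
""", re.VERBOSE)

sample = "sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L."
pairs = SPORT_STYLE.findall(sample)
print(pairs)  # [('basketball', 'c.123>d'), ('soccer', '')]
```

With two capturing groups, findall returns one tuple per match, and a group that did not take part in the match comes back as an empty string.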

See the Python demo:

import re
import pandas as pd

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
matches = re.findall(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main)
df = pd.DataFrame(matches, columns=['sports', 'style'])

Output:

>>> df
       sports    style
0  basketball   c.123>d
1      soccer          
2    football  c.124>d.
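This output differs from the desired one in two ways: the last style keeps the sentence-ending period, and the missing style is an empty string rather than NA. One possible cleanup is sketched below. The helper extract_rows and the constant PATTERN are hypothetical names, and stripping trailing periods assumes that a real style value never ends in a period.

```python
import re

PATTERN = r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?'

text_main = ("The following leagues were identified: sport: basketball league: N.B.A. "
             "style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. "
             "style: c.124>d. The other leagues aren't that important.")

def extract_rows(text):
    """Return one dict per sport, with None where no style was found."""
    rows = []
    for sport, style in re.findall(PATTERN, text):
        # Assumption: a trailing '.' is sentence punctuation, not part of the style
        style = style.rstrip('.')
        # An empty string means the optional style group did not match
        rows.append({'sports': sport, 'style': style or None})
    return rows

rows = extract_rows(text_main)
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give a frame in which the soccer row has a missing style instead of an empty string.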

by MR

Published May 18, 2022
