Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Is there a way to extract regular expression patterns from links in a Pandas dataframe?

I am trying to extract regex patterns from links in a Pandas table generated from the page.

The code generating the Pandas data frame is given below:

import pandas as pd
import re

url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'

table = pd.read_html(url, extract_links = "body")[0]
table = table.apply(lambda col: [link[0] if link[1] is None else f'{base_url}{link[1]}' for link in  col])
table

I want to extract the match-id from the links in the table. For each match, the match-id is the set of successive digits that succeed the ‘t20i-‘ pattern and end before the forward slash. Example:
For this match the match-id is 211048. The code which does this for a single match is given below:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

scorecard_url = 'https://www.espncricinfo.com/series/australia-tour-of-new-zealand-2004-05-61407/new-zealand-vs-australia-only-t20i-211048/full-scorecard'
match_id = re.findall('t20i-(\d*)/', scorecard_url)
match_id[0]

I want to do this for the table as a whole by using a derived column match-id that uses the Scorecard column. However I have not been able to.

I initially tried this simple command to do this:

table['match_id']= re.findall('t20i-(\d*)/', table['Scorecard'])
table

I got a ‘TypeError: expected string or bytes-like object’ which made me think the links are not stored as strings which might be causing the issue.

I then tried:

table['match_id']= re.findall('t20i-(\d*)/', str(table['Scorecard']))
table

This gave me a ‘ValueError: Length of values (0) does not match length of index (3)’ which I am not sure about.

I also tried to use the lambda function approach without success. I don’t mind using this either if it works.

>Solution :

You are close. This adds a new column with the match ID.

import pandas as pd
import re

url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'

def match(row):
    match_id = re.findall('t20i-(\d*)/', row[1])
    return match_id[0]
    
table = pd.read_html(url, extract_links = "body")[0]
table['match'] = table['Scorecard'].apply(match)
print(table)

Output:

                 Team 1  ...   match
0   (New Zealand, None)  ...  211048
1       (England, None)  ...  211028
2  (South Africa, None)  ...  222678

[3 rows x 8 columns]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading