Is there a way to extract regular expression patterns from links in a Pandas dataframe?

August 16, 2023

I am trying to extract text matching a regex pattern from the links in a Pandas table generated from a web page.

The code generating the Pandas data frame is given below:

import pandas as pd
import re

url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'

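# With extract_links="body", every body cell is a (text, href) tuple
table = pd.read_html(url, extract_links="body")[0]
# Keep the text when a cell has no link, otherwise build the absolute URL
table = table.apply(lambda col: [link[0] if link[1] is None else f'{base_url}{link[1]}' for link in col])
table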

I want to extract the match-id from the links in the table. For each match, the match-id is the run of digits that follows the 't20i-' pattern and ends before the next forward slash. For example, for the match whose scorecard URL is shown below, the match-id is 211048. The code which does this for a single match is:

scorecard_url = 'https://www.espncricinfo.com/series/australia-tour-of-new-zealand-2004-05-61407/new-zealand-vs-australia-only-t20i-211048/full-scorecard'
match_id = re.findall(r't20i-(\d*)/', scorecard_url)  # raw string so \d is not treated as a string escape
match_id[0]

I want to do this for the whole table by adding a derived column, match_id, computed from the Scorecard column. However, I have not been able to get it working.

I initially tried this simple command to do this:

table['match_id']= re.findall('t20i-(\d*)/', table['Scorecard'])
table

I got a 'TypeError: expected string or bytes-like object', which made me think the links are not stored as strings and that this might be causing the issue.

I then tried:

table['match_id']= re.findall('t20i-(\d*)/', str(table['Scorecard']))
table

This gave me a 'ValueError: Length of values (0) does not match length of index (3)', which I do not understand.

I also tried to use the lambda function approach without success. I don’t mind using this either if it works.

Solution:

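You are close. re.findall expects a single string, not a whole column, so apply the regex to each cell of the Scorecard column instead. With extract_links = "body", each cell is a (text, link) tuple, so the link is the second element. This adds a new column with the match ID.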

import pandas as pd
import re

url = 'https://www.espncricinfo.com/records/year/team-match-results/2005-2005/twenty20-internationals-3'
base_url = 'https://www.espncricinfo.com'

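def match(cell):
    # Each cell is a (text, href) tuple; href is None when the cell has no link
    href = cell[1]
    if href is None:
        return None
    # Capture the digits between 't20i-' and the next '/'
    found = re.search(r't20i-(\d+)/', href)
    return found.group(1) if found else None

table = pd.read_html(url, extract_links="body")[0]
table['match'] = table['Scorecard'].apply(match)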
print(table)

Output:

                 Team 1  ...   match
0   (New Zealand, None)  ...  211048
1       (England, None)  ...  211028
2  (South Africa, None)  ...  222678

[3 rows x 8 columns]
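
As an alternative, the same column can probably be built with pandas' own string methods instead of a custom function. The sketch below is only an illustration: it uses a small hand-built frame in place of the scraped table (the 'T20I # 1' and 'T20I # 2' cell texts and the second row without a link are assumptions made for the example), and it names the new column match_id.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: with extract_links="body",
# each cell is a (text, href) tuple and href is None when there is no link.
table = pd.DataFrame({
    "Team 1": [("New Zealand", None), ("England", None)],
    "Scorecard": [
        ("T20I # 1",
         "/series/australia-tour-of-new-zealand-2004-05-61407/"
         "new-zealand-vs-australia-only-t20i-211048/full-scorecard"),
        ("T20I # 2", None),  # assumed row without a link
    ],
})

# Take the href (second tuple element) out of every Scorecard cell.
links = table["Scorecard"].map(lambda cell: cell[1])

# Series.str.extract applies the regex to each element and returns the
# first capture group; elements that are missing or do not match stay missing.
table["match_id"] = links.str.extract(r"t20i-(\d+)/", expand=False)

print(table["match_id"])
```

Here the first row yields the string '211048' and the row without a link is left as a missing value rather than raising an error. If numeric IDs are needed, pd.to_numeric can be applied to the column afterwards.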