Is there a way to extract regular expression patterns from links in a Pandas dataframe?

I am trying to extract regex patterns from links in a Pandas table generated from the page.

The code generating the Pandas data frame is given below:

import pandas as pd
import re

url = ''
base_url = ''

table = pd.read_html(url, extract_links = "body")[0]
table = table.apply(lambda col: [link[0] if link[1] is None else f'{base_url}{link[1]}' for link in  col])

I want to extract the match-id from the links in the table. For each match, the match-id is the run of consecutive digits that follows ‘t20i-’ and ends before the next forward slash. Example:
For this match the match-id is 211048. The code that does this for a single match is given below:

scorecard_url = ''
match_id = re.findall('t20i-(\d*)/', scorecard_url)

I want to do this for the whole table by adding a derived column, match_id, computed from the Scorecard column. However, I have not been able to get this to work.

I initially tried this simple command to do this:

table['match_id']= re.findall('t20i-(\d*)/', table['Scorecard'])

I got a ‘TypeError: expected string or bytes-like object’ which made me think the links are not stored as strings which might be causing the issue.

I then tried:

table['match_id']= re.findall('t20i-(\d*)/', str(table['Scorecard']))

This gave me a ‘ValueError: Length of values (0) does not match length of index (3)’ which I am not sure about.

I also tried to use the lambda function approach without success. I don’t mind using this either if it works.

>Solution :

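You are close. re.findall expects a single string, so passing the whole Scorecard column raises the TypeError. Wrapping the column in str() turns it into one display string rather than one string per row, so the list of matches does not line up with the index. Because extract_links = "body" stores each cell as a (text, link) tuple, apply the regex cell by cell to the second element of the tuple. This adds a new column with the match ID.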

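import pandas as pd
import re

url = ''

def match(cell):
    # With extract_links="body", each cell is a (text, href) tuple;
    # href is None when the cell contains no link.
    href = cell[1]
    if href is None:
        return None
    # Raw string so that \d is passed to the regex engine unchanged
    found = re.search(r't20i-(\d+)/', href)
    return found.group(1) if found else None

# Read the first table on the page, keeping the link targets
table = pd.read_html(url, extract_links="body")[0]
table['match'] = table['Scorecard'].apply(match)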


                 Team 1  ...   match
0   (New Zealand, None)  ...  211048
1       (England, None)  ...  211028
2  (South Africa, None)  ...  222678

[3 rows x 8 columns]
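
As a variation, the same extraction can probably be done without a helper function by using the pandas string methods. The sketch below is self-contained and only illustrative: it builds a small stand-in table by hand instead of calling read_html, and the link path in it is made up (the real links are not reproduced here). It assumes the Scorecard cells are (text, link) tuples, as in the output above.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: each cell is a (text, href)
# tuple, the shape read_html produces with extract_links="body".
table = pd.DataFrame({
    "Team 1": [("New Zealand", None), ("England", None)],
    "Scorecard": [
        # Made-up link path, for illustration only
        ("T20I # 1", "/series/example/first-t20i-211048/full-scorecard"),
        ("T20I # 2", None),  # a cell without a link
    ],
})

# .str[1] picks the href out of each tuple (None where there is no link).
hrefs = table["Scorecard"].str[1]

# .str.extract applies the regex row by row and returns the first capture
# group; rows with no link or no match become NaN instead of raising.
table["match_id"] = hrefs.str.extract(r"t20i-(\d+)/", expand=False)

print(table["match_id"].tolist())
```

Rows without a link end up as missing values rather than causing an error, which may be more convenient than an exception when some matches have no scorecard yet.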
