Extracting information from a list of strings using regex

April 19, 2023

I have a list of strings from which I want to extract amounts, percentages, and similar values. I am new to regex and have been struggling with it. Below are my input, my desired output, and the code I tried.

Input list:

['0.09% of the first GBP£250 million of the Company’s Net Asset Value;',
'0.08% of the next GBP£250 million of the Company’s Net Asset Value;',
"0.06% of the next GBP£500 million of the Company's Net Asset Value; and",
'in accordance with the formula GBP£22,000 + 365, Minimum fee to be' ]

Code:

import re

def extract_pounds(text):
    regex = r"£(\w+)"
    return re.findall(regex, str(text))


for word in empty_df:
    pounds = extract_pounds(word)
    print(pounds)

I am getting the following output, which is far from my desired output:

['250']
['250']
['500']
['22']
['22']

Desired output:

 Tier    Amount                  Minimum Fee
 0.09%   first GBP£250 million   GBP£22,000
 0.08%   next GBP£250 million    
 0.06%   next GBP£500 million

>Solution :

With pandas, you can try something like this (assuming the input list is named lst):

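import re
import pandas as pd

# lst is the input list from the question.
# Capture the tier percentage and the "first/next GBP£<n> <unit>" amount.
pat = r"([\d.]+%) of the (\w+ GBP£\d+ \w+)"
df = pd.Series(lst[:-1]).str.extract(pat).set_axis(["Tier", "Amount"], axis=1)

# The minimum fee sits in the last string; store it on the first row only.
df.loc[0, "Minimum Fee"] = re.search(r"GBP£\d+,\d+", lst[-1]).group(0)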

Output:

print(df)

    Tier                 Amount Minimum Fee
0  0.09%  first GBP£250 million  GBP£22,000
1  0.08%   next GBP£250 million         NaN
2  0.06%   next GBP£500 million         NaN
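
The original attempt returned only '250' and '22' because \w+ matches letters, digits and underscores, so it stops at the first space or comma after the pound sign. If pandas is not available, the same extraction can be sketched with the re module alone. This is a minimal sketch, not the answerer's method: the names lst, TIER_PAT, FEE_PAT and extract_tiers are illustrative, and it assumes the minimum fee is on the line containing the text "Minimum fee".

```python
import re

# Sample input copied from the question; the name `lst` is an assumption.
lst = [
    "0.09% of the first GBP£250 million of the Company’s Net Asset Value;",
    "0.08% of the next GBP£250 million of the Company’s Net Asset Value;",
    "0.06% of the next GBP£500 million of the Company's Net Asset Value; and",
    "in accordance with the formula GBP£22,000 + 365, Minimum fee to be",
]

# Tier percentage, then the "first/next GBP£<n> <unit>" amount.
TIER_PAT = re.compile(r"([\d.]+%) of the (\w+ GBP£[\d,]+ \w+)")
# A standalone amount with optional thousands groups, e.g. GBP£22,000.
FEE_PAT = re.compile(r"GBP£\d+(?:,\d{3})*")


def extract_tiers(lines):
    """Return (rows, minimum_fee); each row is a dict with Tier and Amount."""
    rows, minimum_fee = [], None
    for line in lines:
        tier = TIER_PAT.search(line)
        if tier:
            rows.append({"Tier": tier.group(1), "Amount": tier.group(2)})
        elif "Minimum fee" in line:
            # Assumption: the fee is the first GBP amount on this line.
            fee = FEE_PAT.search(line)
            if fee:
                minimum_fee = fee.group(0)
    return rows, minimum_fee


rows, minimum_fee = extract_tiers(lst)
for row in rows:
    print(row["Tier"], row["Amount"])
print("Minimum Fee:", minimum_fee)
```

Returning the fee separately, rather than attaching it to the first row, keeps the tier rows uniform; it can be merged back in when building the final table.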