I have a dataframe and a string list:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({
    'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
             'PORTUGAL', 'PORTUGLA'],
    'Column_two': [1, 2, 3, 4, 5, 6, 7, 8]
})
print(df)
# Output:
       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3      PARI           4
4    P ARIS           5
5  NOW YORK           6
6  PORTUGAL           7
7  PORTUGLA           8
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
I am using the fuzzywuzzy Python library. Its fuzz.partial_ratio function returns a number from 0 to 100 that represents how similar the two compared strings are:
Example:
fuzz.partial_ratio("BRASIL", "BRAZIL")
# Output:
88
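For intuition, the idea behind a partial ratio can be sketched with the standard library alone. The function below is a simplified illustration, not fuzzywuzzy's actual implementation: `partial_ratio_sketch` is a hypothetical name, it uses `difflib.SequenceMatcher` as the scorer, and it checks every window of the longer string, so its scores may differ from what fuzz.partial_ratio returns.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best similarity (0-100) between the shorter string and any
    equally long window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window of len(shorter) over the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


# "PARI" appears verbatim inside "PARIS", so the best window matches fully.
print(partial_ratio_sketch("PARI", "PARIS"))  # 100
```

This is why a truncated name such as 'PARI' can score very high against 'PARIS': only the best-aligned part of the longer string is compared.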
I would like to iterate through the 'Name' column of the dataframe and compare each value to the strings in list_string_correct. If a value is similar enough to one of them, I would like to replace it with the correct name (the matching string from the list). So, I wrote the following code:
for i in range(len(df)):
    for var_string in list_string_correct:
        # Similarity score between 0 and 100
        result = fuzz.partial_ratio(var_string, df['Name'].iloc[i])
        if result >= 80:  # Condition
            # Assign through df.loc to avoid chained assignment
            df.loc[i, 'Name'] = var_string
The code is working. The output is as desired:
print(df)
# Output:
       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3     PARIS           4
4     PARIS           5
5  NEW YORK           6
6  PORTUGAL           7
7  PORTUGAL           8
However, I needed two nested for loops. Is there a way to replace the loops and keep the same output?
To install the libraries use:
pip install fuzzywuzzy
pip install python-Levenshtein
>Solution :
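Try process.extractOne from the thefuzz package (the successor of fuzzywuzzy, by the same author, with the same API). It returns the best match from the list as a (match, score) tuple, or None if no candidate reaches score_cutoff, which is why the result is unpacked with .str[0] and the gaps are filled with the original names. Note that extractOne scores with fuzz.WRatio by default; pass scorer=fuzz.partial_ratio if you want the same scoring as in your loop: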
# from fuzzywuzzy import process
from thefuzz import process

THRESHOLD = 80

df['Name'] = (
    df['Name']
    .apply(lambda x: process.extractOne(x, list_string_correct,
                                        score_cutoff=THRESHOLD))
    .str[0]
    .fillna(df['Name'])
)
Output:
>>> df
Name Column_two
0 PARIS 1
1 NEW YORK 2
2 MADRI 3
3 PARIS 4
4 PARIS 5
5 NEW YORK 6
6 PORTUGAL 7
7 PORTUGAL 8
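To make the "best match or keep the original" logic more concrete, here is a rough standard-library sketch of the same idea. It is not how thefuzz is implemented: `similarity` and `best_match` are hypothetical helper names, and `difflib.SequenceMatcher` stands in for the fuzzy scorer, so individual scores will not match thefuzz exactly.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in scorer (assumption): difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, choices, threshold=80, scorer=similarity):
    """Return the highest-scoring choice, or `name` unchanged if no
    choice reaches `threshold`."""
    if not choices:
        return name
    best = max(choices, key=lambda choice: scorer(name, choice))
    return best if scorer(name, best) >= threshold else name


names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

cleaned = [best_match(n, list_string_correct) for n in names]
print(cleaned)
# ['PARIS', 'NEW YORK', 'MADRI', 'PARIS', 'PARIS', 'NEW YORK', 'PORTUGAL', 'PORTUGAL']
```

With a dataframe, the same helper could presumably be applied with df['Name'].map(lambda x: best_match(x, list_string_correct)), which keeps the per-row logic in one place and avoids the explicit nested loops.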