How to replace two nested for loops over a list and a DataFrame in Python?

January 7, 2022

I have a dataframe and a string list:

      import pandas as pd
      from fuzzywuzzy import fuzz
      from fuzzywuzzy import process

      df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
                                  'PORTUGAL', 'PORTUGLA'],                   
                         'Column_two': [1,2,3,4,5,6,7,8]                 
                         })

      print(df)

      # Output:

      Name        Column_two
      PARIS       1
      NEW YORK    2
      MADRI       3
      PARI        4
      P ARIS      5
      NOW YORK    6
      PORTUGAL    7
      PORTUGLA    8

      list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

I am using the fuzzywuzzy Python library. Its fuzz.partial_ratio function returns a number from 0 to 100 that represents how similar the two compared strings are.
Example:

     fuzz.partial_ratio("BRASIL", "BRAZIL")

     # Output:
     88

I would like to iterate through the 'Name' column of the dataframe and compare each value to the strings in list_string_correct. If a value is similar to one of them, I would like to replace it with the correct name from the list. So, I wrote the following code:

      for i in range(len(df)):
          for var_string in list_string_correct:

              # Similarity score from 0 to 100
              result = fuzz.partial_ratio(var_string, df.loc[i, 'Name'])

              if result >= 80:  # Similar enough: replace with the correct name
                  # Single .loc call avoids chained assignment
                  df.loc[i, 'Name'] = var_string

The code is working. The output is as desired:

     print(df)

     # Output:

         Name   Column_two
         PARIS      1
        NEW YORK    2
         MADRI      3
         PARIS      4
         PARIS      5
        NEW YORK    6
        PORTUGAL    7
        PORTUGAL    8

However, I needed two nested for loops. Is there a way to replace the loops and keep the same output?

To install the libraries use:

      pip install fuzzywuzzy
      pip install python-Levenshtein

>Solution :

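Try process.extractOne from the thefuzz package (the successor of fuzzywuzzy, with the same author and the same API). It returns the best match and its score as a tuple, or None when no candidate reaches score_cutoff: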

      # from fuzzywuzzy import process
      from thefuzz import process

      THRESHOLD = 80

      df['Name'] = \
          df['Name'].apply(lambda x: process.extractOne(x, list_string_correct,
                                         score_cutoff=THRESHOLD)).str[0].fillna(df['Name'])

Output:

      >>> df
             Name  Column_two
      0     PARIS           1
      1  NEW YORK           2
      2     MADRI           3
      3     PARIS           4
      4     PARIS           5
      5  NEW YORK           6
      6  PORTUGAL           7
      7  PORTUGAL           8
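
To make the "best match or keep the original" logic easier to follow, here is a minimal sketch in plain Python. It is not the thefuzz implementation. `similarity` is a hypothetical stand-in scorer built on the standard library's difflib.SequenceMatcher, and `best_match` is a hypothetical helper name. Its scores differ from those of fuzz.partial_ratio and process.extractOne, so the threshold of 80 may need adjusting on other data. Note one difference from the nested loops in the question: the loops overwrite the name with every candidate that passes the threshold, so the last passing candidate wins, while this approach keeps the highest-scoring one.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Hypothetical stand-in for a fuzzy scorer (assumption, not thefuzz):
    # a 0-100 similarity score computed with the standard library.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, choices, threshold=80):
    # Score every candidate and keep the highest-scoring one.
    score, candidate = max((similarity(c, name), c) for c in choices)
    # Below the threshold, leave the original value untouched.
    return candidate if score >= threshold else name


names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

cleaned = [best_match(n, list_string_correct) for n in names]
print(cleaned)
# ['PARIS', 'NEW YORK', 'MADRI', 'PARIS', 'PARIS', 'NEW YORK', 'PORTUGAL', 'PORTUGAL']
```

With a DataFrame, the same helper could be applied as df['Name'] = df['Name'].map(lambda n: best_match(n, list_string_correct)).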