How to replace two nested for loops over a list and a DataFrame in Python?

I have a dataframe and a string list:

      import pandas as pd
      from fuzzywuzzy import fuzz
      from fuzzywuzzy import process

      df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
                                  'PORTUGAL', 'PORTUGLA'],                   
                         'Column_two': [1,2,3,4,5,6,7,8]                 
                         })

      print(df)

      # Output:

      Name        Column_two
      PARIS       1
      NEW YORK    2
      MADRI       3
      PARI        4
      P ARIS      5
      NOW YORK    6
      PORTUGAL    7
      PORTUGLA    8

      list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

I am using the fuzzywuzzy Python library. Its fuzz.partial_ratio() function returns a number from 0 to 100 that represents how similar the two compared strings are.
Example:

     fuzz.partial_ratio("BRASIL", "BRAZIL")

     # Output:
     88

I would like to iterate through the 'Name' column of the dataframe and compare each string to the entries of list_string_correct. If a name is similar enough to one of them (a score of at least 80), I would like to replace it with the correct name from the list. So I wrote the following code:

      for i in range(len(df)):
          for j in range(len(list_string_correct)):

              var_string = list_string_correct[j]

              # Similarity score from 0 to 100
              result = fuzz.partial_ratio(var_string, df['Name'].iloc[i])

              if result >= 80:  # Condition
                  # A single .loc call avoids chained assignment; i is also the
                  # row label here because df has the default RangeIndex
                  df.loc[i, 'Name'] = var_string

The code is working. The output is as desired:

     print(df)

     # Output:

      Name        Column_two
      PARIS       1
      NEW YORK    2
      MADRI       3
      PARIS       4
      PARIS       5
      NEW YORK    6
      PORTUGAL    7
      PORTUGAL    8

However, I needed two nested for loops. Is there a way to replace the loops and keep the same output?

To install the libraries use:

      pip install fuzzywuzzy
      pip install python-Levenshtein

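Solution:

Try process.extractOne from the thefuzz package (the successor of fuzzywuzzy, with the same author and the same API). For each name it returns the best match from the list, or None when no candidate reaches score_cutoff. Note that it scores with fuzz.WRatio by default rather than fuzz.partial_ratio: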

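# from fuzzywuzzy import process
from thefuzz import process

THRESHOLD = 80

# extractOne returns a (choice, score) tuple, or None below the cutoff.
# .str[0] keeps the choice (NaN where there was no match), and fillna
# restores the original name for those rows.
matches = df['Name'].apply(lambda x: process.extractOne(
    x, list_string_correct, score_cutoff=THRESHOLD))
df['Name'] = matches.str[0].fillna(df['Name'])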

Output:

>>> df
       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3     PARIS           4
4     PARIS           5
5  NEW YORK           6
6  PORTUGAL           7
7  PORTUGAL           8
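
To make the pattern behind that answer concrete, here is a minimal, dependency-free sketch: pick the best-scoring candidate, keep it only if its score reaches the cutoff, and otherwise fall back to the original name. This is not thefuzz's implementation. The helper names extract_one and correct_name are hypothetical, and the score comes from the standard library's difflib.SequenceMatcher as a stand-in scorer, so individual scores may differ from what thefuzz reports.

```python
from difflib import SequenceMatcher


def extract_one(query, choices, score_cutoff=0):
    """Hypothetical stand-in: return (best_choice, score) or None.

    The score is difflib's similarity ratio scaled to 0-100, which is an
    assumption for illustration and not thefuzz's scorer.
    """
    scored = [
        (choice, round(100 * SequenceMatcher(None, query, choice).ratio()))
        for choice in choices
    ]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= score_cutoff else None


def correct_name(name, choices, score_cutoff=80):
    """Return the best match above the cutoff, else the name unchanged."""
    match = extract_one(name, choices, score_cutoff)
    return match[0] if match is not None else name


list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']

corrected = [correct_name(n, list_string_correct) for n in names]
print(corrected)
```

With pandas, the same helper could be applied to the column with df['Name'].map(lambda n: correct_name(n, list_string_correct)), which keeps the fallback logic in one place instead of chaining .str[0] and fillna.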