Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to find the string match between 2 Excel files and return the match percentage in python?

I have 2 Excel files which contains names as the only column:

File 1: file1.xlsx

Names
Vinay adkz
Sagarbhansali
Jeffery Jonas
Kiara Francis
Dominic 

File 2: file2.xlsx

Names:
bhansali Sagar
Dominic
Jenny
adkzVinay

Sample Output:
I want to match the names in file 1 with names in file 2, and i am trying to get an output like the below :

Names         File2Matchname.  Match%
Vinay adkz.     adkzVinay.       98%
Sagarbhansali.  bhansali sagar   97%
Jeffery Jonas       NA            0%
Kiara Francis       NA            0%
Dominic          Dominic         100%



Is there any logic by which the above logic can be arrived in python ?

I tried to do this in Excel but vlookup doesn’t help with match%. I know this is possible with python using cousine similarity but i am unable to get the logic in which the output can be arrived.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Any help would be much appreciated.

>Solution :

You can use Pandas and use python’s built-in difflib library which has a function called difflib.SequenceMatcher() function to find the longest common substring between the two names.

Example code:

import pandas as pd
import difflib

#For testing
dict_lists = {"Names":["Vinay adkz", "Shailesh", "Seema", "Nidhi","Ajitesh"]}
dict_lists2 = {"Names":["Ajitesh", "Vinay adkz", "Seema", "Nid"]}

# Excel to  dataframes
df1 = pd.DataFrame(dict_lists) #pd.read_excel('file1.xlsx')
df2 = pd.DataFrame(dict_lists2) #pd.read_excel('file2.xlsx')

# Empty lists to stor matched name, match percentage
match_name = []
match_percent = []

# Iterate through the first dataframe
for i, row in df1.iterrows():
    name = row['Names']
    match = difflib.get_close_matches(name, df2['Names'], n=1, cutoff=0.8)
    if match:
        match_name.append(match[0])
        match_string = difflib.SequenceMatcher(None, name, match[0]).find_longest_match(0, len(name), 0, len(match[0]))
        match_percentage = (match_string.size / len(name)) * 100
        match_percent.append(match_percentage)
    else:
        match_name.append('NA')
        match_percent.append(0)
      

df1['File2names'] = match_name
df1['Match_per'] = match_percent
print(df1)

# Write in Excel
# df1.to_excel('output.xlsx', index=False)

I hope this helps you. This is the first time I am answering a question here.
Read also: How to use SequenceMatcher to find similarity between two strings?

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading