Apply Levenshtein distance from rapidfuzz.distance to dataframe with two columns

I have a csv file that looks as follows:

ID; name1; name2
1; John Doe; John Does
2; Mike Johnson; Mike Jonson
3; Leon Mill; Leon Miller
4; Jack Jo; Jack Joe

Now I want to calculate the Levenshtein distance for each pair of names: compare "John Doe" to "John Does" and put the result into a new column, then compare "Mike Johnson" to "Mike Jonson", and so on. The output would be as follows:

ID; name1; name2;ld
1; John Doe; John Does;1
2; Mike Johnson; Mike Jonson;1
3; Leon Mill; Leon Miller;2
4; Jack Jo; Jack Joe;1
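
The expected ld values above can be checked without any library. The following is a minimal sketch of the textbook dynamic-programming edit distance, not rapidfuzz's implementation; the function name `levenshtein` and the `pairs` list are introduced here purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


# Hypothetical in-memory copy of the name pairs from the sample file.
pairs = [
    ("John Doe", "John Does"),
    ("Mike Johnson", "Mike Jonson"),
    ("Leon Mill", "Leon Miller"),
    ("Jack Jo", "Jack Joe"),
]
distances = [levenshtein(a, b) for a, b in pairs]
print(distances)  # [1, 1, 2, 1]
```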

I tried it (see How do I calculate the Levenshtein distance between two Pandas DataFrame columns?) as follows:

from rapidfuzz.distance import Levenshtein
import pandas as pd

df = pd.read_csv(r'C:\Users\myuser\Downloads\Testfile.csv', sep=";")
print(df)

df['ld']=df.apply(lambda x: Levenshtein.distance(df['name1'], df['name2']), axis=1)

But I am getting an error:

KeyError: 'name1'

Where is my mistake?

Solution:

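Inside the lambda, index the row variable x instead of the whole dataframe df. With axis=1, apply passes each row to the lambda as x, so x['name1'] and x['name2'] are the two strings for that row: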

df['ld']=df.apply(lambda x: Levenshtein.distance(x['name1'], x['name2']), axis=1)