I am trying to compare columns from 2 dataframes and return the difference, ignoring case. Here is what I have so far:
import pandas as pd
if __name__ == "__main__":
data1={'Name':['Karan','Rohit','Sahil','Aryan']}
data2={'Name':['karan','Rohit','Sahil']}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(list(set(df1['Name']).difference(df2['Name'])))
This code prints ['Karan', 'Aryan']. How do I modify this to ignore case so that karan and Karan are recognized as a match and only Aryan is returned?
I don’t want to use the following because it returns aryan and I want to maintain the capitalization of the row. In my real case, they are not first names so it isn’t as easy as making the first letter capitalized again after taking the difference.
print(list(set(df1['Name'].str.lower()).difference(df2['Name'].str.lower())))
>Solution :
To perform a case insensitive comparison, use str.casefold:
print(list(set(df1['Name'].str.casefold()).difference(df2['Name'].str.casefold())))
If you want to keep the original case use boolean indexing with isin:
df1.loc[~df1['Name'].str.casefold().isin(df2['Name'].str.casefold()), 'Name'].unique()
Output:
array(['Aryan'], dtype=object)