Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas create a new column based on exact match of text values

I have two dataframes that look like below

proj_df = pd.DataFrame({'reg_id':[1,2,3,4],
                        'part_no':['P1','P2','P3','P4'],
                        'partner':['A','B','C','D'],
                        'cust_name_1': ['ABC PVT LTD','Tesla','Apple','Google'],
                        'cust_name_2':['ABC','Tesla Ltd','Apple Inc','Google Enterprises'],
                        'cust_name_3':['ABC','Tesla America','Apple America','Google Ent Pvt ltd']})


data_df = pd.DataFrame({'cust_name': ['ABC','Tesla America','Apple Inc','Google','Google','ABC'],
                        'partner':['A','B','C','D','E','A'],
                        'part_no':['P1','P2','P3','P4','P5','P6'],
                        'qty':[100,100,600,150,320,410]})

I would like to do the below

a) Identify the exactly matching customer name column from proj_df by comparing it with data_df.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

b) Ex: select one exactly matching column from cust_name_1, cust_name_2, cust_name_3 by comparing it with cust_name column from data_df. If two or more columns has 100% match, then choose any one of the columns.

b) Compare/merge both dataframes based on matching part_no, partner columns.

I tried the below but it is going nowhere

unique_names = data_df['cust_name'].tolist()
for name in unique_names:
    proj_df['similarity_ratio'] = proj_df.apply(lambda x: difflib.SequenceMatcher(None, name, x.name_1).ratio(), axis=1)

I expect my output to be like as below

enter image description here

>Solution :

If need 100% match use DataFrame.melt with left join, then compare both columns for exact match and for one row per part_no/partner add DataFrame.drop_duplicates:

df = (proj_df.melt(['reg_id','part_no','partner'],
                  var_name='matching_column_name', 
                  value_name='matching_column_value')
             .merge(data_df, on=['part_no','partner']))

df = (df[df['matching_column_value'].eq(df['cust_name'])]
           .drop_duplicates(['part_no','partner']))

Last for matching by original proj_df use left join only by necesary columns:

need = ['part_no','partner', 'matching_column_name','matching_column_value','qty']

df = proj_df.merge(df[need], how='left')
print (df)
   reg_id part_no partner  cust_name_1         cust_name_2  \
0       1      P1       A  ABC PVT LTD                 ABC   
1       2      P2       B        Tesla           Tesla Ltd   
2       3      P3       C        Apple           Apple Inc   
3       4      P4       D       Google  Google Enterprises   

          cust_name_3 matching_column_name matching_column_value  qty  
0                 ABC          cust_name_2                   ABC  100  
1       Tesla America          cust_name_3         Tesla America  100  
2       Apple America          cust_name_2             Apple Inc  600  
3  Google Ent Pvt ltd          cust_name_1                Google  150  

If need maximal similar match solution use SequenceMatcher for ratio and for first maximal ratio rows use DataFrameGroupBy.idxmax:

df = (proj_df.melt(['reg_id','part_no','partner'], 
                   var_name='matching_column_name', 
                   value_name='matching_column_value')
              .merge(data_df, on=['part_no','partner'])
df['similarity_ratio'] = [difflib.SequenceMatcher(None, a, b).ratio() 
                           for a, b in zip(df['matching_column_value'], df['cust_name'])]

df = df.loc[df.groupby(['part_no','partner'])['similarity_ratio'].idxmax()]
print (df)

need = ['part_no','partner', 'matching_column_name','matching_column_value','qty']

df = proj_df.merge(df[need], how='left')
print (df)
   reg_id part_no partner  cust_name_1         cust_name_2  \
0       1      P1       A  ABC PVT LTD                 ABC   
1       2      P2       B        Tesla           Tesla Ltd   
2       3      P3       C        Apple           Apple Inc   
3       4      P4       D       Google  Google Enterprises   

          cust_name_3 matching_column_name matching_column_value  qty  
0                 ABC          cust_name_2                   ABC  100  
1       Tesla America          cust_name_3         Tesla America  100  
2       Apple America          cust_name_2             Apple Inc  600  
3  Google Ent Pvt ltd          cust_name_1                Google  150  
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading