Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas: Join with pratial match (like VLOOKUP) but in certain order

I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. but based on the first part of a string, problem is that firt part is not of a certain lenghts.

Ex: I have refrence data of Gouna and GreenLand, but lookupvalue for Gouna sometimes starts with G and other times starts with Gou and for lookupvalues for GreenLand starts with Gre

I have the following two pandas dataframes:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df1 = pd.DataFrame({'Abb': ['G', 'GRE', 'Gou', 'B'],
                    'FullName': ['Gouna', 'GreenLand', 'Gouna', 'Bahr']})

df2 = pd.DataFrame({'OrderNo': ['INV20561', 'INV20562', 'INV20563', 'INV20564'],
                    'AreaName': ['GRE65335', 'Gou6D654', 'Gddd654', 'B65465']})


print(df1)

   Abb   FullName
0    G      Gouna
1  GRE  GreenLand
2  Gou      Gouna
3    B    Bahrain

print(df2)

    OrderNo  AreaName
0  INV20561  GRE65335
1  INV20562  Gou6D654
2  INV20563   Gddd654
3  INV20564    B65465

and my needed out put should be:

    OrderNo     AreaName    FullName
0   INV20561    GRE65335    GreenLand
1   INV20562    Gou6D654    Gouna
2   INV20563    Gddd654     Gouna
3   INV20564    B65465      Bahr

My approach would be to sort the Abb values in the df1 descendingly by values length:

df1.sort_values(by="Abb", key=lambda x: x.str.len(), ascending=False)

    Abb FullName
1   GRE GreenLand
2   Gou Gouna
0   G   Gouna
3   B   Bahrain

then perform some sort with vlookup with for loop instead of or applying a custom function. and here is where I am stuck

thanks in advance

>Solution :

You can craft a regex to extract the country Abb, then use this as a merging key:

# we need to sort the Abb by decreasing length to ensure
# specific Abb match before more generic (e.g. Gou/GRE match before G)
regex = '|'.join(df1['Abb'].sort_values(key=lambda s: s.str.len(),
                                        ascending=False)
                 )
# 'GRE|Gou|G|B'

out = df2.merge(df1, right_on='Abb',
                left_on=df2['AreaName'].str.extract(f'^({regex})', expand=False)
                )

If case does not matter:

key = df1['Abb'].str.lower()
regex = '|'.join(key
                 .sort_values(key=lambda s: s.str.len(), ascending=False)
                 )
# 'gre|gou|g|b'

out = df2.merge(df1, right_on=key,
                left_on=df2['AreaName']
                        .str.lower()
                        .str.extract(f'^({regex})', expand=False)
                ).drop(columns='key_0')

output:

    OrderNo  AreaName  Abb   FullName
0  INV20561  GRE65335  GRE  GreenLand
1  INV20562  Gou6D654  Gou      Gouna
2  INV20563   Gddd654    G      Gouna
3  INV20564    B65465    B       Bahr
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading