Home Retain original number of rows with dataset to be matched, when pairing values from two different datasets in Pandas

Questions

Retain original number of rows with dataset to be matched, when pairing values from two different datasets in Pandas

byMR

July 13, 2023

Data

df1

ID                  stat
AA1                 exzone
BB2                 exzone5
CC4                 limit5

df2

name                state
AA1                 NY
AA1                 NY
AA1                 NY
AA1                 NY
BB2                 GA
BB2                 GA
BB2                 GA
CC4                 CA
CC4                 CA

Desired

name                stat          state
AA1                 exzone        NY
BB2                 exzone5       GA
CC4                 limit5        CA

Doing

out = pd.merge(df1,df2, left_on=['ID'], right_on= ['name'], how="left")

however, the above script is giving an exploded output and does not retain the original Left dataframe row count. Any suggestion is appreciated.

>Solution :

A left merge doesn’t mean that the structure will be identical to that of the original left DataFrame. It means that all the left keys will be preserved, even if absent from the right DataFrame. In your case the duplicated keys on the right force the merge to compute all combinations of the rows.

You need to first remove the duplicates:

out = pd.merge(df1, df2.drop_duplicates(), left_on=['ID'], right_on= ['name'], how="left")

If for some reason you have several different states per name, you should find another way to aggregate (pick the first, last, combine the unique states as a single string, etc.), or accept to have row duplications.