Home How to merge two dfs based on a substring of the strings in a column and insert values of another column?

Questions

How to merge two dfs based on a substring of the strings in a column and insert values of another column?

February 17, 2023

I have the following dfs:

data:

ZIP code	urbanisation
1111AA
3916HV

reference:

ZIP code category	urbanisation
1111	High
3916	Medium

So the urbanisation in my data set is empty and I need to fill it using a measure of urbanisation I found online. I want to:

Match column reference["ZIP code category"] with the first 4 digits of data["ZIP code"], but I cannot change the actual ZIP codes. That is, I want to match based on a substring, for example by using data["ZIP code"].str[:4].
For every match paste the corresponding value of reference["urbanisation"] in data["urbanisation"]

I tried this as follows:

pd.merge(
    data, reference,
    left_on=['ZIP code', data["ZIP code"].str[:4]],
    right_on=['ZIP code category', reference["ZIP code category"]]
)

However, this code is not correct and I do not know how to produce the desired result.

>Solution :

You can use join:

data['urbanisation'] = data['ZIP code'].str[:4].map(
                           reference.astype({'ZIP code category': str})
                                    .set_index('ZIP code category')['urbanisation'])
print(data)

# Output
  ZIP code urbanisation
0   1111AA         High
1   3916HV       Medium