Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to merge two dfs based on a substring of the strings in a column and insert values of another column?

I have the following dfs:

data:

ZIP code urbanisation
1111AA
3916HV

reference:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ZIP code category urbanisation
1111 High
3916 Medium

So the urbanisation in my data set is empty and I need to fill it using a measure of urbanisation I found online. I want to:

  • Match column reference["ZIP code category"] with the first 4 digits of data["ZIP code"], but I cannot change the actual ZIP codes. That is, I want to match based on a substring, for example by using data["ZIP code"].str[:4].
  • For every match paste the corresponding value of reference["urbanisation"] in data["urbanisation"]

I tried this as follows:

pd.merge(
    data, reference,
    left_on=['ZIP code', data["ZIP code"].str[:4]],
    right_on=['ZIP code category', reference["ZIP code category"]]
)

However, this code is not correct and I do not know how to produce the desired result.

>Solution :

You can use join:

data['urbanisation'] = data['ZIP code'].str[:4].map(
                           reference.astype({'ZIP code category': str})
                                    .set_index('ZIP code category')['urbanisation'])
print(data)

# Output
  ZIP code urbanisation
0   1111AA         High
1   3916HV       Medium
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading