I have the following DataFrames:
data:
| ZIP code | urbanisation |
|---|---|
| 1111AA | |
| 3916HV | |
reference:
| ZIP code category | urbanisation |
|---|---|
| 1111 | High |
| 3916 | Medium |
So the urbanisation in my data set is empty and I need to fill it using a measure of urbanisation I found online. I want to:
- Match column reference["ZIP code category"] with the first 4 digits of data["ZIP code"], but I cannot change the actual ZIP codes. That is, I want to match based on a substring, for example by using data["ZIP code"].str[:4].
- For every match, paste the corresponding value of reference["urbanisation"] in data["urbanisation"].
I tried this as follows:
pd.merge(
    data, reference,
    left_on=['ZIP code', data["ZIP code"].str[:4]],
    right_on=['ZIP code category', reference["ZIP code category"]]
)
However, this code is not correct and I do not know how to produce the desired result.
>Solution :
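You can use map with a lookup Series indexed by ZIP code category (cast to str so it compares equal to the string prefix of the ZIP code):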
data['urbanisation'] = data['ZIP code'].str[:4].map(
    reference.astype({'ZIP code category': str})
             .set_index('ZIP code category')['urbanisation'])
print(data)
# Output
  ZIP code urbanisation
0   1111AA         High
1   3916HV       Medium
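If you would rather stay with merge, as in the attempt from the question, one possible sketch is to join on a temporary key column. This assumes the frames look like the ones in the question and that ZIP code category is stored as integers; the helper column name zip4 is made up for illustration and is dropped again afterwards, so the original ZIP codes are never modified.

```python
import pandas as pd

# Sample frames rebuilt from the question (the integer dtype of
# "ZIP code category" is an assumption).
data = pd.DataFrame({"ZIP code": ["1111AA", "3916HV"],
                     "urbanisation": [None, None]})
reference = pd.DataFrame({"ZIP code category": [1111, 3916],
                          "urbanisation": ["High", "Medium"]})

# Cast the reference key to str so it can match the 4-character prefix.
ref = reference.astype({"ZIP code category": str})

merged = (
    data.drop(columns="urbanisation")            # avoid _x/_y suffix clash
        .assign(zip4=data["ZIP code"].str[:4])   # hypothetical temporary key
        .merge(ref, left_on="zip4", right_on="ZIP code category", how="left")
        .drop(columns=["zip4", "ZIP code category"])
)
print(merged)
```

With how="left", every row of data is kept, and rows whose prefix has no entry in reference end up with NaN in urbanisation. For a single lookup column the map version above is shorter; merge becomes more convenient when several columns need to be brought over from reference at once.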