Advertisements
I have the following dataframes
:
df1
:
ZIP code | Other columns |
---|---|
1011AA | … |
1011AA | … |
2316XH | … |
5815NE | … |
df2
:
starting value ZIP code range | last value ZIP code range | Province |
---|---|---|
1000 | 1200 | North-Holland |
1201 | 1500 | South-Holland |
1501 | 1570 | North-Holland |
1571 | 1600 | Den Haag |
I want to:
- Get the first four digits of
df1["ZIP code"]
- Check if these four digits are present in any range in
df2["starting value ZIP code range"]
anddf["last value ZIP code range"]
- If there is a match, get
df2["Province"]
and add this value to a column indf1
.
The difficulty is that I need to compare this to a range of values and I can only use the first 4 digits of the string. Most examples I found on stackoverflow compare based on a single value. The desired result is:
ZIP code | New column |
---|---|
1011AA | North-Holland |
1011AA | North-Holland |
2316XH | Haarlem |
5815NE | Utrecht |
Bonus points if you can do it using map. For example, df1["New column"] = df1["ZIP code"].str[:4].map(... ? ...)
. However, if the map
method is a bad idea please suggest a better method.
>Solution :
As your ranges are non-overlapping, you can use a merge_asof
on the starting boundary and filter its output (for example with query
) to ensure it’s within the ending boundary:
df1['Province'] = (
pd.merge_asof(df1.assign(key=df1['ZIP code'].str.extract('(\d+)', expand=False).astype(int)), df2,
left_on='key', right_on='starting value ZIP code range')
.query('key <= `last value ZIP code range`')['Province']
)
Output:
ZIP code Other columns Province
0 1011AA ... North-Holland
1 1011AA ... North-Holland
2 2316XH ... NaN
3 5815NE ... NaN
other example
Let’s add one more entry to df2
:
# df2
starting value ZIP code range last value ZIP code range Province
0 1000 1200 North-Holland
1 1201 1500 South-Holland
2 1501 1570 North-Holland
3 1571 1600 Den Haag
4 5000 6000 Utrecht
# output
ZIP code Other columns Province
0 1011AA ... North-Holland
1 1011AA ... North-Holland
2 2316XH ... NaN
3 5815NE ... Utrecht
Ensuring the boundaries in df2
are numeric:
df2[['starting value ZIP code range', 'last value ZIP code range']] = \
df2[['starting value ZIP code range', 'last value ZIP code range']].apply(pd.to_numeric, errors='coerce')