How to match a partial string to a range of possible values?

I have the following dataframes:

df1:

ZIP code Other columns
1011AA
1011AA
2316XH
5815NE

df2:

starting value ZIP code range last value ZIP code range Province
1000 1200 North-Holland
1201 1500 South-Holland
1501 1570 North-Holland
1571 1600 Den Haag

I want to:

  1. Get the first four digits of df1["ZIP code"]
  2. Check if these four digits are present in any range in df2["starting value ZIP code range"] and df["last value ZIP code range"]
  3. If there is a match, get df2["Province"] and add this value to a column in df1.

The difficulty is that I need to compare this to a range of values and I can only use the first 4 digits of the string. Most examples I found on stackoverflow compare based on a single value. The desired result is:

ZIP code New column
1011AA North-Holland
1011AA North-Holland
2316XH Haarlem
5815NE Utrecht

Bonus points if you can do it using map. For example, df1["New column"] = df1["ZIP code"].str[:4].map(... ? ...). However, if the map method is a bad idea please suggest a better method.

>Solution :

As your ranges are non-overlapping, you can use a merge_asof on the starting boundary and filter its output (for example with query) to ensure it’s within the ending boundary:

df1['Province'] = (
 pd.merge_asof(df1.assign(key=df1['ZIP code'].str.extract('(\d+)', expand=False).astype(int)), df2,
               left_on='key', right_on='starting value ZIP code range')
   .query('key <= `last value ZIP code range`')['Province']
)

Output:

  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...            NaN

other example

Let’s add one more entry to df2:

# df2
   starting value ZIP code range  last value ZIP code range       Province
0                           1000                       1200  North-Holland
1                           1201                       1500  South-Holland
2                           1501                       1570  North-Holland
3                           1571                       1600       Den Haag
4                           5000                       6000        Utrecht


# output
  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...        Utrecht

Ensuring the boundaries in df2 are numeric:

df2[['starting value ZIP code range', 'last value ZIP code range']] = \
df2[['starting value ZIP code range', 'last value ZIP code range']].apply(pd.to_numeric, errors='coerce')

Leave a Reply