How to match a partial string to a range of possible values?

I have the following dataframes:

df1:

ZIP code	Other columns
1011AA	…
1011AA	…
2316XH	…
5815NE	…

df2:

starting value ZIP code range	last value ZIP code range	Province
1000	1200	North-Holland
1201	1500	South-Holland
1501	1570	North-Holland
1571	1600	Den Haag

I want to:

Get the first four digits of df1["ZIP code"]
Check if these four digits are present in any range in df2["starting value ZIP code range"] and df["last value ZIP code range"]
If there is a match, get df2["Province"] and add this value to a column in df1.

The difficulty is that I need to compare this to a range of values and I can only use the first 4 digits of the string. Most examples I found on stackoverflow compare based on a single value. The desired result is:

ZIP code	New column
1011AA	North-Holland
1011AA	North-Holland
2316XH	Haarlem
5815NE	Utrecht

Bonus points if you can do it using map. For example, df1["New column"] = df1["ZIP code"].str[:4].map(... ? ...). However, if the map method is a bad idea please suggest a better method.

>Solution :

As your ranges are non-overlapping, you can use a merge_asof on the starting boundary and filter its output (for example with query) to ensure it’s within the ending boundary:

df1['Province'] = (
 pd.merge_asof(df1.assign(key=df1['ZIP code'].str.extract('(\d+)', expand=False).astype(int)), df2,
               left_on='key', right_on='starting value ZIP code range')
   .query('key <= `last value ZIP code range`')['Province']
)

Output:

  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...            NaN

other example

Let’s add one more entry to df2:

# df2
   starting value ZIP code range  last value ZIP code range       Province
0                           1000                       1200  North-Holland
1                           1201                       1500  South-Holland
2                           1501                       1570  North-Holland
3                           1571                       1600       Den Haag
4                           5000                       6000        Utrecht


# output
  ZIP code Other columns       Province
0   1011AA           ...  North-Holland
1   1011AA           ...  North-Holland
2   2316XH           ...            NaN
3   5815NE           ...        Utrecht

Ensuring the boundaries in df2 are numeric:

df2[['starting value ZIP code range', 'last value ZIP code range']] = \
df2[['starting value ZIP code range', 'last value ZIP code range']].apply(pd.to_numeric, errors='coerce')

Dev solutions

Solutions for development problems

How to match a partial string to a range of possible values?

>Solution :

other example

Like this:

Related

Leave a ReplyCancel reply

>Solution :

other example

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from Dev solutions