I have two pandas DataFrames:
import pandas as pd
data1 = {
'score': [1, 2],
'seconds': [1140, 2100],
}
data2 = {
'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
}
df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)
Output: df1
score seconds
0 1 1140
1 2 2100
Output: df2
prize seconds
0 5.5 840
1 14.5 1080
2 14.6 1380
3 21.0 1620
4 23.0 1650
5 24.0 1680
6 26.0 1700
7 38.0 1740
8 39.0 2040
9 40.0 2100
10 50.0 2160
For each value in seconds column from df1, I would like to get the match (or the closest to) row from df2 and also the closest 2 rows above and below the match.
The seconds columns contains only sorted unique values.
As result, I expect this:
Output: result
prize seconds
0 5.5 840
1 14.5 1080 # closest match to 1140
2 14.6 1380
3 21.0 1620
7 38.0 1740
8 39.0 2040
9 40.0 2100 # match 2100
10 50.0 2160
>Solution :
You can use a merge_asof to identify the closest value to each value in df1, then a rolling.max to extend the selection to the neighboring N rows:
N = 2 # number of surronding rows to keep
s1 = df1['seconds'].sort_values()
s2 = df2['seconds'].sort_values().rename('_')
keep = pd.merge_asof(s1, s2, left_on='seconds', right_on='_',
direction='nearest')['_']
out = df2[s2.isin(keep)
.rolling(2*N+1, center=True, min_periods=1)
.max().astype(bool)]
NB. if the seconds are already sorted, you can skip the .sort_values().
Output:
prize seconds
0 5.5 840
1 14.5 1080
2 14.6 1380
3 21.0 1620
7 38.0 1740
8 39.0 2040
9 40.0 2100
10 50.0 2160
Intermediates:
prize seconds closest isin(keep) rolling.max
0 5.5 840 NaN False True
1 14.5 1080 1140.0 True True
2 14.6 1380 NaN False True
3 21.0 1620 NaN False True
4 23.0 1650 NaN False False
5 24.0 1680 NaN False False
6 26.0 1700 NaN False False
7 38.0 1740 NaN False True
8 39.0 2040 NaN False True
9 40.0 2100 2100.0 True True
10 50.0 2160 NaN False True