Home Given a value from a pandas column DataFrame, select N rows above and below to that closest value in other DataFrame

Questions

Given a value from a pandas column DataFrame, select N rows above and below to that closest value in other DataFrame

byMR

February 27, 2024

I have two pandas DataFrames:

import pandas as pd

data1 = {
    'score': [1, 2],
    'seconds': [1140, 2100],
}

data2 = {
    'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
    'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
}

df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)

Output: df1
   score  seconds
0      1     1140
1      2     2100

Output: df2
    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
4    23.0     1650
5    24.0     1680
6    26.0     1700
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

For each value in seconds column from df1, I would like to get the match (or the closest to) row from df2 and also the closest 2 rows above and below the match.

The seconds columns contains only sorted unique values.

As result, I expect this:

Output: result
    prize  seconds
0     5.5      840
1    14.5     1080 # closest match to 1140
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100 # match 2100
10   50.0     2160

>Solution :

You can use a merge_asof to identify the closest value to each value in df1, then a rolling.max to extend the selection to the neighboring N rows:

N = 2 # number of surronding rows to keep

s1 = df1['seconds'].sort_values()
s2 = df2['seconds'].sort_values().rename('_')

keep = pd.merge_asof(s1, s2, left_on='seconds', right_on='_',
                     direction='nearest')['_']

out = df2[s2.isin(keep)
            .rolling(2*N+1, center=True, min_periods=1)
            .max().astype(bool)]

NB. if the seconds are already sorted, you can skip the .sort_values().

Output:

    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

Intermediates:

    prize  seconds  closest  isin(keep)  rolling.max
0     5.5      840      NaN       False         True
1    14.5     1080   1140.0        True         True
2    14.6     1380      NaN       False         True
3    21.0     1620      NaN       False         True
4    23.0     1650      NaN       False        False
5    24.0     1680      NaN       False        False
6    26.0     1700      NaN       False        False
7    38.0     1740      NaN       False         True
8    39.0     2040      NaN       False         True
9    40.0     2100   2100.0        True         True
10   50.0     2160      NaN       False         True