Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Given a value from a pandas column DataFrame, select N rows above and below to that closest value in other DataFrame

I have two pandas DataFrames:

import pandas as pd

data1 = {
    'score': [1, 2],
    'seconds': [1140, 2100],
}

data2 = {
    'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
    'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
}

df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)

Output: df1
   score  seconds
0      1     1140
1      2     2100

Output: df2
    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
4    23.0     1650
5    24.0     1680
6    26.0     1700
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

For each value in seconds column from df1, I would like to get the match (or the closest to) row from df2 and also the closest 2 rows above and below the match.

The seconds columns contains only sorted unique values.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

As result, I expect this:

Output: result
    prize  seconds
0     5.5      840
1    14.5     1080 # closest match to 1140
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100 # match 2100
10   50.0     2160

>Solution :

You can use a merge_asof to identify the closest value to each value in df1, then a rolling.max to extend the selection to the neighboring N rows:

N = 2 # number of surronding rows to keep

s1 = df1['seconds'].sort_values()
s2 = df2['seconds'].sort_values().rename('_')

keep = pd.merge_asof(s1, s2, left_on='seconds', right_on='_',
                     direction='nearest')['_']

out = df2[s2.isin(keep)
            .rolling(2*N+1, center=True, min_periods=1)
            .max().astype(bool)]

NB. if the seconds are already sorted, you can skip the .sort_values().

Output:

    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

Intermediates:

    prize  seconds  closest  isin(keep)  rolling.max
0     5.5      840      NaN       False         True
1    14.5     1080   1140.0        True         True
2    14.6     1380      NaN       False         True
3    21.0     1620      NaN       False         True
4    23.0     1650      NaN       False        False
5    24.0     1680      NaN       False        False
6    26.0     1700      NaN       False        False
7    38.0     1740      NaN       False         True
8    39.0     2040      NaN       False         True
9    40.0     2100   2100.0        True         True
10   50.0     2160      NaN       False         True
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading