Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Creating subsets of df using pandas groupby and getting a value based on a function

I have df similar to below. I need to select rows where df['Year 2'] is equal or closest to df['Year'] in subsets grouped by df['ID'] so in this example rows 1,2 and 5.

df
    Year     ID   A   Year 2      C
0    2020    12   0    2019       0
1    2020    12   0    2020       0 <-
2    2017    10   1    2017       0 <-
3    2017    10   0    2018       0
4    2019    6    0    2017       0
5    2019    6    1    2018       0 <-

I am trying to achieve that with the following piece of code using group by and passing a function to get the proper row with the closest value for both columns.

df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))

This particular line returns 'int' object is not callable. Any ideas how to fix this line of code or a fresh approach to the problem is appreciated.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

TYIA.

>Solution :

You can subtract both columns by Series.sub, convert to absolute and aggregate indices by minimum values by DataFrameGroupBy.idxmin:

idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()

If need new column filled by boolean use Index.isin:

df['new'] = df.index.isin(idx)
print (df)
   Year  ID  A  Year 2  C    new
0  2020  12  0    2019  0  False
1  2020  12  0    2020  0   True
2  2017  10  1    2017  0   True
3  2017  10  0    2018  0  False
4  2019   6  0    2017  0  False
5  2019   6  1    2018  0   True

If need filter rows use DataFrame.loc:

df1 = df.loc[idx]
print (df1)
   Year  ID  A  Year 2  C
5  2019   6  1    2018  0
2  2017  10  1    2017  0
1  2020  12  0    2020  0

One row solution:

df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading