Pandas groupby: get max value in a subgroup

August 4, 2022

I have a large dataset grouped by column, row, year, potveg, and total. I am trying to get the max value of ‘total’ column in a specific year of a group. i.e., for the dataset below;

col      row    year    potveg  total

-125.0  42.5    2015    9       697.3
                2015    13      535.2
                2015    15      82.3
                2016    9       907.8
                2016    13      137.6
                2016    15      268.4
                2017    9       961.9
                2017    13      74.2
                2017    15      248.0
                2018    9       937.9
                2018    13      575.6
                2018    15      215.5
-135.0  70.5    2015    8       697.3
                2015    10      535.2
                2015    19      82.3
                2016    8       907.8
                2016    10      137.6
                2016    19      268.4
                2017    8       961.9
                2017    10      74.2
                2017    19      248.0
                2018    8       937.9
                2018    10      575.6
                2018    19      215.5

I want my output to look like this:

col      row    year    potveg  total

-125.0  42.5    2015    9       697.3
                2016    9       907.8
                2017    9       961.9
                2018    9       937.9
-135.0  70.5    2015    8       697.3
                2016    8       907.8
                2017    8       961.9
                2018    8       937.9

I tried

df.groupby(['col','row','year','potveg']).agg({'total':'max'})

and

df.groupby(['col','row','year','potveg'])['total'].max()

but those don’t seem to work because it outputs the same results.
I think the issue is the ‘potveg’ column which is a subgroup. I am not sure how to select rows containing max value of ‘total’.

>Solution :

One possible solution, using .idxmax() inside groupby.apply:

print(
    df.groupby(["col", "row", "year"], as_index=False, sort=False).apply(
        lambda x: x.loc[x["total"].idxmax()]
    )
)

Prints:

     col   row    year  potveg  total
0 -125.0  42.5  2015.0     9.0  697.3
1 -125.0  42.5  2016.0     9.0  907.8
2 -125.0  42.5  2017.0     9.0  961.9
3 -125.0  42.5  2018.0     9.0  937.9
4 -135.0  70.5  2015.0     8.0  697.3
5 -135.0  70.5  2016.0     8.0  907.8
6 -135.0  70.5  2017.0     8.0  961.9
7 -135.0  70.5  2018.0     8.0  937.9

DataFrame used:

col	row	year	potveg	total
-125	42.5	2015	9	697.3
-125	42.5	2015	13	535.2
-125	42.5	2015	15	82.3
-125	42.5	2016	9	907.8
-125	42.5	2016	13	137.6
-125	42.5	2016	15	268.4
-125	42.5	2017	9	961.9
-125	42.5	2017	13	74.2
-125	42.5	2017	15	248
-125	42.5	2018	9	937.9
-125	42.5	2018	13	575.6
-125	42.5	2018	15	215.5
-135	70.5	2015	8	697.3
-135	70.5	2015	10	535.2
-135	70.5	2015	19	82.3
-135	70.5	2016	8	907.8
-135	70.5	2016	10	137.6
-135	70.5	2016	19	268.4
-135	70.5	2017	8	961.9
-135	70.5	2017	10	74.2
-135	70.5	2017	19	248
-135	70.5	2018	8	937.9
-135	70.5	2018	10	575.6
-135	70.5	2018	19	215.5