Select top n groups in pandas dataframe

October 31, 2023

I have the following dataframe:

    Country Crop    Harvest Year    Area (ha)
Afghanistan Maize   2019    94910
Afghanistan Maize   2020    140498
Afghanistan Maize   2021    92144
Afghanistan Winter Wheat    2019    2334000
Afghanistan Winter Wheat    2020    2668000
Afghanistan Winter Wheat    2021    1833357
Argentina   Maize   2019    7232761
Argentina   Maize   2020    7730506
Argentina   Maize   2021    8146596
Argentina   Winter Wheat    2019    6050953
Argentina   Winter Wheat    2020    6729838
Argentina   Winter Wheat    2021    6394102
China   Maize   2019    41309740
China   Maize   2020    41292000
China   Maize   2021    43355859
China   Winter Wheat    2019    23732560
China   Winter Wheat    2020    23383000
China   Winter Wheat    2021    23571400
Ethiopia    Maize   2019    2274306
Ethiopia    Maize   2020    2363507
Ethiopia    Maize   2021    2530000
Ethiopia    Winter Wheat    2019    1789372
Ethiopia    Winter Wheat    2020    1829051
Ethiopia    Winter Wheat    2021    1950000
France  Maize   2019    1506100
France  Maize   2020    1691130
France  Maize   2021    1549520
France  Winter Wheat    2019    5244250
France  Winter Wheat    2020    4512420
France  Winter Wheat    2021    5276730
India   Maize   2019    9027130
India   Maize   2020    9569060
India   Maize   2021    9860000
India   Winter Wheat    2019    29318780
India   Winter Wheat    2020    31357020
India   Winter Wheat    2021    31610000
Namibia Maize   2019    21123
Namibia Maize   2020    35000
Namibia Maize   2021    46070
Namibia Winter Wheat    2019    1079
Namibia Winter Wheat    2020    2000
Namibia Winter Wheat    2021    3026

I want to select the top 2 countries by the average value of Area (ha) column across the `Harvest Year’s. I tried this but it does not work:

df = df.groupby("Crop", dropna=False).apply( lambda x: x.nlargest(2, "Area (ha)") )

Output should be, here China and india are the countries with the largest average Area (ha) for both maize and Winter Wheat, but in the full datasets different countries would have largest values for different crops:

Country Crop    Harvest Year    Area (ha)
China   Maize   2019    41309740
China   Maize   2020    41292000
China   Maize   2021    43355859
China   Winter Wheat    2019    23732560
China   Winter Wheat    2020    23383000
China   Winter Wheat    2021    23571400
India   Maize   2019    9027130
India   Maize   2020    9569060
India   Maize   2021    9860000
India   Winter Wheat    2019    29318780
India   Winter Wheat    2020    31357020
India   Winter Wheat    2021    31610000

>Solution :

IIUC, you can do double .groupby:

x = (
    df.groupby("Crop")
    .apply(lambda x: x.groupby("Country")["Area (ha)"].mean())
    .stack()
    .groupby(level=0, group_keys=False)
    .nlargest(2)
)

print(x)

Prints top 2 Crop/Countries by average area:

Crop          Country
Maize         China      4.198587e+07
              India      9.485397e+06
Winter Wheat  India      3.076193e+07
              China      2.356232e+07
dtype: float64

Then you can use this index to filter the original dataframe:

out = df.set_index(["Crop", "Country"]).loc[x.index].reset_index()
print(out)

Prints:

            Crop Country  Harvest Year  Area (ha)
0          Maize   China          2019   41309740
1          Maize   China          2020   41292000
2          Maize   China          2021   43355859
3          Maize   India          2019    9027130
4          Maize   India          2020    9569060
5          Maize   India          2021    9860000
6   Winter Wheat   India          2019   29318780
7   Winter Wheat   India          2020   31357020
8   Winter Wheat   India          2021   31610000
9   Winter Wheat   China          2019   23732560
10  Winter Wheat   China          2020   23383000
11  Winter Wheat   China          2021   23571400