pandas – show N highest counts in a group-by dataframe

December 23, 2021

Here is my input DataFrame

df = pd.DataFrame([
 ['CA', 'LAX', 'Alice'],
 ['NY', 'NYC', 'Alice'],
 ['TX', 'HOU', 'Alice'],
 ['CA', 'SFO', 'Bob'], 
 ['NY', 'NYC', 'Bob'],
 ['TX', 'AUS', 'Bob'], 
 ['CA', 'LAX', 'Carol'],
 ['NY', 'ALB', 'Carol'], 
 ['TX', 'HOU', 'Carol'],
 ['CA', 'LAX', 'Dan'],
 ['NY', 'NYC', 'Dan'],
 ['TX', 'DAL', 'Dan'],
 ['CA', 'SFO', 'Eve'],
 ['NY', 'NYC', 'Eve'],
 ['TX', 'AUS', 'Eve'], 
 ['CA', 'SDO', 'Frank'],
 ['NY', 'NYC', 'Frank'],
 ['TX', 'HOU', 'Frank']], 
 columns=['State','City','Person'])

And here is the output I wish to get:

State City  Count       
CA    LAX    3
      SFO    2
NY    ALB    1
      NYC    5
TX    HOU    3
      AUS    2

Basically, I want to show the top 2 cities having the highest count of people, for each state.

I have tried using this groupby statement:
df.groupby(['State','City'])['Person'].count()

which generates below output

State  City         
CA     LAX    3
       SDO    1
       SFO    2
NY     ALB    1
       NYC    5
TX     AUS    2
       DAL    1
       HOU    3

But I am unable to come up with statements which could sort the above counts and pick the highest two counts. Would really appreciate the help!

>Solution :

I think you can simplify the approach.

You can use groupby.value_counts(), instead of count(), to get where you are, and then chain another groupby().head(2) to get the top 2 Cities per each state.

The reason is that value_counts() resulting object will be in descending order, so that the first element is the most frequently-occurring element, so you can use head(2) as it is.

df.groupby('State')['City'].value_counts().groupby('State').head(2)

State  City
CA     LAX     3
       SFO     2
NY     NYC     5
       ALB     1
TX     HOU     3
       AUS     2
Name: City, dtype: int64