Here is my input DataFrame
df = pd.DataFrame([
['CA', 'LAX', 'Alice'],
['NY', 'NYC', 'Alice'],
['TX', 'HOU', 'Alice'],
['CA', 'SFO', 'Bob'],
['NY', 'NYC', 'Bob'],
['TX', 'AUS', 'Bob'],
['CA', 'LAX', 'Carol'],
['NY', 'ALB', 'Carol'],
['TX', 'HOU', 'Carol'],
['CA', 'LAX', 'Dan'],
['NY', 'NYC', 'Dan'],
['TX', 'DAL', 'Dan'],
['CA', 'SFO', 'Eve'],
['NY', 'NYC', 'Eve'],
['TX', 'AUS', 'Eve'],
['CA', 'SDO', 'Frank'],
['NY', 'NYC', 'Frank'],
['TX', 'HOU', 'Frank']],
columns=['State','City','Person'])
And here is the output I wish to get:
State City Count
CA LAX 3
SFO 2
NY ALB 1
NYC 5
TX HOU 3
AUS 2
Basically, I want to show the top 2 cities having the highest count of people, for each state.
I have tried using this groupby statement:
df.groupby(['State','City'])['Person'].count()
which generates below output
State City
CA LAX 3
SDO 1
SFO 2
NY ALB 1
NYC 5
TX AUS 2
DAL 1
HOU 3
But I am unable to come up with statements which could sort the above counts and pick the highest two counts. Would really appreciate the help!
>Solution :
I think you can simplify the approach.
You can use groupby.value_counts(), instead of count(), to get where you are, and then chain another groupby().head(2) to get the top 2 Cities per each state.
The reason is that value_counts() resulting object will be in descending order, so that the first element is the most frequently-occurring element, so you can use head(2) as it is.
df.groupby('State')['City'].value_counts().groupby('State').head(2)
State City
CA LAX 3
SFO 2
NY NYC 5
ALB 1
TX HOU 3
AUS 2
Name: City, dtype: int64