I have a dataframe like as shown below
id,Name,country,amount,qty
1,ABC,USA,123,4500
1,ABC,USA,156,3210
1,BCE,USA,687,2137
1,DEF,UK,456,1236
1,ABC,nan,216,324
1,DEF,nan,12678,11241
1,nan,nan,637,213
1,BCE,nan,213,543
1,XYZ,KOREA,432,321
1,XYZ,AUS,231,321
sf = pd.read_clipboard(sep=',')
I would like to do the below
a) Get top 3 based on amount for each id and other selected columns such as Name and country. Meaning, we get top 3 based id and Name first and later, we again get top 3 based on id and country
b) Find out how much does each of the top 3 item contribute to total amount for each unique id.
So, I tried the below
sf_name = sf.groupby(['id','Name'],dropna=False)['amount'].sum().nlargest(3).reset_index().rename(columns={'amount':'Name_amount'})
sf_country = sf.groupby(['id','country'],dropna=False)['amount'].sum().nlargest(3).reset_index().rename(columns={'amount':'country_amount'})
sf_name['total'] = sf.groupby('id')['amount'].sum()
sf_country['total'] = sf.groupby('id')['amount'].sum()
sf_name['name_pct_total'] = (sf_name['Name_amount']/sf_name['total'])*100
sf_country['country_pct_total'] = (sf_country['country_amount']/sf_country['total'])*100
As you can see, I am repeating the same operation for each column.
But in my real dataframe, I have to do this groupby id and find Top3 and compute pct_total % for another 8 columns (along with Name and country)
Is there any efficient, elegant and scalable solution that you can share?
I expect my output to be like as below
>Solution :
Use:
dfs = []
cols = ['Name','country']
for c in cols:
df = sf.groupby(['id',c],dropna=False)['amount'].sum().nlargest(3).reset_index(level=1, name=f'{c}_amount')
df[f'{c}_pct_total'] = df[f'{c}_amount'].div(df.groupby('id')[f'{c}_amount'].transform('sum'))*100
dfs.append(df)
df = pd.concat(dfs, axis=1)
print (df)
Name Name_amount Name_pct_total country country_amount \
id
1 DEF 13134 89.365177 NaN 13744
1 BCE 900 6.123699 USA 966
1 XYZ 663 4.511125 UK 456
country_pct_total
id
1 90.623764
1 6.369511
1 3.006726
