I have a file with the following columns: 'id','H1','H2','H3','H4','H5','H6','H7','H8','H9','H10','H11','H12','H13','H14','H15','H16','H17','H18','H19','H20','H21','H22','H23','H24'
I want to sum the column values grouped by id, keeping only one row per unique id (i.e., 1,2,3,4,5,6).
I tried the following, but the sum does not work.
col_names=['id','H1','H2','H3','H4','H5','H6','H7','H8','H9','H10','H11','H12','H13','H14','H15','H16','H17','H18','H19','H20','H21','H22','H23','H24']
df = pd.read_csv(i,sep='\t',names=col_names)
cols = df.filter(regex='^H').columns
df.groupby(['id'])[cols].sum()
df_final = df.sort_index(by=['id'], ascending=[True])
df_final.to_csv(outfile,sep='\t',index=False,header=False)
Solution:
You need to assign the result of the groupby operation to a new variable in order to keep the aggregated data; calling `df.groupby(['id'])[cols].sum()` on its own computes the sums and then discards them.
Also, it's better to use the `sort_values` method instead of the deprecated `sort_index(by=...)` form. You can do it this way:
import pandas as pd
col_names=['id','H1','H2','H3','H4','H5','H6','H7','H8','H9','H10','H11','H12','H13','H14','H15','H16','H17','H18','H19','H20','H21','H22','H23','H24']
df = pd.read_csv(i,sep='\t',names=col_names)
cols = df.filter(regex='^H').columns
# as_index=False keeps 'id' as a regular column after aggregation,
# so sort_values(by='id') works and the ids survive to_csv(index=False)
grouped = df.groupby('id', as_index=False)[list(cols)].sum()
unique_ids = [1, 2, 3, 4, 5, 6]
# isin filters to the wanted ids without raising if one is absent
grouped = grouped[grouped['id'].isin(unique_ids)]
grouped = grouped.sort_values(by='id')
grouped.to_csv(outfile, sep='\t', index=False, header=False)
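As a quick sanity check, here is a minimal sketch of the same groupby-and-sum step on a small made-up frame (toy values standing in for your tab-separated file, and only a couple of the H columns):

```python
import pandas as pd

# Hypothetical toy data in place of the real file read with read_csv
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'H1': [10, 20, 5, 5, 7],
    'H2': [1, 1, 2, 2, 3],
})

# Same pattern as the solution: select the H columns, sum per id
cols = df.filter(regex='^H').columns
grouped = df.groupby('id', as_index=False)[list(cols)].sum()

# One row per unique id: H1 sums are 30 for id 1, 10 for id 2, 7 for id 3
print(grouped)
```

Because `as_index=False` is set, `grouped` keeps `id` as an ordinary column, which is what lets the later `sort_values(by='id')` and `to_csv(index=False)` calls behave as expected.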
