I have a file with the following columns: 'id','H1','H2','H3','H4','H5','H6','H7','H8','H9','H10','H11','H12','H13','H14','H15','H16','H17','H18','H19','H20','H21','H22','H23','H24'
I want to sum the column values grouped by id, keeping only one row per unique id (i.e., 1,2,3,4,5,6).
I tried the following, but the sum does not work.
col_names=['id','H1','H2','H3','H4','H5','H6','H7','H8','H9','H10','H11','H12','H13','H14','H15','H16','H17','H18','H19','H20','H21','H22','H23','H24']
df = pd.read_csv(i,sep='\t',names=col_names)
cols = df.filter(regex='^H').columns
df.groupby(['id'])[cols].sum()
df_final = df.sort_index(by=['id'], ascending=[True])
df_final.to_csv(outfile,sep='\t',index=False,header=False)
Solution:
You need to assign the result of the groupby operation to a new variable in order to keep the aggregated data; calling `df.groupby(['id'])[cols].sum()` on its own computes the sums and then discards them.
Also, it's better to use the `sort_values` method instead of the deprecated `sort_index(by=...)` form. You can do it this way:
import pandas as pd
col_names=['id','H1','H2','H3','H4','H5','H6','H7','H8','H9','H10','H11','H12','H13','H14','H15','H16','H17','H18','H19','H20','H21','H22','H23','H24']
df = pd.read_csv(i,sep='\t',names=col_names)
cols = df.filter(regex='^H').columns
# as_index=False keeps 'id' as a regular column after aggregation,
# so sort_values(by='id') works and the ids survive to_csv(index=False)
grouped = df.groupby('id', as_index=False)[list(cols)].sum()
unique_ids = [1, 2, 3, 4, 5, 6]
# isin filters to the wanted ids without raising if one is absent
grouped = grouped[grouped['id'].isin(unique_ids)]
grouped = grouped.sort_values(by='id')
grouped.to_csv(outfile, sep='\t', index=False, header=False)
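As a quick sanity check, here is a minimal sketch of the same groupby-and-sum step on a small made-up frame (toy values standing in for your tab-separated file, and only a couple of the H columns):

```python
import pandas as pd

# Hypothetical toy data in place of the real file read with read_csv
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'H1': [10, 20, 5, 5, 7],
    'H2': [1, 1, 2, 2, 3],
})

# Same pattern as the solution: select the H columns, sum per id
cols = df.filter(regex='^H').columns
grouped = df.groupby('id', as_index=False)[list(cols)].sum()

# One row per unique id: H1 sums are 30 for id 1, 10 for id 2, 7 for id 3
print(grouped)
```

Because `as_index=False` is set, `grouped` keeps `id` as an ordinary column, which is what lets the later `sort_values(by='id')` and `to_csv(index=False)` calls behave as expected.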
