I have a dataframe like as shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # i also tried this but how to do for only two columns (and not all columns)
But if I wish to compute average for past 12 months, then it may not be elegant to write this way. Is there any other better or efficient way to write this? I can just key in number of columns to look back and it can compute the average based on keyed in input
I expect my output to be like as below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1867 1240
2 5678 3425 3255 2345 2800 3675.75
>Solution :
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75