Averaging across columns based on their names in dataframe in R

December 28, 2022

I was wondering if there is an efficient way to take the average of each set of similarly named columns whose names end in ...1S and ...2S (ex: ex1S,ex2S) at time==1 and take the average of each set of similarly named columns whose names end in ...1C or ...2C (ex: ex1C,ex2C) at time==2 to achieve my Desired_output below?

Note: In my actual Data, I have several columns ending in ...1S and ...1C to average across, so, a functional answer is much appreciated.

My current inefficient solution is:

library(tidyverse)
Data %>%
mutate(ave_ex = case_when(
  time == 1 ~ mean(c(ex1S, ex2S)),
  time == 2 ~ mean(c(ex1C, ex2C))
), 
ave_id = case_when(
  time == 1 ~ mean(c(id1S, id2S)),
  time == 2 ~ mean(c(id1C, id2C))
)) %>% select(-c(ex1S:id2C))

Data = read.table(text = "

order DV    score  time  ex1S  ex2S  ex1C  ex2C  id1S  id2S  id1C  id2C     k     t
s-c   ac        1     1     8     5     6     1     2     4     3     7   400    30
s-c   bc        2     1     8     5     6     1     2     4     3     7   400    30
s-c   ac        3     2     8     5     6     1     2     4     3     7   600    50
s-c   bc        4     2     8     5     6     1     2     4     3     7   600    50

", header = TRUE)

Desired_output = "

order time DV score   k   t  ave_ex        ave_id
s-c   1    ac 1      400  30 (8+5)/2 =6.5  (2+4)/2 =3
s-c   1    bc 2      400  30 (8+5)/2 =6.5  (2+4)/2 =3
s-c   2    ac 3      600  50 (6+1)/2 =3.5  (3+7)/2 =5
s-c   2    bc 4      600  50 (6+1)/2 =3.5  (3+7)/2 =5

"

where ave_ = average i.e., mean().

>Solution :

If we need to automate, we may need a matching with the time and the C, S category. Below, we select only the ‘ex’, ‘id’ columns, then split, the data into a list of datasets having matching substring of column names i.e. without the digits (str_remove), loop over the list with map, get the mean, enframe to a two column tibble, create time column based on the substring in name, reshape to wide with pivot_wider and use the time column to join with original Data and select the columns of interest

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
 Data %>% 
  dplyr::select(starts_with("ex"), starts_with("id")) %>% 
  split.default(str_remove(names(.), "\\d+")) %>% 
  map(~ mean(rowMeans(.x))) %>% 
     enframe %>% 
     unnest(value) %>%
     mutate(time = c(1, 2)[str_detect(name, "C") + 1], 
     name = str_c("ave_", str_remove(name, "[A-Z]+"))) %>% 
   pivot_wider(names_from = name, values_from = value) %>% 
   left_join(Data, .) %>% 
   dplyr::select(-starts_with("ex"), -starts_with("id"))

-output

   order DV score time   k  t ave_ex ave_id
1   s-c ac     1    1 400 30    6.5      3
2   s-c bc     2    1 400 30    6.5      3
3   s-c ac     3    2 600 50    3.5      5
4   s-c bc     4    2 600 50    3.5      5