Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Averaging across columns based on their names in dataframe in R

I was wondering if there is an efficient way to take the average of each set of similarly named columns whose names end in ...1S and ...2S (ex: ex1S,ex2S) at time==1 and take the average of each set of similarly named columns whose names end in ...1C or ...2C (ex: ex1C,ex2C) at time==2 to achieve my Desired_output below?

Note: In my actual Data, I have several columns ending in ...1S and ...1C to average across, so, a functional answer is much appreciated.

My current inefficient solution is:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

library(tidyverse)
Data %>%
mutate(ave_ex = case_when(
  time == 1 ~ mean(c(ex1S, ex2S)),
  time == 2 ~ mean(c(ex1C, ex2C))
), 
ave_id = case_when(
  time == 1 ~ mean(c(id1S, id2S)),
  time == 2 ~ mean(c(id1C, id2C))
)) %>% select(-c(ex1S:id2C))
Data = read.table(text = "

order DV    score  time  ex1S  ex2S  ex1C  ex2C  id1S  id2S  id1C  id2C     k     t
s-c   ac        1     1     8     5     6     1     2     4     3     7   400    30
s-c   bc        2     1     8     5     6     1     2     4     3     7   400    30
s-c   ac        3     2     8     5     6     1     2     4     3     7   600    50
s-c   bc        4     2     8     5     6     1     2     4     3     7   600    50

", header = TRUE)
Desired_output = "

order time DV score   k   t  ave_ex        ave_id
s-c   1    ac 1      400  30 (8+5)/2 =6.5  (2+4)/2 =3
s-c   1    bc 2      400  30 (8+5)/2 =6.5  (2+4)/2 =3
s-c   2    ac 3      600  50 (6+1)/2 =3.5  (3+7)/2 =5
s-c   2    bc 4      600  50 (6+1)/2 =3.5  (3+7)/2 =5

"

where ave_ = average i.e., mean().

>Solution :

If we need to automate, we may need a matching with the time and the C, S category. Below, we select only the ‘ex’, ‘id’ columns, then split, the data into a list of datasets having matching substring of column names i.e. without the digits (str_remove), loop over the list with map, get the mean, enframe to a two column tibble, create time column based on the substring in name, reshape to wide with pivot_wider and use the time column to join with original Data and select the columns of interest

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
 Data %>% 
  dplyr::select(starts_with("ex"), starts_with("id")) %>% 
  split.default(str_remove(names(.), "\\d+")) %>% 
  map(~ mean(rowMeans(.x))) %>% 
     enframe %>% 
     unnest(value) %>%
     mutate(time = c(1, 2)[str_detect(name, "C") + 1], 
     name = str_c("ave_", str_remove(name, "[A-Z]+"))) %>% 
   pivot_wider(names_from = name, values_from = value) %>% 
   left_join(Data, .) %>% 
   dplyr::select(-starts_with("ex"), -starts_with("id"))

-output

   order DV score time   k  t ave_ex ave_id
1   s-c ac     1    1 400 30    6.5      3
2   s-c bc     2    1 400 30    6.5      3
3   s-c ac     3    2 600 50    3.5      5
4   s-c bc     4    2 600 50    3.5      5
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading