Wrangle data from long to wide format for cox regression in R

October 13, 2022

I am trying to wrangle some data for a cox regression…

#generate some data
set.seed(1)
ID <- sort(rep(1:10, times = 5))
conditions <- rep(c("asthma", "copd", "af", "cvd", "ckd"), times = 10)
day <- sample(1:100, 50)

#assign to dataframe
df <- data.frame(ID, conditions, day)

I have data in a long format like this:

ID	conditions	day
1	asthma	68
1	copd	39
1	af	1
1	cvd	34
1	ckd	87
2	asthma	43
2	copd	14
2	af	82
2	cvd	59
2	ckd	51

And I need it wrangled to this:
As you can see, ID=1 develops AF on day 1, cvd on day 34 and copd on day 39…
So assuming that this is in order of date…
In rownum 1, the af column changes to 1…
In rownum 2, the af AND cvd changes to 1…
In rownum3, the af AND cvd AND copd changes to 1…
Then it would be the same kind of pattern for all the other IDs.

rownum	ID	day	asthma	copd	af	cvd
1	1	1	0	0	1	0
2	1	34	0	0	1	1
3	1	39	0	1	1	1
4	1	68	1	1	1	1
5	2	14	0	1	0	0
6	2	43	1	1	0	0
7	2	51	1	1	0	1
…	…	…	…	…	…	…

I’ve tried using a lag function, but it just doesn’t work… the lag needs to work for multiple columns as you can see above.

dt[,temp:=ifelse(is.na(reglag(event_dt,1)), as.integer(0), reglag(event_dt,1)), by=ID]
dt[, sequence:=cumsum(temp)+1, by=ID]

func = function(x)
{
  which(c(1,lag(x,1)[-1]) %in% 1) %>%
    c(length(x)+1) %>% 
    diff
}

reglag = function(x,lag) {c(rep(NA,lag), x[lag:(length(x)-1)])}

dt[, cond.time:=func(event_dt) %>% lapply(seq) %>% unlist, by=ID]

Would be very grateful for any help you could give. I also have a massive table, so maybe a loop would cause me memory issues…

Many many thanks in advance
~R

>Solution :

Arrange by ID and day, and pivot_wider. You’ll get 1 for the disease at day d, 0 elsewhere. Use cumsum to add 1 to the consecutive values for each column.

library(dplyr)
library(tidyr)
df %>% 
  arrange(ID, day) %>% 
  mutate(value = 1) %>% 
  pivot_wider(names_from = conditions, values_fill = 0) %>% 
  group_by(ID) %>% 
  mutate(across(af:ckd, cumsum))

output

     ID   day    af   cvd  copd asthma   ckd
   <int> <int> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1     1     1     1     0     0      0     0
 2     1    34     1     1     0      0     0
 3     1    39     1     1     1      0     0
 4     1    68     1     1     1      1     0
 5     1    87     1     1     1      1     1
 6     2    14     0     0     1      0     0
 7     2    43     0     0     1      1     0
 8     2    51     0     0     1      1     1
 9     2    59     0     1     1      1     1
10     2    82     1     1     1      1     1
# … with 40 more rows