Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Wrangle data from long to wide format for cox regression in R

I am trying to wrangle some data for a cox regression…

#generate some data
set.seed(1)
ID <- sort(rep(1:10, times = 5))
conditions <- rep(c("asthma", "copd", "af", "cvd", "ckd"), times = 10)
day <- sample(1:100, 50)

#assign to dataframe
df <- data.frame(ID, conditions, day)

I have data in a long format like this:

ID conditions day
1 asthma 68
1 copd 39
1 af 1
1 cvd 34
1 ckd 87
2 asthma 43
2 copd 14
2 af 82
2 cvd 59
2 ckd 51

And I need it wrangled to this:
As you can see, ID=1 develops AF on day 1, cvd on day 34 and copd on day 39…
So assuming that this is in order of date…
In rownum 1, the af column changes to 1…
In rownum 2, the af AND cvd changes to 1…
In rownum3, the af AND cvd AND copd changes to 1…
Then it would be the same kind of pattern for all the other IDs.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

rownum ID day asthma copd af cvd
1 1 1 0 0 1 0
2 1 34 0 0 1 1
3 1 39 0 1 1 1
4 1 68 1 1 1 1
5 2 14 0 1 0 0
6 2 43 1 1 0 0
7 2 51 1 1 0 1

I’ve tried using a lag function, but it just doesn’t work… the lag needs to work for multiple columns as you can see above.

dt[,temp:=ifelse(is.na(reglag(event_dt,1)), as.integer(0), reglag(event_dt,1)), by=ID]
dt[, sequence:=cumsum(temp)+1, by=ID]

func = function(x)
{
  which(c(1,lag(x,1)[-1]) %in% 1) %>%
    c(length(x)+1) %>% 
    diff
}

reglag = function(x,lag) {c(rep(NA,lag), x[lag:(length(x)-1)])}

dt[, cond.time:=func(event_dt) %>% lapply(seq) %>% unlist, by=ID]

Would be very grateful for any help you could give. I also have a massive table, so maybe a loop would cause me memory issues…

Many many thanks in advance
~R

>Solution :

Arrange by ID and day, and pivot_wider. You’ll get 1 for the disease at day d, 0 elsewhere. Use cumsum to add 1 to the consecutive values for each column.

library(dplyr)
library(tidyr)
df %>% 
  arrange(ID, day) %>% 
  mutate(value = 1) %>% 
  pivot_wider(names_from = conditions, values_fill = 0) %>% 
  group_by(ID) %>% 
  mutate(across(af:ckd, cumsum))

output

     ID   day    af   cvd  copd asthma   ckd
   <int> <int> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1     1     1     1     0     0      0     0
 2     1    34     1     1     0      0     0
 3     1    39     1     1     1      0     0
 4     1    68     1     1     1      1     0
 5     1    87     1     1     1      1     1
 6     2    14     0     0     1      0     0
 7     2    43     0     0     1      1     0
 8     2    51     0     0     1      1     1
 9     2    59     0     1     1      1     1
10     2    82     1     1     1      1     1
# … with 40 more rows
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading