Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Using pivot_longer with hunderds of variables

I have large dataset in wide format (1200 obs. of 287 variables) where most variables are time-varying over a 10 years range and a few are constant over time. Time-varying variable labels do not follow a particular pattern except for ending with a year ranging from 2011 to 2020.

I would like to reshape this in a long format but I’m having a very hard time figuring out how to do it in a reasonably quick way.

Here is a simple data frame with comparable characteristics:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

#Packages
library(dplyr)
library(tidyr)

#Reproducibility
set.seed(123)

#Generate data frame
wide <- as.data.frame(matrix(runif(n=60, min=0, max=10), nrow=5))

#Naming rows and columns
names(wide) = c("day2011", 
               "night2011", 
               "in_between2011",
               "day2012", 
               "night2012", 
               "in_between2012",
               "day2013", 
               "night2013", 
               "in_between2013",
               "house",
               "tree",
               "dog"
               )
wide$id <-  c("red", "blue", "yellow", "black", "green")

I’m slowly working my way to it doing this, but I’m not sure it is sustainable on a large dataset:

long_day <- pivot_longer(data = wide,
                     cols = starts_with("day"),
                     #cols_vary = "slowest",
                     names_to = "year",
                     values_to = "Day") %>%
  select(-(night2011:in_between2013)) %>%
  mutate(year = gsub("day", "", year))
      
long_night <- pivot_longer(data = wide,
                         cols = starts_with("night"),
                         #cols_vary = "slowest",
                         names_to = "year",
                         values_to = "Night") %>%
  select(-(day2011:in_between2013))%>%
  mutate(year = gsub("night", "", year))   

long_24 <- full_join(long_day, long_night, by = c("id", "year", "house", "dog", "tree"))

>Solution :

You need to write a regex that captures the time variables. The pattern is that they run from 2011-2020. Capture all variables that have those years at the end as the varying variables:

wide %>%
   pivot_longer(matches('20(1[1-9]|20)$'),
                names_to = c('.value', 'year'),
                names_pattern = "(.*?)(\\d+)")

# A tibble: 15 × 8
   house  tree   dog id     year    day night in_between
   <dbl> <dbl> <dbl> <chr>  <chr> <dbl> <dbl>      <dbl>
 1  1.39 0.458  2.07 red    2011  2.88  0.456       9.57
 2  1.39 0.458  2.07 red    2012  9.00  8.90        7.09
 3  1.39 0.458  2.07 red    2013  9.63  4.78        1.43
 4  2.33 4.42   1.28 blue   2011  7.88  5.28        4.53
 5  2.33 4.42   1.28 blue   2012  2.46  6.93        5.44
 6  2.33 4.42   1.28 blue   2013  9.02  7.58        4.15
 7  4.66 7.99   7.53 yellow 2011  4.09  8.92        6.78
 8  4.66 7.99   7.53 yellow 2012  0.421 6.41        5.94
 9  4.66 7.99   7.53 yellow 2013  6.91  2.16        4.14
10  2.66 1.22   8.95 black  2011  8.83  5.51        5.73
11  2.66 1.22   8.95 black  2012  3.28  9.94        2.89
12  2.66 1.22   8.95 black  2013  7.95  3.18        3.69
13  8.58 5.61   3.74 green  2011  9.40  4.57        1.03
14  8.58 5.61   3.74 green  2012  9.55  6.56        1.47
15  8.58 5.61   3.74 green  2013  0.246 2.32        1.52
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading