I have two large objects that I obtained from JSON downloads. In their raw form, the order of the elements matters.
length(species) is the same as length(colors). I'm happy to add another column or use row names. The names_to argument of list_rbind() would be perfect, but…
- list_rbind() doesn’t work, because colors is not a list of dfs.
- as.data.frame() doesn’t work, because colors has lists of varying lengths
- unlist() loses information
species <- c("roses", "tulips", "lilies")
colors <- list(list("red"), list("white", "yellow"), list("pink", "white"))
Desired result
species colors
1 roses red
2 tulips white
3 tulips yellow
4 lilies pink
5 lilies white
I can brute-force the desired result using a for loop, but I've got a million species, each with at least one color and eight on average, so a smarter and faster approach is needed. And no, the real-world data is not character strings.
The existing question "unnest list of lists of different lengths to dataframe" does not seem to address my challenge.
Real world
> str(pg, max.level = 2)
'data.frame': 1206169 obs. of 3 variables:
$ PG : int 1 2 3 4 5 6 7 8 9 10 ...
$ npi:List of 1206169
..$ : int 1376032029 1184159188 1629504501 1598703019 1487200408 1801443619
..$ : int 1588809248
..$ : int 1497791297
Solution:
base R
data.frame(
species = rep(species, times = lengths(colors)),
colors = unlist(colors)
)
# species colors
# 1 roses red
# 2 tulips white
# 3 tulips yellow
# 4 lilies pink
# 5 lilies white
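The rep()/unlist() pairing above only lines up when every inner element holds exactly one value; with deeper nesting, unlist() returns more items than sum(lengths(colors)), and data.frame() may then recycle silently or fail. A minimal sketch of a guarded wrapper follows. The name flatten_pairs and its column names key/value are hypothetical, chosen only for illustration.

```r
# Hypothetical helper (not part of base R): flatten a list against a
# parallel key vector, refusing input that would misalign keys and values.
flatten_pairs <- function(keys, values) {
  stopifnot(length(keys) == length(values))
  n <- lengths(values)                       # number of items per key
  flat <- unlist(values, use.names = FALSE)  # all items, in original order
  # Guard: deeper nesting would make unlist() longer than sum(n).
  stopifnot(length(flat) == sum(n))
  data.frame(
    key = rep(keys, times = n),
    value = flat,
    stringsAsFactors = FALSE
  )
}

species <- c("roses", "tulips", "lilies")
colors <- list(list("red"), list("white", "yellow"), list("pink", "white"))
res <- flatten_pairs(species, colors)
```

Because both rep() and unlist() are vectorised, this should stay fast at the million-row scale, and the guard costs only one extra length comparison.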
dplyr and tidyr
library(dplyr)
library(tidyr) # unnest() is exported by tidyr, not dplyr
tibble(species, colors) %>%
  unnest(colors) %>%
  mutate(colors = unlist(colors))
# # A tibble: 5 × 2
# species colors
# <chr> <chr>
# 1 roses red
# 2 tulips white
# 3 tulips yellow
# 4 lilies pink
# 5 lilies white
With a semblance of your real data:
dat <- data.frame(PG=1:3)
dat$npi <- list(c(1376032029L, 1184159188L, 1629504501L, 1598703019L, 1487200408L, 1801443619L), 1588809248L, 1497791297L)
str(dat)
# 'data.frame': 3 obs. of 2 variables:
# $ PG : int 1 2 3
# $ npi:List of 3
# ..$ : int 1376032029 1184159188 1629504501 1598703019 1487200408 1801443619
# ..$ : int 1588809248
# ..$ : int 1497791297
# base R
dat[,-2,drop=FALSE][rep(1:nrow(dat), times = lengths(dat$npi)),,drop=FALSE] |>
cbind(npi=unlist(dat$npi))
# PG npi
# 1 1 1376032029
# 1.1 1 1184159188
# 1.2 1 1629504501
# 1.3 1 1598703019
# 1.4 1 1487200408
# 1.5 1 1801443619
# 2 2 1588809248
# 3 3 1497791297
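The base R result above keeps the duplicated row names (1, 1.1, 1.2, ...). If sequential row names are preferred, one possible variant is sketched below: it builds the row index once, drops the list column by name, and resets the row names. The variable names idx and out are only illustrative.

```r
dat <- data.frame(PG = 1:3)
dat$npi <- list(
  c(1376032029L, 1184159188L, 1629504501L,
    1598703019L, 1487200408L, 1801443619L),
  1588809248L,
  1497791297L
)

# One row index per unnested value: row i is repeated lengths(npi)[i] times.
idx <- rep(seq_len(nrow(dat)), times = lengths(dat$npi))

# Repeat every column except the list column, then attach the flat values.
out <- dat[idx, setdiff(names(dat), "npi"), drop = FALSE]
out$npi <- unlist(dat$npi, use.names = FALSE)

# Replace row names such as "1.1" with the default sequence.
rownames(out) <- NULL
```

Selecting columns with setdiff() rather than a position like -2 keeps the sketch working if the list column is not the second one.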
# tidyr (unnest() is exported by tidyr, not dplyr)
tidyr::unnest(dat, npi)
# # A tibble: 8 × 2
# PG npi
# <int> <int>
# 1 1 1376032029
# 2 1 1184159188
# 3 1 1629504501
# 4 1 1598703019
# 5 1 1487200408
# 6 1 1801443619
# 7 2 1588809248
# 8 3 1497791297