Generate a Subset of Combinations in R

February 10, 2024

I wish to generate a data set in R that mimics responses on five variables (x1, x2, x3, x4, x5), each with five levels (1, 2, 3, 4, 5).

I’d like the data set to have around n = 15000 responses, and to be characterised by around 75% of the total possible combinations.

Therefore, approximately 75% of the 5^5 = 3125 possible combinations should be covered in the data set of around n = 15000 observations.
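
For reference, the target can be worked out directly in R. This is only a small sketch of the arithmetic; the variable names are chosen here for illustration.

```r
n_vars   <- 5   # x1 ... x5
n_levels <- 5   # levels 1 ... 5

total_combos  <- n_levels^n_vars      # 3125 possible combinations
target_combos <- 0.75 * total_combos  # 2343.75, i.e. roughly 2,344 combinations
```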

Would anyone be able to show me how this can be executed in R, please?

Solution:

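library(tidyr)  # provides pivot_wider()

set.seed(42)  # make the random draws reproducible
# Long format: one row per (observation, variable) pair with a random level 1-5
data.frame(obs = 1:15000,
           q = rep(paste0("x", 1:5), each = 15000),
           level = sample(1:5, 15000 * 5, TRUE)) |>
  pivot_wider(names_from = q, values_from = level)  # one column per variable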

Produces

# A tibble: 15,000 × 6
     obs    x1    x2    x3    x4    x5
   <int> <int> <int> <int> <int> <int>
 1     1     1     4     4     5     1
 2     2     5     4     3     5     4
 3     3     1     5     4     3     4
 4     4     1     3     5     4     5
 5     5     2     5     5     3     5
 6     6     4     1     5     1     1
 7     7     2     5     2     1     4
 8     8     2     1     2     1     4
 9     9     1     2     3     3     3
10    10     4     5     1     3     5
# ℹ 14,990 more rows

…

We can add |> unite("combo", x1:x5, remove = FALSE) |> count(combo) (unite() is from tidyr, count() from dplyr) to see that this covers 3,094 of the 3,125 possible level combinations. That is about 99%, which is higher than the roughly 75% asked for.
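
To hold coverage near 75% rather than let uniform sampling fill in almost every cell, one option is to pick the allowed combinations first and sample only from those. The following is a minimal base-R sketch under that assumption, not the method used in the answer above; names such as all_combos, allowed and n_keep are made up for illustration.

```r
set.seed(42)

n_obs    <- 15000  # number of responses asked for in the question
coverage <- 0.75   # target share of combinations (from the question)

# All 5^5 = 3125 possible combinations of x1..x5
all_combos <- expand.grid(x1 = 1:5, x2 = 1:5, x3 = 1:5, x4 = 1:5, x5 = 1:5)

# Keep a random 75% of them, rounded to a whole number of rows
n_keep  <- round(coverage * nrow(all_combos))
allowed <- all_combos[sample(nrow(all_combos), n_keep), ]

# Use every allowed combination once, then fill the remaining rows by
# sampling the allowed combinations with replacement
idx <- c(seq_len(n_keep), sample(n_keep, n_obs - n_keep, replace = TRUE))
dat <- allowed[sample(idx), ]  # shuffle the row order
dat <- cbind(obs = seq_len(n_obs), dat)
rownames(dat) <- NULL

# Check: share of the possible combinations actually present
n_distinct_combos <- nrow(unique(dat[, paste0("x", 1:5)]))
n_distinct_combos / nrow(all_combos)
```

Because each allowed combination is placed once before the remaining rows are drawn, the data contain exactly n_keep distinct combinations. Which combinations are excluded is random here; if some should be more plausible than others, the choice of allowed rows would need to reflect that.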