I wish to generate a data set in R mimicking the responses to a 5-variable (x1, x2, x3, x4, x5), 5-level data set (1,2,3,4,5).
I’d like the data set to have around n = 15000 responses and to cover around 75% of the total possible combinations.
That is, approximately 75% of the 5^5 = 3125 possible combinations should appear in a data set of around n = 15000 observations.
Would anyone be able to show me how this can be executed in R, please?
>Solution :
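library(tidyr)  # pivot_wider()

set.seed(42)
# Long format: one row per (observation, variable), levels drawn uniformly from 1:5
data.frame(obs = 1:15000,
           q = rep(paste0("x", 1:5), each = 15000),
           level = sample(1:5, 15000 * 5, TRUE)) |>
  pivot_wider(names_from = q, values_from = level)  # one row per observation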
Produces
# A tibble: 15,000 × 6
     obs    x1    x2    x3    x4    x5
   <int> <int> <int> <int> <int> <int>
 1     1     1     4     4     5     1
 2     2     5     4     3     5     4
 3     3     1     5     4     3     4
 4     4     1     3     5     4     5
 5     5     2     5     5     3     5
 6     6     4     1     5     1     1
 7     7     2     5     2     1     4
 8     8     2     1     2     1     4
 9     9     1     2     3     3     3
10    10     4     5     1     3     5
# ℹ 14,990 more rows
…
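We can add |> unite("combo", x1:x5, remove = FALSE) |> count(combo) (unite() is from tidyr, count() from dplyr) to see that this covers 3,094 of the 3,125 possible level combinations. That is about 99%, noticeably more than the roughly 75% you were aiming for.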
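Sampling 15,000 rows uniformly over 3,125 cells is expected to hit nearly all of them (roughly 3125 * (1 - (1 - 1/3125)^15000), or about 3,099), so if coverage closer to 75% is what you need, one option is to restrict the sampling to a random subset of combinations first. The following is only a minimal sketch in base R, not the only way to do it; the names grid, keep, idx, dat and n_combos are illustrative, and the choice to make every kept combination equally likely is an assumption.

```r
set.seed(42)

n_obs  <- 15000
target <- round(0.75 * 5^5)   # 2344 of the 3125 possible combinations

# All 5^5 combinations of levels 1..5 for x1..x5
grid <- expand.grid(x1 = 1:5, x2 = 1:5, x3 = 1:5, x4 = 1:5, x5 = 1:5)

# Choose which combinations are allowed to appear (assumption: chosen at random)
keep <- sample(nrow(grid), target)

# Use each kept combination once, which guarantees the coverage, then fill the
# remaining rows by sampling the kept combinations with replacement
idx <- c(keep, sample(keep, n_obs - target, replace = TRUE))
idx <- sample(idx)            # shuffle the row order

dat <- cbind(obs = seq_len(n_obs), grid[idx, ])
rownames(dat) <- NULL

# Coverage check: distinct combinations present, as a share of all 5^5
n_combos <- nrow(unique(dat[, paste0("x", 1:5)]))
n_combos / 5^5
```

By construction n_combos equals target, so the coverage is fixed rather than left to chance. The marginal distribution of each variable will be close to uniform but not exactly so, because a quarter of the combinations are excluded.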