I’m wondering if there is a way to group a data frame based on two variables and then add new rows if a certain string, such as a name, is not present within the grouped data.
I have a list of dummy names (name_list) that I’d like to ensure are included in the grouped data irrespective of whether they were there in the first place or not (based on the name column). For example, I’d like to group by date and drill, but you’ll note Person A and Person B are missing from 2023-01-01/Activity 1. I’d like to make sure every name from name_list is included for every combination of date/drill, and if names are added, they are assigned a duration of 0 to essentially represent they were not present for that date/drill combination. Hope that makes sense. Thanks.
library(tidyverse)
# List of names to be joined within grouping variables.
name_list <- c(paste("Person", LETTERS[1:8]))
set.seed(10)
name <- c(name_list[3:8], name_list[1:6], name_list[2:7], name_list[3:8])
date <- rep(seq(as.Date('2023/01/01'), as.Date('2023/01/02'), by = "day"),
each = 12)
drill <- rep(paste("Activity", 1:2), each = 6, times = 2)
duration <- rep(c(5, 8), each = 6, times = 2)
df <- data.frame(name, date, drill, duration)
name date drill duration
1 Person C 2023-01-01 Activity 1 5
2 Person D 2023-01-01 Activity 1 5
3 Person E 2023-01-01 Activity 1 5
4 Person F 2023-01-01 Activity 1 5
5 Person G 2023-01-01 Activity 1 5
6 Person H 2023-01-01 Activity 1 5
7 Person A 2023-01-01 Activity 2 8
8 Person B 2023-01-01 Activity 2 8
9 Person C 2023-01-01 Activity 2 8
10 Person D 2023-01-01 Activity 2 8
11 Person E 2023-01-01 Activity 2 8
12 Person F 2023-01-01 Activity 2 8
13 Person B 2023-01-02 Activity 1 5
14 Person C 2023-01-02 Activity 1 5
15 Person D 2023-01-02 Activity 1 5
16 Person E 2023-01-02 Activity 1 5
17 Person F 2023-01-02 Activity 1 5
18 Person G 2023-01-02 Activity 1 5
19 Person C 2023-01-02 Activity 2 8
20 Person D 2023-01-02 Activity 2 8
21 Person E 2023-01-02 Activity 2 8
22 Person F 2023-01-02 Activity 2 8
23 Person G 2023-01-02 Activity 2 8
24 Person H 2023-01-02 Activity 2 8
>Solution :
I think you are looking for the complete() function.
EDIT:
As mentioned by @LMc, using complete only works, if all the values of name already exist in the data. Setting the namecolumn to a factor with all possible names of name_list solves this issue.
df %>%
mutate(name = factor(name, levels = name_list)) %>%
complete(name, date, drill, fill = list(duration = 0))
# A tibble: 32 × 4
name date drill duration
<fct> <date> <chr> <dbl>
1 Person A 2023-01-01 Activity 1 0
2 Person A 2023-01-01 Activity 2 8
3 Person A 2023-01-02 Activity 1 0
4 Person A 2023-01-02 Activity 2 0
5 Person B 2023-01-01 Activity 1 0
6 Person B 2023-01-01 Activity 2 8
7 Person B 2023-01-02 Activity 1 5
8 Person B 2023-01-02 Activity 2 0
9 Person C 2023-01-01 Activity 1 5
10 Person C 2023-01-01 Activity 2 8
11 Person C 2023-01-02 Activity 1 5
12 Person C 2023-01-02 Activity 2 8
13 Person D 2023-01-01 Activity 1 5
14 Person D 2023-01-01 Activity 2 8
15 Person D 2023-01-02 Activity 1 5
16 Person D 2023-01-02 Activity 2 8
17 Person E 2023-01-01 Activity 1 5
18 Person E 2023-01-01 Activity 2 8
19 Person E 2023-01-02 Activity 1 5
20 Person E 2023-01-02 Activity 2 8
21 Person F 2023-01-01 Activity 1 5
22 Person F 2023-01-01 Activity 2 8
23 Person F 2023-01-02 Activity 1 5
24 Person F 2023-01-02 Activity 2 8
25 Person G 2023-01-01 Activity 1 5
26 Person G 2023-01-01 Activity 2 0
27 Person G 2023-01-02 Activity 1 5
28 Person G 2023-01-02 Activity 2 8
29 Person H 2023-01-01 Activity 1 5
30 Person H 2023-01-01 Activity 2 0
31 Person H 2023-01-02 Activity 1 0
32 Person H 2023-01-02 Activity 2 8