I want to take a set of data, group it by one column, order it by another, and then count how many subsequent instances of a certain event happen. For example, in the data below… I want to add a column, call it nSubsqX, that tells me at each row, for that customer, how many subsequent orders have Product "X". Row 1 should result in 3 because rows 13:15 all are Customer C, Product X; Row 9 should result in 0 since there are no subsequent orders by Customer A.
Date Customer Product
1 2020-01-31 C X
2 2020-02-10 B X
3 2020-02-12 B Y
4 2020-03-04 B Z
5 2020-03-29 B X
6 2020-04-08 B X
7 2020-04-30 B X
8 2020-05-13 B X
9 2020-05-18 A X
10 2020-05-23 B Y
11 2020-07-02 B Y
12 2020-08-26 B Y
13 2020-09-19 C X
14 2020-10-13 C X
15 2020-11-11 C X
16 2020-12-06 B X
17 2020-12-26 C Y
For the purpose of providing a Reprex, below is the code to create the data frame.
df = data.frame("Date" = as.Date(c("2020-01-31", "2020-02-10", "2020-02-12",
"2020-03-04", "2020-03-29", "2020-04-08", "2020-04-30", "2020-05-13", "2020-05-18",
"2020-05-23", "2020-07-02", "2020-08-26", "2020-09-19", "2020-10-13", "2020-11-11",
"2020-12-06", "2020-12-26")), "Customer" = c("C","B","B","B","B","B","B","B","A",
"B","B","B","C","C","C","B","C"), "Product" = c("X","X","Y","Z","X","X","X","X","X",
"Y","Y","Y","X","X","X","X","Y"))
I expect that I will need some sort of mutate function, but I can’t quite get it right, I’ve tried:
df2 = df %>%
group_by(Customer) %>%
arrange(Customer, Date) %>%
mutate(
nSubsqX = length(Customer[which(Product == "X")]))
Which gives the TOTAL of times "X" shows up, but what I want is the number subsequently to this one. I also tried:
df2 = df %>%
group_by(Customer) %>%
arrange(Customer, Date) %>%
mutate(
nSubsqX = length(Customer[which(Product == "X" & Date > Date)]))
Which just returns 0 probably because Date > Date doesn’t make any sense. I need a way to say Date > THIS Date. The solution I am trying to achieve is shown below:
Date Customer Product nSubsqX
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
I think this is just an issue of not even knowing what words to search, so I’m sure there’s something out there that would tell me what to do if I could figure out the right search criteria. I appreciate anyone pointing me in the right direction.
Thanks!
>Solution :
Here is one option with tidyverse – arrange by ‘Customer’, ‘Date’, then grouped by ‘Customer’, replace the vector of NA elements, where the ‘Product’ is ‘X’ with reverse sequence of the count of ‘X’ values, then we either use tidyr::fill or can use zoo::na.locf0 to fill the NA elements with previous non-NA values
library(dplyr)
df %>%
arrange(Customer, Date) %>%
group_by(Customer) %>%
mutate(new = zoo::na.locf0(replace(rep(NA_real_, n()),
Product == "X", rev(seq_len(sum(Product == "X")))-1))) %>%
ungroup
-output
# A tibble: 17 Ă— 4
Date Customer Product new
<date> <chr> <chr> <dbl>
1 2020-05-18 A X 0
2 2020-02-10 B X 5
3 2020-02-12 B Y 5
4 2020-03-04 B Z 5
5 2020-03-29 B X 4
6 2020-04-08 B X 3
7 2020-04-30 B X 2
8 2020-05-13 B X 1
9 2020-05-23 B Y 1
10 2020-07-02 B Y 1
11 2020-08-26 B Y 1
12 2020-12-06 B X 0
13 2020-01-31 C X 3
14 2020-09-19 C X 2
15 2020-10-13 C X 1
16 2020-11-11 C X 0
17 2020-12-26 C Y 0
Similar option can be done with data.table
library(data.table)
setDT(df)[order(Customer, Date)][Product == "X",
nSubsqx := rev(seq_len(.N)) - 1, Customer][,
nSubsqx := nafill(nSubsqx, "locf"), Customer][]
-output
ndex: <Product>
Date Customer Product nSubsqx
<Date> <char> <char> <num>
1: 2020-05-18 A X 0
2: 2020-02-10 B X 5
3: 2020-02-12 B Y 5
4: 2020-03-04 B Z 5
5: 2020-03-29 B X 4
6: 2020-04-08 B X 3
7: 2020-04-30 B X 2
8: 2020-05-13 B X 1
9: 2020-05-23 B Y 1
10: 2020-07-02 B Y 1
11: 2020-08-26 B Y 1
12: 2020-12-06 B X 0
13: 2020-01-31 C X 3
14: 2020-09-19 C X 2
15: 2020-10-13 C X 1
16: 2020-11-11 C X 0
17: 2020-12-26 C Y 0