Home Counting previous occurrences of certain variable per group and storing as new column

Questions

Counting previous occurrences of certain variable per group and storing as new column

March 22, 2022

I want to create five new columns that count how often a certain value of "stars" has happened for this business before this particular row (i.e., summing up over all rows with a smaller rolingcount but holding the business constant).

For the first row of each business (i.e., where rolingcount == 0), it should be NA, because there have been no previous occurrences for this business.

Here is an exemplary dataset:

business <-c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc") 
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8) 
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5) 
df <- cbind(business, rolingcount, stars)

I feel my problem is related to this, but with a gist, that I don’t get to work: Numbering rows within groups in a data frame

I also unsuccessfully experimented with while loops.

Ideally, something like this will be the output. (I leave out previousthree, previoustwo, previousone, because I believe they will work identical).

business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)`

Since, I will have to do this for more than 10 M rows, a fast option would be cool. Your help is much appreciated! 🙂

>Solution :

I don’t know if the option is really fast, I’m not used to deal with that many rows.
Here is a solution using dplyr package in the tidyverse :

library(tidyverse)
df %>% 
  as.data.frame() %>% 
  group_by(business) %>% 
  mutate(stars = as.numeric(stars),
         lag_stars = lag(stars, 1, default = 0),
         previousfives = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
         previousfours = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
         previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
         previoustwos = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
         previousones = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>% 
  ungroup() %>% 
  select(-lag_stars)

Output :

# A tibble: 16 x 8
   business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
   <chr>    <chr>       <dbl>         <dbl>         <dbl>          <dbl>        <dbl>        <dbl>
 1 aaa      1               5            NA            NA             NA           NA           NA
 2 aaa      2               5             1             0              0            0            0
 3 aaa      3               3             2             0              0            0            0
 4 bbb      1               5            NA            NA             NA           NA           NA
 5 bbb      2               5             1             0              0            0            0
 6 bbb      3               1             2             0              0            0            0
 7 bbb      4               2             2             0              0            0            1
 8 bbb      5               3             2             0              0            1            1
 9 ccc      1               5            NA            NA             NA           NA           NA
10 ccc      2               1             1             0              0            0            0
11 ccc      3               2             1             0              0            0            1
12 ccc      4               3             1             0              0            1            1
13 ccc      5               4             1             0              1            1            1
14 ccc      6               5             1             1              1            1            1
15 ccc      7               5             2             1              1            1            1
16 ccc      8               5             3             1              1            1            1

Basically, group_by is to perform the operation for each business, and it makes a cumulative lagged sum.
Maybe it’ll lead you to another faster idea if it is too slow.
Hope it helped.