Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Counting previous occurrences of certain variable per group and storing as new column

I want to create five new columns that count how often a certain value of "stars" has happened for this business before this particular row (i.e., summing up over all rows with a smaller rolingcount but holding the business constant).

For the first row of each business (i.e., where rolingcount == 0), it should be NA, because there have been no previous occurrences for this business.

Here is an exemplary dataset:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

business <-c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc") 
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8) 
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5) 
df <- cbind(business, rolingcount, stars)

I feel my problem is related to this, but with a gist, that I don’t get to work: Numbering rows within groups in a data frame

I also unsuccessfully experimented with while loops.

Ideally, something like this will be the output. (I leave out previousthree, previoustwo, previousone, because I believe they will work identical).

business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)`

Since, I will have to do this for more than 10 M rows, a fast option would be cool. Your help is much appreciated! 🙂

>Solution :

I don’t know if the option is really fast, I’m not used to deal with that many rows.
Here is a solution using dplyr package in the tidyverse :

library(tidyverse)
df %>% 
  as.data.frame() %>% 
  group_by(business) %>% 
  mutate(stars = as.numeric(stars),
         lag_stars = lag(stars, 1, default = 0),
         previousfives = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
         previousfours = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
         previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
         previoustwos = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
         previousones = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>% 
  ungroup() %>% 
  select(-lag_stars)

Output :

# A tibble: 16 x 8
   business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
   <chr>    <chr>       <dbl>         <dbl>         <dbl>          <dbl>        <dbl>        <dbl>
 1 aaa      1               5            NA            NA             NA           NA           NA
 2 aaa      2               5             1             0              0            0            0
 3 aaa      3               3             2             0              0            0            0
 4 bbb      1               5            NA            NA             NA           NA           NA
 5 bbb      2               5             1             0              0            0            0
 6 bbb      3               1             2             0              0            0            0
 7 bbb      4               2             2             0              0            0            1
 8 bbb      5               3             2             0              0            1            1
 9 ccc      1               5            NA            NA             NA           NA           NA
10 ccc      2               1             1             0              0            0            0
11 ccc      3               2             1             0              0            0            1
12 ccc      4               3             1             0              0            1            1
13 ccc      5               4             1             0              1            1            1
14 ccc      6               5             1             1              1            1            1
15 ccc      7               5             2             1              1            1            1
16 ccc      8               5             3             1              1            1            1

Basically, group_by is to perform the operation for each business, and it makes a cumulative lagged sum.
Maybe it’ll lead you to another faster idea if it is too slow.
Hope it helped.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading