Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Converting time series transaction data into an occurrence count table of all products in R

I have 5 million+ rows of daily transactions data in a column format of (year, month, day of the month, hour of the day, product that is purchased):
I created this data frame as an example (please note that the original data is not ordered as I have given in this example and the range of date spreads over a span of 4 years.):

enter image description here

all_product_names = c("a","b","c","d","e","f","g","h","i")
my_table <- data.frame(year = c(2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022),
           month = c(rep(4,12)),
           day = c(rep(3,12)),
           hour = c(0,0,0,0,0,0,1,1,1,1,2,2),
           product_name = c("a","b","c","a","c","c","d","f","a","b","c","f") )

My desired output is a table where for a given time instance (a specific combination of year-month-day-hour) count of all product purchases are extended into the table as columns. "all_product_names" variable gives you a list of all the products that exist. Notice that not every time instance has all products purchased, for those instances, the "count table" value for that product must be zero. Here is what I am trying to get my output grouping to look like:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

enter image description here

Using sqldf I have tried to group by product so that I get the count of product_name for every time instance, however It does not count into account of products that do not exist for every instance. And also I need to convert the row format into columns as indicated above in the solution table example.

xx <-sqldf('SELECT year,month,day,hour,product_name,COUNT(product_name) FROM my_table GROUP BY product_name,hour,day,month,year
      ORDER BY product_name ASC')

Also as a note I do not have to use sqldf library. Thank you for your time!

>Solution :

library(tidyverse)
my_table %>%
  count(year, month, day, hour, product_name) %>%
  pivot_wider(names_from = product_name, values_from = n, values_fill = 0)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading