Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

dplyr group_by retaining extra columns after summarise

I am at a total loss for this one – I am playing with the "pedestrian" dataset from tsibble. I want to get total counts for each month/year. I started by adding a month_year column, then summarise the data with sum, like so:

library("tidyverse")
library("tsibble")

df1 <- pedestrian
df1$month_year <- format(as.Date(df1$Date), "%Y-%m")

count_all <- df1 %>%  
  dplyr::group_by(month_year) %>% 
  dplyr::summarise(total = sum(Count))

A summary of count_all looks like this:

  month_year          Date_Time                         total      
 Length:17542       Min.   :2015-01-01 00:00:00.0   Min.   :   12  
 Class :character   1st Qu.:2015-07-02 17:15:00.0   1st Qu.:  349  
 Mode  :character   Median :2016-01-01 11:30:00.0   Median : 2090  
                    Mean   :2016-01-01 11:44:40.2   Mean   : 2593  
                    3rd Qu.:2016-07-02 04:45:00.0   3rd Qu.: 4455  
                    Max.   :2016-12-31 23:00:00.0   Max.   :15990

Why is Date_Time being retained? And how can I prevent it form impacting the summary (as in prevent it from giving me 17,542 rows instead of the expected 24). If I remove the column before the summary like so:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df1$Date_Time <- NULL

Then it works fine, and a summary of the result looks like this:

  month_year            total        
 Length:24          Min.   :1148276  
 Class :character   1st Qu.:1756898  
 Mode  :character   Median :1927154  
                    Mean   :1895161  
                    3rd Qu.:2066043  
                    Max.   :2393675  

This solution is fine, but I would like to know what the cause of the issue is so that I can avoid it in future (it was easy to catch the problem this time, but may not always be so straight forward).

Thanks in advance for the help!

>Solution :

The dataset pedestrian is a tsibble with sensor as a key and Date_Time as the index. Any operation you do on the tsibble will retain the index. You can remove the index by converting back to a tibble.

pedestrian %>%
  as_tibble() %>% 
  mutate(ym = yearmonth(Date)) %>% 
  dplyr::group_by(ym) %>% 
  dplyr::summarise(total = sum(Count))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading