group_by() and mutate() do not match sizes

I have a large data table with multiple columns and a custom function. The data table looks something like this, and there are eight different bird_ID values:

   GPS_ID bird_ID device_ID devicetype           timestamp       date
1:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02
2:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02
3:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02
4:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02
5:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02
6:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02

The custom function calculates the time difference between the timestamps of consecutive rows and assigns each row a number, stored in a new column named Position.Burst.ID. If the difference is 5 seconds or more, the number sequence advances; otherwise the row keeps the previously assigned number.

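pbid <- function(data_table) {
  # Rows that start a new burst: the first row, plus any row whose gap to the previous row is >= 5
  newbout <- which(c(TRUE, diff(as.POSIXct(data_table$timestamp, tz = "UTC")) >= 5))
  # Repeat each burst number for as many rows as that burst spans
  boutind <- rep(seq_along(newbout), diff(c(newbout, nrow(data_table) + 1)))
  # Return the vector of burst IDs, one per row
  boutind
}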

This function works great with one bird_ID.

   GPS_ID bird_ID device_ID devicetype           timestamp       date Position.Burst.ID   
1:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02                 1
2:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02                 1
3:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02                 1
4:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02                 1
5:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02                 1
6:     NA    350E    202927   ornitela 2022-05-02 00:03:59 2022-05-02                 1

I wanted to group_by(bird_ID) so that the counting restarts from 1 for each bird_ID:

data_table %>%
  group_by(bird_ID) %>%
  mutate(Position.Burst.ID = pbid(data_table))

That didn't work; it fails with:

`Position.Burst.ID` must be size 419335 or 1, not 4592293.

Any ideas on how to approach this?

I have already tried to create a loop and put the function inside, but that was also a dead-end. And I really wanted to avoid using a for loop with this amount of data.

Solution:

Here’s how I’d do it:

data_table %>%
  group_by(bird_ID) %>%
  mutate(Position.Burst.ID = cumsum(timestamp - lag(timestamp, default = timestamp[1]) >= 5) + 1)
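
As for why the original call fails: the sizes in the error message suggest that `pbid(data_table)` inside `mutate()` still receives the whole table (4592293 rows), so it returns one value per row of the full table, while `mutate()` expects one value per row of the current group (419335 rows for that bird). Passing the grouped `timestamp` column instead avoids the mismatch. Below is a minimal sketch of that idea on made-up data. The helper name `burst_id`, its `gap_secs` argument, and the toy table are assumptions for illustration, not part of the original question. It converts the gaps to seconds explicitly so the comparison does not depend on the units R picks for the time difference.

```r
library(dplyr)

# Hypothetical toy data: two birds, timestamps given as second offsets
toy <- tibble(
  bird_ID   = c("350E", "350E", "350E", "350E", "351F", "351F", "351F"),
  timestamp = as.POSIXct("2022-05-02 00:00:00", tz = "UTC") +
    c(0, 1, 10, 12, 0, 7, 8)
)

# Vector in, vector out: takes the timestamps of ONE group and returns
# one burst number per timestamp (hypothetical helper, not from the question)
burst_id <- function(timestamp, gap_secs = 5) {
  ts   <- as.POSIXct(timestamp, tz = "UTC")
  # Gaps between consecutive rows, forced to seconds
  gaps <- as.numeric(diff(ts), units = "secs")
  # The first row always opens burst 1; each gap >= gap_secs opens a new one
  cumsum(c(TRUE, gaps >= gap_secs))
}

result <- toy %>%
  group_by(bird_ID) %>%
  mutate(Position.Burst.ID = burst_id(timestamp)) %>%
  ungroup()

print(result)
```

Because `burst_id()` only ever sees the `timestamp` values of the current group, its output length always matches the group size, and the numbering restarts at 1 for each bird_ID.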