
R: Delete Rows After First "Break" Occurs

I am working with the R programming language.

I have the following dataset:

library(dplyr)

my_data = data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5)
)

> my_data
  id year var
1  1 2010   1
2  1 2011   7
3  1 2012   3
4  1 2013   9
5  1 2015   5
6  1 2016   6
7  2 2015  88
8  2 2016  12
9  2 2020   5

My Question: For each ID, I want to find where the first "non-consecutive" year occurs and then delete all rows after that point.

For example:

  • When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
  • When ID = 2, the first "jump" occurs at 2016 – therefore, I would like to delete all rows after 2016.

This was my attempt to write the code for this problem:

final = my_data %>%
  group_by(id) %>%
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  group_by(id, add = TRUE) %>%
  slice(1:break_index)

The code appears to be working – but I get the following warning messages which are concerning me:

Warning messages:
1: In 1:break_index :
  numerical expression has 6 elements: only the first used
2: In 1:break_index :
  numerical expression has 3 elements: only the first used

Can someone please tell me if I have done this correctly?

Thanks!

Solution:

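You get the warning because break_index is a column, so within each group it holds one value per row (6 for ID 1, 3 for ID 2), and 1:break_index uses only the first of them. Since mutate() recycles the same value across every row of the group, the first element is the right one and your attempt gives the correct result. To avoid the warning, select a single value explicitly: use slice(1:break_index[1]) or slice(1:first(break_index)).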
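As a minimal sketch of that fix, here is the original attempt with first() applied. It also drops the second group_by(), which is redundant because the data is already grouped by id, and adds an ungroup() at the end, which is my own addition rather than part of the question.

```r
library(dplyr)

my_data <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5)
)

final <- my_data %>%
  group_by(id) %>%
  # position of the first gap larger than one year, repeated on every row of the group
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  # first() reduces the column to a single value, so 1:n no longer warns
  slice(1:first(break_index)) %>%
  ungroup()
```

The helper column break_index stays in the result; add select(-break_index) if it is not wanted.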

Here is another way to handle this.

library(dplyr)

my_data %>%
  group_by(id) %>%
  filter(row_number() <= which(diff(year) > 1)[1])

#     id  year   var
#  <dbl> <dbl> <dbl>
#1     1  2010     1
#2     1  2011     7
#3     1  2012     3
#4     1  2013     9
#5     2  2015    88
#6     2  2016    12
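To see what the filter condition computes, it may help to run the pieces on a single group in base R. The sketch below uses the years for id 1 from the question; the variable names gaps and break_index are only illustrative.

```r
# Years for id == 1, copied from my_data in the question
year <- c(2010, 2011, 2012, 2013, 2015, 2016)

gaps <- diff(year)                  # 1 1 1 2 1
break_index <- which(gaps > 1)[1]   # 4: the first gap follows the 4th row
kept_years <- year[seq_len(break_index)]
kept_years                          # 2010 2011 2012 2013
```

row_number() <= break_index therefore keeps rows 1 to 4 of that group.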

With dplyr 1.1.0 or later, we can use per-operation grouping with .by instead of group_by():

my_data %>%
  filter(row_number() <= which(diff(year) > 1)[1], .by = id)
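One caveat: if a group has no gap at all, which(diff(year) > 1)[1] is NA, so the comparison is NA for every row and filter() drops the whole group. If such groups should be kept in full (an assumption, since the question does not say), a cumsum()-based condition is one possible variant. The data frame gap_data below is hypothetical and only there to show a gap-free group.

```r
library(dplyr)

# Hypothetical data: id 3 has consecutive years only
gap_data <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3),
  year = c(2010, 2011, 2013, 2015, 2020, 2001, 2002)
)

kept <- gap_data %>%
  group_by(id) %>%
  # flag each row that follows a gap; keep rows until the first flag appears
  filter(cumsum(c(FALSE, diff(year) > 1)) == 0) %>%
  ungroup()
```

Here kept contains 2010 and 2011 for id 1, 2015 for id 2, and both rows for id 3.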