R data.table: Keep All Rows of the First Occurrence per Group

I have a data.table with groups and values. I want to keep only the rows that belong to each group's first occurrence, i.e. its first consecutive run when reading the table from top to bottom.

Example:

library(data.table)
set.seed(666)
group = c(1,1,1,2,2,3,3,3,1,1,4,4,4,1,1,2)
value = runif(16)
DT = data.table(group,value)

> DT
    group      value
 1:     1 0.77436849
 2:     1 0.19722419
 3:     1 0.97801384
 4:     2 0.20132735
 5:     2 0.36124443
 6:     3 0.74261194
 7:     3 0.97872844
 8:     3 0.49811371
 9:     1 0.01331584
10:     1 0.25994613
11:     4 0.77589308
12:     4 0.01637905
13:     4 0.09574478
14:     1 0.14216354
15:     1 0.21112624
16:     2 0.81125644

What I want to achieve (rows 9, 10, 14, 15 and 16 are removed because groups 1 and 2 have already appeared earlier in the table):

> DT
    group      value
 1:     1 0.77436849
 2:     1 0.19722419
 3:     1 0.97801384
 4:     2 0.20132735
 5:     2 0.36124443
 6:     3 0.74261194
 7:     3 0.97872844
 8:     3 0.49811371
11:     4 0.77589308
12:     4 0.01637905
13:     4 0.09574478

I’ve figured out that DT[,.SD[1], by = "group", .SDcols = "value"] gives me the first entry per group, but I want all entries up to the point where the group changes (e.g. the first three entries of group 1 in this particular case).

I thought about using something like DT[,.I, by = group] which gives me the row indices per entry sorted by groups but I have absolutely no idea how to elegantly identify such "breaks" within a data.table expression.

Edit: I find the maximum row number per group like this but don’t know how to continue from here:

maxRow = setnames(DT[, na.omit(which(diff(.I) > 1)[1]), by = "group"], "V1", "maxRow")

Solution:

One possible solution:

DT[, .(group, value, rleid = rleid(group))][, .SD[rleid == min(rleid), .(value)], by = group]
    group      value
    <num>      <num>
 1:     1 0.77436849
 2:     1 0.19722419
 3:     1 0.97801384
 4:     2 0.20132735
 5:     2 0.36124443
 6:     3 0.74261194
 7:     3 0.97872844
 8:     3 0.49811371
 9:     4 0.77589308
10:     4 0.01637905
11:     4 0.09574478
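
For reference, here is a minimal sketch of why the rleid() approach works, plus two alternative formulations. It assumes the same DT as in the question; the names run, first_runs, res1 and res2 are only illustrative and do not come from the original answer. rleid() assigns one id per consecutive run of equal values, so a group's first occurrence is simply the run that contains its first row.

```r
library(data.table)

# Same data as in the question.
set.seed(666)
group <- c(1,1,1,2,2,3,3,3,1,1,4,4,4,1,1,2)
value <- runif(16)
DT <- data.table(group, value)

# rleid() numbers consecutive runs of identical values.
# For the group column above: 1 1 1 2 2 3 3 3 4 4 5 5 5 6 6 7
run <- rleid(DT$group)

# The first row of each group lies in that group's first run,
# so these are exactly the run ids to keep: 1 2 3 5
first_runs <- run[!duplicated(DT$group)]

# Variant 1: plain logical filter; keeps the original row order and all columns.
res1 <- DT[run %in% first_runs]

# Variant 2: follows the diff(.I) idea from the question. Within each group,
# keep rows until the row index .I first jumps by more than 1.
res2 <- DT[, .SD[cumsum(c(FALSE, diff(.I) > 1)) == 0], by = group]

print(res1)
```

Both variants should return the same 11 rows as the desired output in the question. Variant 1 avoids per-group .SD subsetting, which may matter on large tables, though that has not been benchmarked here.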