I’m trying to find the last element in a data.table by group in a time efficient way. I have a solution that works:
library(data.table)
Data <- data.table(id = c(rep("a", 2), rep("b",3)),
time = c(1:2, 1:3))
Data[, lastobs := max(time), by = id]
Data <- Data[time == lastobs]
Data[, lastobs := NULL]
but the max() by group command takes pretty long. In my still manageable dataset of 3 million observations and time being a yearmonth variable, it takes 20 seconds and I need to do it for many yearmonths. I want to move to a much larger dataset now where this becomes infeasible. I was thinking there must be a data.table way to do this by simply ordering the data.table by id and time and then using these .I, .N, or .SD shorthands that I never understand to simply keep the last element by group without having to calculate something like max() within each group. Is there such a solution? My attempt:
Data <- data.table(id = c(rep("a", 2), rep("b",3)),
time = c(1:2, 1:3))
Data[,.N == .I, by = id]
selects the last row of the first group and the first row of the second group because I don’t really understand this syntax…
>Solution :
Assuming the data.table is sorted (if not, use setorder):
Data[, .SD[.N,], by = id]
# id time
#1: a 2
#2: b 3
From the documentation:
".SD is a data.table containing the Subset of x‘s Data for each group …"
".N is an integer, length 1, containing the number of rows in the group."