I’ve got a relatively simple problem (I think) and I want to solve it in a fast and efficient way.
I want to count the number of different elements in a vector up to each point in this vector.
For example, in a vector like this
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
I want to get the following vector of equal size as a result:
[1 2 3 3 3 3 4 4]
I could solve this of course with a for loop in combination with cumsum():
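vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
res <- TRUE  # the first element is always new
for (i in seq_along(vec)[-1]) {
  # TRUE if vec[i] has not occurred among the earlier elements
  res[i] <- !(vec[i] %in% vec[1:(i-1)])
}
cumsum(res)
[1] 1 2 3 3 3 3 4 4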
However, I am dealing with vectors that have several million elements and a for-loop approach takes forever for such a relatively simple problem.
My intuition is that this should be solvable much faster and more cleverly. Do you have any ideas? Thank you!
(In case you’re interested: I need this for a vocabulary growth curve analysis where we want to know at each point in the text how many different words, i.e. types, have been observed so far.)
Solution:
Use cumsum() on the negated (!) result of duplicated(), which is TRUE at the first occurrence of each value:
cumsum(!duplicated(vec))
#[1] 1 2 3 3 3 3 4 4
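To see why this works, it can help to look at the intermediate steps. The following is only a small sketch on the example vector from the question; the names dup, first_seen and res are introduced here for illustration.

```r
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

# duplicated() marks every element that already appeared earlier in the vector
dup <- duplicated(vec)
dup
#[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE

# Negating it marks the first occurrence of each distinct value
first_seen <- !dup

# TRUE counts as 1, so the running sum is the number of distinct values so far
res <- cumsum(first_seen)
res
#[1] 1 2 3 3 3 3 4 4
```

Both duplicated() and cumsum() are vectorised, so no explicit loop over the elements is needed.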
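And another approach with match(), which returns the position of the first occurrence of each unique value:
uni <- logical(length(vec))  # all FALSE to start with
uni[match(unique(vec), vec)] <- TRUE  # flag first occurrences
cumsum(uni)
#[1] 1 2 3 3 3 3 4 4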
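For the vocabulary growth use case mentioned in the question, a minimal sketch might look like the following. The sample sentence, the whitespace tokenisation with strsplit() and the names text, tokens and type_counts are assumptions made for illustration; a real corpus would need proper tokenisation and normalisation first.

```r
# Hypothetical sample text; a real corpus would be read from a file
text <- "the cat sat on the mat and the dog sat on the cat"

# Naive whitespace tokenisation (assumes the tokens are already normalised)
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

# Number of distinct types observed up to each token
type_counts <- cumsum(!duplicated(tokens))
type_counts
#[1] 1 2 3 4 4 5 6 6 7 7 7 7 7

# Cross-check with the match() approach: both should give the same counts
uni <- logical(length(tokens))
uni[match(unique(tokens), tokens)] <- TRUE
stopifnot(identical(cumsum(uni), type_counts))
```

Something like plot(type_counts, type = "s") would then draw the growth curve as a step function.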