Number of different elements up to this point

I’ve got a relatively simple problem (I think) and I want to solve it in a fast and efficient way.

I want to count the number of different elements in a vector up to each point in this vector.

For example, in a vector like this

vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

I want to get the following vector of equal size as a result:
[1 2 3 3 3 3 4 4]

I could solve this of course with a for loop in combination with cumsum():

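vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
res <- TRUE  # the first element is always new
for (i in seq_along(vec)[-1]) {
  # TRUE if vec[i] has not occurred among the earlier elements
  res[i] <- !(vec[i] %in% vec[1:(i-1)])
}
cumsum(res)
#[1] 1 2 3 3 3 3 4 4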

However, I am dealing with vectors that have several million elements and a for-loop approach takes forever for such a relatively simple problem.

I have the intuition that this should be solvable much faster and more cleverly. Do you have any ideas? Thank you!

(In case you’re interested: I need this for a vocabulary growth curve analysis where we want to know at each point in the text how many different words, i.e. types, have been observed so far.)

Solution:

Use cumsum() on the negated (!) result of duplicated(), which is TRUE exactly at the first occurrence of each value:

cumsum(!duplicated(vec))
#[1] 1 2 3 3 3 3 4 4
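
To see why this works, it may help to look at the intermediate steps. The following is only an illustrative sketch on the example vector from the question; the names dup, first_seen and types_so_far are made up for this illustration.

```r
vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

# duplicated() flags every element that already occurred earlier in the vector
dup <- duplicated(vec)
# dup is FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE

# Negating it marks the position where each value is seen for the first time
first_seen <- !dup

# cumsum() treats TRUE as 1 and FALSE as 0, so the running total is the
# number of distinct values observed up to each position
types_so_far <- cumsum(first_seen)
types_so_far
#[1] 1 2 3 3 3 3 4 4
```

Because duplicated() and cumsum() are both vectorised, no explicit loop over the positions is needed.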

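Another approach uses match(), which returns the position of the first occurrence of each unique value:

uni <- logical(length(vec))  # all FALSE
uni[match(unique(vec), vec)] <- TRUE  # mark first occurrences
cumsum(uni)
#[1] 1 2 3 3 3 3 4 4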
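
Since the question is about vectors with several million elements, here is a rough way to check that both approaches agree on a larger input and to time them. The data is simulated: the vector words, the 5000 made-up labels and the seed are assumptions for this sketch, not real text, and the timings will depend on your machine and data.

```r
set.seed(1)  # arbitrary seed, only so the simulated data is reproducible

# Hypothetical test data: one million tokens drawn from 5000 made-up labels
words <- sample(paste0("w", 1:5000), 1e6, replace = TRUE)

# Approach 1: negate duplicated() and take the running total
system.time(res_dup <- cumsum(!duplicated(words)))

# Approach 2: mark the first position of each unique value via match()
system.time({
  uni <- logical(length(words))
  uni[match(unique(words), words)] <- TRUE
  res_match <- cumsum(uni)
})

# Both approaches should give the same vocabulary growth curve
identical(res_dup, res_match)
```

The last value of either result equals length(unique(words)), i.e. the total number of types in the whole vector.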