R: How to count the total number of tokens in a corpus?

I have created a quanteda corpus called readtext_corpus that contains 190 texts. I would like to count the total number of tokens (words) in the corpus. I tried the ntoken() function, but it returns the number of tokens per text, not the total across all 190 texts.

Solution:

You can simply pass the result of ntoken() to sum(), which adds the per-text counts into a single total. Here is an example:

test <- c("testing string number 1","testing string number 2")

sum(quanteda::ntoken(test))

Result:

> quanteda::ntoken(test)
text1 text2 
    4     4 
> sum(quanteda::ntoken(test))
[1] 8

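If you use pipes, which is common in quanteda workflows, the equivalent is shown here (the %>% operator comes from the magrittr package, so that package must be loaded):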

> quanteda::ntoken(test) %>% sum()
[1] 8
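
Applied to a corpus object like the one in the question, the same idea might look like the sketch below. The two-document corpus here is a hypothetical stand-in for readtext_corpus, built only so the example runs on its own; with the real corpus you would skip that step. The sketch also tokenises explicitly with tokens() before counting, which is a choice made here rather than something the answer above requires. It makes the tokenisation rules visible, and newer quanteda versions may warn when ntoken() is called directly on a corpus or character vector.

```r
library(quanteda)

# Hypothetical stand-in for the asker's readtext_corpus (assumption:
# the real object is a quanteda corpus with one document per text)
readtext_corpus <- corpus(c(doc1 = "testing string number 1",
                            doc2 = "testing string number 2"))

# Tokenise explicitly so the tokenisation settings are visible
toks <- tokens(readtext_corpus)

# Named integer vector: one token count per document
tokens_per_text <- ntoken(toks)

# Add the per-document counts to get the corpus-wide total
total_tokens <- sum(tokens_per_text)
total_tokens
```

Note that the total depends on how the texts are tokenised: for instance, passing remove_punct = TRUE to tokens() drops punctuation tokens and lowers the count for texts that contain punctuation.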