Understanding why grepl doesn't appear to be correctly identifying words in R

I’m trying to count occurrences of a word in a document (as part of some research I’m doing into how politicians use language). I don’t understand why the value I get back in R differs from the value I get when I count the occurrences independently.

#Counting the occurrences of the word 'migrant' in a political debate
fileContent <- readLines("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
wordToCount <- c("Migrant") 
wordCount <- sum(grepl(wordToCount, fileContent, ignore.case = TRUE))
wordCount #returns 20

This returns 20; however, if I open the document and press Ctrl+F for ‘Migrant’, I get 22 hits (I understand that the above code should match the word inside longer strings as well as whole words).

I’ve also tried parsing the XML, but even more confusingly this returns only 18, even though a manual check of the parsed data still shows 22 hits:

#Same as above but parsing the xml
library(xml2) #provides read_xml, xml_find_all and xml_text
fileContent <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
fileContent <- xml_find_all(fileContent, ".//speech")
fileContent <- xml_text(fileContent)
wordToCount <- c("Migrant") 
wordCount <- sum(grepl(wordToCount, fileContent, ignore.case = TRUE))
wordCount #returns 18

#Outputting the data to double-check values
output <- file("output.txt")
writeLines(fileContent, output)
close(output)

Can anyone help me to understand why these two pieces of code are not returning 22?

Solution:

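grepl returns TRUE for an element if it finds at least one occurrence of migrant in it. If a string contains the word twice, it is still counted only once. So the first snippet counts the lines that contain the word (20) and the second counts the speeches that contain it (18), not the total number of occurrences. See this example: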

sum(grepl("migrant", c("Something about migrants. Something else about migrants "))) # returns 1, although the word appears twice
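
If you would rather stay in base R, the difference between counting matching elements and counting matches can be sketched as follows. The `speeches` vector is made-up sample text, not data from the debate file; `gregexpr`, `regmatches` and `lengths` are base R functions.

```r
# Toy data (hypothetical, not taken from the debate file)
speeches <- c(
  "Something about migrants. Something else about Migrants.",
  "Nothing relevant here.",
  "One migrant is mentioned."
)

# grepl() gives one TRUE/FALSE per element, so this counts elements
elements_with_match <- sum(grepl("migrant", speeches, ignore.case = TRUE))

# gregexpr() locates every match; regmatches() extracts them per element
matches <- regmatches(speeches, gregexpr("migrant", speeches, ignore.case = TRUE))
total_matches <- sum(lengths(matches))

elements_with_match  # 2
total_matches        # 3
```

The first number is what `sum(grepl(...))` reports; the second is what a Ctrl+F search would report.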

You can use the stringr package to do what you want:

library(xml2) # provides read_xml, xml_find_all and xml_text
fileContent <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
fileContent <- xml_find_all(fileContent, ".//speech")
fileContent <- xml_text(fileContent)
migrant_count <- stringr::str_count(tolower(fileContent), "migrant")
total_migrant_count <- sum(migrant_count)
print(total_migrant_count) # -> 22
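
As an alternative to lowercasing the text first, the pattern itself can be made case-insensitive. This is only a minimal sketch, assuming the stringr package is installed; the `speeches` vector is invented for illustration and is not the debate data.

```r
library(stringr)

# Toy data (hypothetical)
speeches <- c("Migrants and migrant workers", "No match", "MIGRANT")

# regex(..., ignore_case = TRUE) matches regardless of case,
# and str_count() returns the number of matches per element
per_speech <- str_count(speeches, regex("migrant", ignore_case = TRUE))

per_speech       # 2 0 1
sum(per_speech)  # 3
```

Keeping the per-element counts around (rather than only the sum) also makes it easy to see which speeches mention the word more than once.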