Hunspell Workaround Empty Suggestion out of bounds error in R

I am trying to automatically spell-check a string column of a data.table/data.frame.

Looking around, I found several approaches that all give a "subscript out of bounds" error when hunspell_suggest returns no suggestions for a word (that is, an empty character vector, e.g. for "pippasnjfjsfiadjg"). The accepted answer in one of the threads I found yields NA instead, so it does work in principle.

It seems we need unlist to identify these empty suggestions and then exclude them from the part of the code that picks the first suggestion, but I cannot figure out how.

library(dplyr)
library(stringi)
library(hunspell)

df1 <- data.frame("Index" = 1:7, "Text" = c("pippasnjfjsfiadjg came to dinner with us tonigh.",
                                            "Wuld you like to trave with me?",
                                            "There is so muh to undestand.",
                                            "Sentences cone in many shaes and sizes.",
                                            "Learnin R is fun",
                                            "yesterday was Friday",
                                            "bing search engine"),
                  stringsAsFactors = FALSE)

# Get bad words.
badwords <- hunspell(df1$Text) %>% unlist

# Extract the first suggestion for each bad word.
suggestions <- sapply(hunspell_suggest(badwords), "[[", 1)

mutate(df1, Text = stri_replace_all_fixed(str = Text,
                                          pattern = badwords,
                                          replacement = suggestions,
                                          vectorize_all = FALSE)) -> out

Solution:

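You'll want to filter both the bad words and the suggestions to drop the words that have no suggestions. Extracting with '[' instead of '[[' returns NA for an empty suggestion vector rather than raising an error, so the NA entries can then be removed from both vectors: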

badwords <- hunspell(df1$Text) %>% unlist()
# note use of '[' rather than '[['
suggestions <- sapply(hunspell_suggest(badwords), '[', 1)

badwords <- badwords[!is.na(suggestions)]
suggestions <- suggestions[!is.na(suggestions)]
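
To see why the switch from '[[' to '[' matters, here is a minimal base R sketch that does not call hunspell at all. The list `sugg` is a made-up stand-in for what hunspell_suggest might return (one character vector per bad word, with an empty one for a word that has no suggestions); the words in it are illustrative assumptions, not real hunspell output.

```r
# Hypothetical stand-in for hunspell_suggest() output: one character vector
# per bad word. The first word has no suggestions at all.
sugg <- list(character(0), c("tonight", "tonic"), c("Would", "Wild"))

# `[[` fails on the empty element because position 1 does not exist.
first_strict <- tryCatch(
  sapply(sugg, "[[", 1),
  error = function(e) conditionMessage(e)
)

# `[` returns NA for a missing position, so the result keeps one entry
# per bad word and stays aligned with the vector of bad words.
first_safe <- sapply(sugg, "[", 1)

# Logical mask of the words that actually have a suggestion.
keep <- !is.na(first_safe)
first_safe[keep]
```

The same `keep` mask can be applied to the vector of bad words, which is what the two filtering lines in the answer do, so that pattern and replacement have the same length when passed to stri_replace_all_fixed.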