Return only the unique words

June 2, 2022

Let's say I have a string and I only want the unique words in the sentence, as separate elements.

 a = "an apple is an apple"
word <- function(a){
  
  words<- c(strsplit(a,split = " "))
  return(unique(words))
}

word(a)

This returns

[[1]]
[1] "an"    "apple" "is"    "an"    "apple"

and the output I'm expecting is

'an','apple','is'

What am I doing wrong? I'd really appreciate any help.

Cheers!

>Solution :

The problem is that wrapping strsplit(.) in c(.) does not change the fact that the result is still a list, so unique operates at the list level, not the word level. With a single input string the list has only one element, so there is nothing for unique to drop.

c(strsplit(rep(a, 2), "\\s+"))
# [[1]]
# [1] "an"    "apple" "is"    "an"    "apple"
# [[2]]
# [1] "an"    "apple" "is"    "an"    "apple"
unique(c(strsplit(rep(a, 2), "\\s+")))
# [[1]]
# [1] "an"    "apple" "is"    "an"    "apple"
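
The structure can also be checked directly. This is a small sketch, assuming the same `a` as in the question; `parts` is just an illustrative variable name.

```r
a <- "an apple is an apple"

parts <- strsplit(a, "\\s+")
is.list(parts)            # TRUE: strsplit() returns a list, one element per input string
is.list(c(parts))         # TRUE: c() around a single list leaves it a list
length(c(parts))          # 1: a single list element, so unique() has nothing to remove
is.character(parts[[1]])  # TRUE: the words are the character vector inside element 1
```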

Alternatives:

If length(a) is always 1, then perhaps

unique(strsplit(a, "\\s+")[[1]])
# [1] "an"    "apple" "is"
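
Applied to the function from the question, that fix might look like the sketch below. It assumes the input is always a single string; the argument name `sentence` is only illustrative.

```r
# Corrected sketch of the question's word() function.
# Assumes `sentence` is a single string (length 1).
word <- function(sentence) {
  # [[1]] extracts the character vector of words from the list strsplit() returns
  words <- strsplit(sentence, "\\s+")[[1]]
  unique(words)
}

a <- "an apple is an apple"
word(a)
# [1] "an"    "apple" "is"
```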

If length(a) can be 2 or more and you want a list of unique words for each sentence, then

a2 <- c("an apple is an apple", "a pear is a pear", "an orange is an orange")
lapply(strsplit(a2, "\\s+"), unique)
# [[1]]
# [1] "an"    "apple" "is"   
# [[2]]
# [1] "a"    "pear" "is"  
# [[3]]
# [1] "an"     "orange" "is"

(Note: this always returns a list, regardless of the number of sentences in the input.)

If length(a) can be 2 or more and you want the unique words across all sentences, then
```
unique(unlist(strsplit(a2, "\\s+")))
# [1] "an"     "apple"  "is"     "a"      "pear"   "orange"
```
(Note: this method also works well when length(a) is 1.)
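
Since the unlist approach covers both the single-string and the multi-string case, it could be wrapped in one small function. This is only a sketch; `unique_words` is a hypothetical helper name, not something from base R.

```r
# unique_words() is a hypothetical helper: it flattens the list from
# strsplit() into one character vector, then removes duplicates.
unique_words <- function(x) {
  unique(unlist(strsplit(x, "\\s+")))
}

a  <- "an apple is an apple"
a2 <- c("an apple is an apple", "a pear is a pear", "an orange is an orange")

unique_words(a)
# [1] "an"    "apple" "is"
unique_words(a2)
# [1] "an"     "apple"  "is"     "a"      "pear"   "orange"
```

Note that the result is always a plain character vector, so any information about which sentence a word came from is lost; use the lapply version when that matters.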