Delete duplicate word, comma and whitespace

December 8, 2021

How can I delete all the duplicate words alongside the following comma and whitespace using Regex in R?

So far I have come up with the following regular expression, which matches the duplicates but not the comma and whitespace:

    (\b\w+\b)(?=[\S\s]*\b\1\b)

An example list would be:

    blue, red, blue, yellow, green, blue

The output should look like:

    blue, red, yellow, green

So in this case it would have to match two of the "blue" entries, along with the comma and whitespace that follow (if any).

>Solution :

It depends on whether your data is really a list (vector) or a single string with commas.

    # if your data is already a vector
    v <- c("blue", "red", "blue", "yellow", "green", "blue")

    unique(v)
    # [1] "blue"   "red"    "yellow" "green"

    # if your data is a comma-separated string
    s <- "blue, red, blue, yellow, green, blue"

    # if the output needs to be a vector
    unique(strsplit(s, ", ")[[1]])
    # [1] "blue"   "red"    "yellow" "green"

    # if the output needs to be a string again
    paste(unique(strsplit(s, ", ")[[1]]), collapse = ", ")
    # [1] "blue, red, yellow, green"
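
If you really do want a regex, here is a minimal sketch. Note that a lookahead like the one in the question can only remove the earlier copies, which would leave "red, yellow, green, blue" rather than the requested output. To keep the first occurrence instead, the pattern below matches a word, whatever follows it, and a later ", word" repeat, then puts back everything except the repeat. `dedupe_regex` is a hypothetical helper name, and the sketch assumes single-line input whose items are plain `\w+` words separated by a comma and optional whitespace.

```r
# Hypothetical helper (not part of the original answer): remove later
# duplicates with a regex, keeping the first occurrence of each word.
dedupe_regex <- function(s) {
  # group 1: a whole word; group 2: the text up to a later ", <same word>"
  pattern <- "\\b(\\w+)\\b(.*?),\\s*\\1\\b"
  repeat {
    # each pass drops at most one repeat per leading word, so loop
    # until the string stops changing
    out <- gsub(pattern, "\\1\\2", s, perl = TRUE)
    if (identical(out, s)) return(out)
    s <- out
  }
}

s <- "blue, red, blue, yellow, green, blue"
dedupe_regex(s)
# [1] "blue, red, yellow, green"
```

The `strsplit()` / `unique()` version above is easier to read and is probably the better choice unless a regex is a hard requirement.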

An example for a list column in a data.table or data.frame:

    library(data.table)

    # sample data: a list column holding one character vector per row
    dt <- data.table(
      id = 1:5,
      colors = list(
        c("blue", "red", "blue", "yellow", "green", "blue"),
        c("blue", "blue", "yellow", "green", "blue"),
        c("blue", "red", "blue", "yellow"),
        c("red", "red", "yellow", "yellow", "green", "blue"),
        c("black")
      )
    )

    ## using data.table (adds the column to dt by reference)
    # use colors instead of clean_list to just fix the existing column
    dt[, clean_list := lapply(colors, unique)]

    ## using dplyr (returns the modified table; assign it to keep the result)
    library(dplyr)
    # use colors instead of clean_list to just fix the existing column
    dt %>% mutate(clean_list = lapply(colors, unique))

    dt
    #    id                           colors            clean_list
    # 1:  1  blue,red,blue,yellow,green,blue blue,red,yellow,green
    # 2:  2      blue,blue,yellow,green,blue     blue,yellow,green
    # 3:  3             blue,red,blue,yellow       blue,red,yellow
    # 4:  4 red,red,yellow,yellow,green,blue red,yellow,green,blue
    # 5:  5                            black                 black
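
If the column holds comma-separated strings rather than a list, the same idea can be applied row by row. This is only a sketch in base R: the data frame `df` and the column name `clean` are made up for illustration, and it assumes the items are separated by a comma plus optional whitespace.

```r
# Hypothetical example data: one comma-separated string per row
df <- data.frame(
  id = 1:3,
  colors = c("blue, red, blue", "red, red, yellow", "black"),
  stringsAsFactors = FALSE
)

# split each string, drop the repeats, and paste the pieces back together
df$clean <- vapply(
  strsplit(df$colors, ",\\s*"),
  function(x) paste(unique(x), collapse = ", "),
  character(1)
)

df$clean
# [1] "blue, red"   "red, yellow" "black"
```

`vapply()` with `character(1)` is used here instead of `sapply()` so the result is guaranteed to be a character vector with one element per row.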