Delete duplicate word, comma and whitespace

How can I delete all the duplicate words alongside the following comma and whitespace using Regex in R?

So far I have come up with the following regular expression, which matches the duplicate but not the comma and whitespace:

    (\b\w+\b)(?=[\S\s]*\b\1\b)

An example list would be:

    blue, red, blue, yellow, green, blue

The output should look like:

    blue, red, yellow, green

So in this case it would have to match two of the "blue" entries, along with the comma and whitespace that follow (if any).

Solution:

It depends on whether your list is truly a vector or a single comma-separated string.

    # if your data is already a vector
    v <- c("blue", "red", "blue", "yellow", "green", "blue")

    unique(v)
    # [1] "blue"   "red"    "yellow" "green"

    # if your data is a comma-separated string
    s <- "blue, red, blue, yellow, green, blue"

    # if the output needs to be a vector
    unique(strsplit(s, ", ")[[1]])
    # [1] "blue"   "red"    "yellow" "green"

    # if the output needs to be a string again
    paste(unique(strsplit(s, ", ")[[1]]), collapse = ", ")
    # [1] "blue, red, yellow, green"
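If a regex-only approach is really needed, here is a rough sketch rather than a definitive solution. Extending the lookahead from the question with an optional comma and whitespace removes every word that appears again later, so the last copy survives rather than the first. To keep the first copy, one option is to apply a backreference substitution repeatedly until the string stops changing. The helper name `dedupe_keep_first` is made up for this illustration, and both variants assume PCRE (`perl = TRUE`) and items made of word characters only.

```r
s <- "blue, red, blue, yellow, green, blue"

# Variant 1: the question's lookahead plus an optional comma and whitespace.
# Every word that occurs again LATER is deleted, so the LAST copy is kept.
keep_last <- gsub("\\b(\\w+)\\b(?=[\\S\\s]*\\b\\1\\b),?\\s*", "", s, perl = TRUE)
keep_last
# [1] "red, yellow, green, blue"

# Variant 2 (hypothetical helper): keep the FIRST copy instead.
# Each pass finds a word, the shortest stretch after it, and a later
# ", <same word>", then drops that later copy. Repeat until nothing changes.
dedupe_keep_first <- function(x) {
  pattern <- "\\b(\\w+)\\b(.*?),\\s*\\1\\b"
  repeat {
    out <- sub(pattern, "\\1\\2", x, perl = TRUE)
    if (identical(out, x)) return(out)
    x <- out
  }
}

keep_first <- dedupe_keep_first(s)
keep_first
# [1] "blue, red, yellow, green"
```

The loop makes one substitution per removed duplicate, so for anything beyond short strings the `strsplit()` and `unique()` approach above is simpler and easier to read.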

Example based on a list column in a data.table or data.frame:

    library(data.table)

    dt <- data.table(
      id = 1:5,
      colors = list(
        c("blue", "red", "blue", "yellow", "green", "blue"),
        c("blue", "blue", "yellow", "green", "blue"),
        c("blue", "red", "blue", "yellow"),
        c("red", "red", "yellow", "yellow", "green", "blue"),
        c("black")
      )
    )

    ## using data.table
    # assign to colors instead of clean_list to overwrite the existing column
    dt[, clean_list := lapply(colors, unique)]

    ## using dplyr (alternative to the data.table line above)
    library(dplyr)
    # assign to colors instead of clean_list to overwrite the existing column
    # mutate() returns a new object; assign it (dt <- ...) to keep the result
    dt %>% mutate(clean_list = lapply(colors, unique))

    dt
    #    id                           colors            clean_list
    # 1:  1  blue,red,blue,yellow,green,blue blue,red,yellow,green
    # 2:  2      blue,blue,yellow,green,blue     blue,yellow,green
    # 3:  3             blue,red,blue,yellow       blue,red,yellow
    # 4:  4 red,red,yellow,yellow,green,blue red,yellow,green,blue
    # 5:  5                            black                 black
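If the column holds comma-separated strings rather than a list of vectors, the same split, `unique()`, paste idea can be applied to every row at once. This is a minimal base R sketch: the data frame `df` and the helper `dedupe_string` are invented for the illustration and are not part of the original question.

```r
# Hypothetical data: one comma-separated string per row
df <- data.frame(
  id = 1:3,
  colors = c("blue, red, blue, yellow", "red, red, green", "black"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string on a comma plus optional whitespace,
# drop repeated items (first occurrence wins), and glue the rest back together.
dedupe_string <- function(x) {
  parts <- strsplit(x, ",\\s*", perl = TRUE)
  vapply(parts, function(p) paste(unique(p), collapse = ", "), character(1))
}

df$clean <- dedupe_string(df$colors)
df$clean
# [1] "blue, red, yellow" "red, green"        "black"
```

`vapply()` with `character(1)` guarantees one string per row, so the result can be assigned straight back as a regular character column.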