Creating new column by splitting a `chr` column, finding unique values, sorting them, removing certain values, and combining them back into one string

I’m working in R, using tidyverse and dplyr functions to generate new columns, but I’m running into a wall when trying to find unique values within a string column. Here’s a detailed description of the problem.

Setup

Suppose I have a tibble called df, with a chr column called col1 that contains strings. These strings are, in reality, a list of values that are separated by a comma (", "). Here’s what df looks like:

library(tidyverse)
library(dplyr)

df = data.frame(id=c(1,2,3,4,5),
                col1=c("a, b, x, a","b, b","c, b, b, b", "b, x, b, c", "c")) %>%
  as_tibble()

print(df)

# # A tibble: 5 x 2    
#      id col1         
#   <dbl> <chr>        
# 1     1 a, b, x, a      
# 2     2 b, b         
# 3     3 c, b, b, b
# 4     4 b, x, b, c      
# 5     5 c            

The problem

I want to separate the values in col1 wherever we find ", ", remove any duplicate values, sort the unique values, remove the "x" values, then concatenate them back together into a string using ", " as a delimiter between the multiple unique items.

In more practical terms, I would like to create a column as seen below in col2:

# # A tibble: 5 x 3            
#      id col1          col2   
#   <dbl> <chr>         <chr>  
# 1     1 a, b, x, a    a, b   
# 2     2 b, b          b      
# 3     3 c, b, b, b    b, c   
# 4     4 b, x, b, c    b, c
# 5     5 c             c      

My attempt so far

If I just have a string variable, I know that I can do all of the processing in a couple of steps:

x = "b, x, b, c"
x_temp = unique(strsplit(x, ", ")[[1]])
x_simp = paste(sort(x_temp[x_temp != "x"]), collapse=", ")
print(x_simp)
# [1] "b, c"

However, I’m having a hard time translating this process back into the mutate function:

newdf = df %>% 
  mutate(col2 = paste(sort(unique(strsplit(col1, ", ")[[1]])[unique(strsplit(col1, ", ")[[1]]) != "x"]), collapse=", "))

# A tibble: 5 x 3
#    id col1              col2 
# <dbl> <chr>             <chr>
#   1   1 a, b, x, a      a, b 
#   2   2 b, b            a, b 
#   3   3 c, b, b, b      a, b 
#   4   4 b, x, b, c      a, b 
#   5   5 c               a, b 

Summary

How can I use tidyverse/dplyr functions to generate a new column that is the result of the following processing steps on one of a tibble/df‘s columns:

  • split a string/character column using a custom delimiter,
  • find the unique values,
  • sort them,
  • remove unwanted values,
  • combine them back into a single string/character value using a custom delimiter
  • place the result in a new column

>Solution :

We may do this in tidyverse, by splitting with separate_rows, and then do a group by paste after removing the duplicates

library(dplyr)
library(tidyr)
library(stringr)
df %>%
   mutate(col2 = col1) %>% 
   separate_rows(col2) %>%
   distinct(across(everything())) %>% 
   group_by(id, col1) %>% 
   summarise(col2 = str_c(sort(col2[col2 != "x"]), collapse = ", "),
       .groups = 'drop')

-output

# A tibble: 5 × 3
     id col1       col2 
  <dbl> <chr>      <chr>
1     1 a, b, x, a a, b 
2     2 b, b       b    
3     3 c, b, b, b b, c 
4     4 b, x, b, c b, c 
5     5 c          c    

Leave a Reply