Merge rows when same word appears in another column

December 9, 2021

I want to merge rows together that have the same word in another column. The solution should be in R Base. The table entries are all comma separated strings (characters), not lists. So as shown below the color shades for the same color should be added in a string in one row, instead of spanning over several rows. Also there should be no duplicates in the color shades column.

I have already tried with:

   aggregate(df["Color shades"], df["Color"], paste, collapse=", ")

As well as with:

   aggregate(Color shades ~ Color ,df ,toString)

But that didn’t lead to the desired result.

Dataframe:

    df <- data.frame(colorshades = c("turquoise, babyblue", "royal blue, true blue", 
                         "navy blue, true blue"), colors = c("blue", "blue", "blue"))

Currently:

Color shades	Color
turquoise, babyblue	blue
royal blue, true blue	blue
navy blue, true blue	blue

Desired Output:

Color shades	Color
turquoise, babyblue, royal blue, true blue, navy blue	blue

>Solution :

Convert "Color shades" to a list-column:

lapply(strsplit(df[["Color shades"]], ","), trimws)
# [[1]]
# [1] "turquoise" "babyblue" 
# [[2]]
# [1] "royal blue" "true blue" 
# [[3]]
# [1] "navy blue" "true blue"
df[["Color shades"]] <- lapply(strsplit(df[["Color shades"]], ","), trimws)
df
#            Color shades Color
# 1   turquoise, babyblue  blue
# 2 royal blue, true blue  blue
# 3  navy blue, true blue  blue

Aggregate with unique:

aggregate(df["Color shades"], df["Color"], function(z) paste(unique(unlist(z)), collapse=", "))
#   Color                                          Color shades
# 1  blue turquoise, babyblue, royal blue, true blue, navy blue

or, keeping with the list-column method,

aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z))))
#   Color                                          Color shades
# 1  blue turquoise, babyblue, royal blue, true blue, navy blue
str(aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z)))))
# 'data.frame': 1 obs. of  2 variables:
#  $ Color       : chr "blue"
#  $ Color shades:List of 1
#   ..$ : chr  "turquoise" "babyblue" "royal blue" "true blue" ...

There are often (but not always) advantages to dealing with list-columns vice comma-separated values. If your use-case is such that you frequently want to look at individual elements within one of these fields, you’ll find yourself dealing deeply with regexes and/or repeatedly using strsplit to separate. With list-columns, one can use tools like unique and %in% with abandon (though admittedly one should really become more comfortable with lapply/sapply, and many base-R tools for aggregation do not always work consistently with list-columns).

Data

df <- structure(list(`Color shades` = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"), Color = c("blue", "blue", "blue")), class = "data.frame", row.names = c(NA, -3L))