I want to merge rows together that have the same word in another column. The solution should be in R Base. The table entries are all comma separated strings (characters), not lists. So as shown below the color shades for the same color should be added in a string in one row, instead of spanning over several rows. Also there should be no duplicates in the color shades column.
I have already tried with:
aggregate(df["Color shades"], df["Color"], paste, collapse=", ")
As well as with:
aggregate(Color shades ~ Color ,df ,toString)
But that didn’t lead to the desired result.
Dataframe:
df <- data.frame(colorshades = c("turquoise, babyblue", "royal blue, true blue",
"navy blue, true blue"), colors = c("blue", "blue", "blue"))
Currently:
| Color shades | Color |
|---|---|
| turquoise, babyblue | blue |
| royal blue, true blue | blue |
| navy blue, true blue | blue |
Desired Output:
| Color shades | Color |
|---|---|
| turquoise, babyblue, royal blue, true blue, navy blue | blue |
>Solution :
-
Convert
"Color shades"to a list-column:lapply(strsplit(df[["Color shades"]], ","), trimws) # [[1]] # [1] "turquoise" "babyblue" # [[2]] # [1] "royal blue" "true blue" # [[3]] # [1] "navy blue" "true blue" df[["Color shades"]] <- lapply(strsplit(df[["Color shades"]], ","), trimws) df # Color shades Color # 1 turquoise, babyblue blue # 2 royal blue, true blue blue # 3 navy blue, true blue blue -
Aggregate with
unique:aggregate(df["Color shades"], df["Color"], function(z) paste(unique(unlist(z)), collapse=", ")) # Color Color shades # 1 blue turquoise, babyblue, royal blue, true blue, navy blueor, keeping with the list-column method,
aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z)))) # Color Color shades # 1 blue turquoise, babyblue, royal blue, true blue, navy blue str(aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z))))) # 'data.frame': 1 obs. of 2 variables: # $ Color : chr "blue" # $ Color shades:List of 1 # ..$ : chr "turquoise" "babyblue" "royal blue" "true blue" ...
There are often (but not always) advantages to dealing with list-columns vice comma-separated values. If your use-case is such that you frequently want to look at individual elements within one of these fields, you’ll find yourself dealing deeply with regexes and/or repeatedly using strsplit to separate. With list-columns, one can use tools like unique and %in% with abandon (though admittedly one should really become more comfortable with lapply/sapply, and many base-R tools for aggregation do not always work consistently with list-columns).
Data
df <- structure(list(`Color shades` = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"), Color = c("blue", "blue", "blue")), class = "data.frame", row.names = c(NA, -3L))