Merge rows if previous row contains a string that starts with a particular sign

March 24, 2023

I have a data frame that looks like this:

df <- as.data.frame(rbind(">A1", "aaaa", "bbb", "cccc",
            ">B2", "dddd", "eeeee","ff",
            ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"))

This is what I want to get:

df1 <- as.data.frame(rbind(">A1", "aaaabbbcccc",
            ">B2", "ddddeeeeeff",
            ">C3", "ggggggghhhhhiiiiijjjjj"))

As you can see, I want to merge all the rows that lie between two rows whose string starts with a ">" sign.
Frankly, I don’t know where to start with this.
Please advise.

Solution:

We can use cumsum(grepl(.)) to give every block of rows its own group number, then collapse each group.

data.frame(
  V1 = unlist(
    by(df$V1, cumsum(grepl("^>", df$V1)),
       function(z) c(z[1], paste(z[-1], collapse = "")))
  )
)
#                        V1
# 11                    >A1
# 12            aaaabbbcccc
# 21                    >B2
# 22            ddddeeeeeff
# 31                    >C3
# 32 ggggggghhhhhiiiiijjjjj

Brief explanation:

grepl(.) returns TRUE for each cell that starts with > (the ^ in the pattern anchors the match to the start of the string); then

cumsum turns those into a running count, so each >-row and all rows up to the next >-row share the same number:

grepl("^>", df$V1)
#  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
cumsum(grepl("^>", df$V1))
#  [1] 1 1 1 1 2 2 2 2 3 3 3 3 3

by(.) applies a function to each of those groups; in this case, the function returns a vector of length 2, with the >-string first and all the other strings concatenated into one.
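
To make the grouping step concrete, here is a minimal sketch that uses split() and lapply() in place of by(.). This is a swapped-in way to walk the same groups, not the code from the answer itself. The names x, grp, pieces and merged are made up for illustration, and as.character() is only a guard in case V1 is a factor on older R versions.

```r
df <- as.data.frame(rbind(">A1", "aaaa", "bbb", "cccc",
                          ">B2", "dddd", "eeeee", "ff",
                          ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"))

# Assumption: work on a plain character vector (guards against factors in R < 4.0)
x <- as.character(df$V1)

# Group id: the running count goes up by one at every row that starts with ">"
grp <- cumsum(grepl("^>", x))

# split() returns one character vector per group, named "1", "2", "3"
pieces <- split(x, grp)

# Same per-group function as in the answer: header first, the rest pasted together
merged <- lapply(pieces, function(z) c(z[1], paste(z[-1], collapse = "")))

merged[["1"]]
# [1] ">A1"         "aaaabbbcccc"
```

Printing pieces before collapsing is a quick way to check that the pattern puts the group boundaries where you expect them.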

The result has the same structure as your df1; only the row names differ (11, 12, 21, ... instead of 1 to 6):

df1
#                       V1
# 1                    >A1
# 2            aaaabbbcccc
# 3                    >B2
# 4            ddddeeeeeff
# 5                    >C3
# 6 ggggggghhhhhiiiiijjjjj
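
If the 11, 12, 21, ... row names in the by(.) result are unwanted, one option is to reset them. The sketch below assumes the merged values arrive as a named character vector; res and out are hypothetical names, and res is typed in by hand as a stand-in for the unlist(by(.)) result shown above.

```r
# Stand-in for the unlist(by(.)) result: values plus the names "11", "12", ...
res <- c("11" = ">A1", "12" = "aaaabbbcccc",
         "21" = ">B2", "22" = "ddddeeeeeff",
         "31" = ">C3", "32" = "ggggggghhhhhiiiiijjjjj")

# data.frame() carries the vector's names over as row names
out <- data.frame(V1 = res, stringsAsFactors = FALSE)
rownames(out)
# [1] "11" "12" "21" "22" "31" "32"

# Assigning NULL restores the default row names 1..n
rownames(out) <- NULL
rownames(out)
# [1] "1" "2" "3" "4" "5" "6"
```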