I have a data frame that looks like this:
df <- as.data.frame(rbind(">A1", "aaaa", "bbb", "cccc",
                          ">B2", "dddd", "eeeee", "ff",
                          ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"))
This is what I want to get:
df1 <- as.data.frame(rbind(">A1", "aaaabbbcccc",
                           ">B2", "ddddeeeeeff",
                           ">C3", "ggggggghhhhhiiiiijjjjj"))
As you can see, I want to merge all the rows between two rows that contain a string starting with a ">" sign.
Frankly, I don’t know where to start with this.
Please advise.
Solution:
We can use cumsum(grepl(.)) for this.
data.frame(
  V1 = unlist(
    by(df$V1, cumsum(grepl("^>", df$V1)),
       function(z) c(z[1], paste(z[-1], collapse = "")))
  )
)
# V1
# 11 >A1
# 12 aaaabbbcccc
# 21 >B2
# 22 ddddeeeeeff
# 31 >C3
# 32 ggggggghhhhhiiiiijjjjj
Brief explanation:
- grepl(.) returns TRUE for each of the >-containing cells; then
- cumsum assigns that row and all rows until the next occurrence the same number:
  grepl(">", df$V1)
  # [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
  cumsum(grepl(">", df$V1))
  # [1] 1 1 1 1 2 2 2 2 3 3 3 3 3
- by(.) does something to each of those groups; in this case, it returns a vector of length 2, with the >-string first and all others concatenated.
The result has the same structure as your df1 (only the row names differ):
df1
# V1
# 1 >A1
# 2 aaaabbbcccc
# 3 >B2
# 4 ddddeeeeeff
# 5 >C3
# 6 ggggggghhhhhiiiiijjjjj
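If the row names should also match df1 (1 through 6 rather than 11, 12, 21, ...), the same cumsum(grepl(.)) grouping can be combined with split() and lapply() instead of by(). This is only a sketch of one possible variant, not part of the answer above; the names grp, merged, and out are made up for illustration, and the as.character() call is an assumption to cover older R versions where the column could be a factor.

```r
# Rebuild the example input from the question.
df <- as.data.frame(rbind(">A1", "aaaa", "bbb", "cccc",
                          ">B2", "dddd", "eeeee", "ff",
                          ">C3", "ggggggg", "hhhhh", "iiiii", "jjjjj"))

# Group id: increases by one at every header row (a row starting with ">").
grp <- cumsum(grepl("^>", df$V1))

# For each group, keep the header and collapse the remaining rows into one string.
# as.character() is a precaution in case V1 is a factor (R < 4.0 defaults).
merged <- lapply(split(as.character(df$V1), grp),
                 function(z) c(z[1], paste(z[-1], collapse = "")))

# use.names = FALSE drops the "11", "12", ... names, so rows get default numbering.
out <- data.frame(V1 = unlist(merged, use.names = FALSE),
                  stringsAsFactors = FALSE)
out
```

Here split() does the per-group partitioning that by() does in the answer, and dropping the names in unlist() is what lets data.frame() fall back to its default row numbering.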