Rowwise comparison of the length of a string against a list of string lengths

Consider the following data frame with two columns of strings of variable length:

library("tidyverse")

df <- tibble(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
             ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))

# # A tibble: 4 × 2
# REF                               ALT                                                                
# <chr>                             <chr>                                                              
# 1 TTG                               T                                                                  
# 2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
# 3 T                                 TTG                                                                
# 4 TTGTGTGTGTGTGTGTGTGTGT            TTGTGTGTGTGTGTGTGTGTGTGT  

Unlike column REF, column ALT sometimes contains several strings separated by commas (e.g. row 2).

I want to compare the length of strings in REF (REF_LEN) and ALT (ALT_LEN), and generate a TYPE column with values:

  • "SNM" when REF_LEN = ALT_LEN
  • "INS" when REF_LEN < ALT_LEN
  • "DEL" when REF_LEN > ALT_LEN

When ALT contains several strings, I want the new TYPE column to contain one type per string, also separated by commas. The expected output here would be:

"DEL"     "INS,DEL" "INS"     "INS"

So far, I know how to get the lengths of the values in ALT, but I fail at collapsing them: the output contains the lengths of all ALTs in the table (i.e. 1,35,31,3,24) in every row, not just the lengths from that row:

df %>%
  dplyr::mutate(REF_LEN = str_length(REF),
                ALT_LEN = str_split(ALT, ","),
                ALT_LEN = purrr::map(ALT_LEN, str_length) %>% unlist() %>% paste(collapse = ","))

The code above is incomplete, as you can see. I also tried a different direction, using a helper function that builds the TYPE column described above. It returns many errors and I am not sure why, since it seems to work nicely on the values of ALT_LEN individually:

name <- function(alt_lens, ref_len) {
  alt_lens <- unlist(alt_lens)
  ifelse(alt_lens < ref_len, "DEL", ifelse(alt_lens > ref_len, "INS", "SNM"))
}

df %>%
  dplyr::mutate(REF_LEN = str_length(REF),
                ALT_LEN = str_split(ALT, ","),
                TYPE = purrr::map(ALT_LEN, str_length) %>% name(REF_LEN))

Any ideas? thanks!

>Solution :

Update: Removed first answer. Thanks to akrun for pointing that out! The concept is the same, using nchar with case_when; the difference is that ALT is first split into one row per string with separate_rows from the tidyr package:

library(dplyr)
library(tidyr)

df %>% 
  mutate(id = row_number()) %>% 
  separate_rows(ALT, sep = ",") %>% 
  mutate(TYPE = case_when(nchar(REF) == nchar(ALT) ~ "SNM",
                          nchar(REF) <  nchar(ALT) ~ "INS",
                          nchar(REF) >  nchar(ALT) ~ "DEL",
                          TRUE ~ NA_character_)) %>% 
  group_by(id) %>% 
  mutate(TYPE = toString(TYPE)) %>% 
  slice(1)

  REF                               ALT                                    id TYPE    
  <chr>                             <chr>                               <int> <chr>   
1 TTG                               T                                       1 DEL     
2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT     2 INS, DEL
3 T                                 TTG                                     3 INS     
4 TTGTGTGTGTGTGTGTGTGTGT            TTGTGTGTGTGTGTGTGTGTGTGT                4 INS  
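
Note that toString() joins with ", ", so row 2 reads "INS, DEL" rather than the "INS,DEL" asked for, and slice(1) keeps only the first ALT string of each row. As for the helper in the question, a likely cause of the errors is that unlist() flattens the lengths of all rows into one vector of 5 values, which no longer lines up with the 4 values of REF_LEN. Below is a minimal sketch of a row-wise alternative that pairs each REF with its own ALT using purrr::map2_chr. The helper name classify_alts is hypothetical, not part of any package, and the sketch assumes ALT strings are always separated by a plain comma:

```r
library(tibble)
library(dplyr)
library(purrr)
library(stringr)

df <- tibble(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
             ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))

# Hypothetical helper: classify every ALT string of ONE row against that row's REF
classify_alts <- function(ref, alt) {
  ref_len  <- str_length(ref)
  # Split this row's ALT on commas and measure each piece
  alt_lens <- str_length(str_split(alt, ",")[[1]])
  types <- case_when(alt_lens == ref_len ~ "SNM",
                     alt_lens >  ref_len ~ "INS",
                     alt_lens <  ref_len ~ "DEL")
  # Collapse the per-string types back into a single comma-separated value
  paste(types, collapse = ",")
}

# map2_chr walks REF and ALT in parallel, so each comparison stays within its row
result <- df %>%
  mutate(TYPE = map2_chr(REF, ALT, classify_alts))

result$TYPE
# [1] "DEL"     "INS,DEL" "INS"     "INS"
```

Because nothing is split into extra rows, the original ALT column is left untouched and no grouping or slice(1) step is needed. If you prefer to stay with the separate_rows approach, replacing toString(TYPE) with paste(TYPE, collapse = ",") gives the same separator.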