Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Rearranging a data frame by equal strings

I have a data frame with the following structure:

pos<- c(67,125,158,195,235,458,499,526,785,912,999,1525)
v_1<-c("j","c","v","r","s","q","r","r","s","t","u","v")
v_2<-c("c","t","v","r","s","q","r","w","c","c","o","v")
v_3<-c("z","c","v","r","s","q","r","w","c","b","p","v")
v_4<-c("x","w","z","z","s","q","r","w","c","o","t","v")
data<-as.data.frame(cbind(pos,v_1,v_2,v_3,v_4))

In this dataframe it is possible to find the same letters among the different columns in consecutive rows. I need to obtain a separate data frame with the values of the variable "pos" for these shared letters in continuos rows, as can bee seen in the figure:
enter image description here

In this figure even all the columsn have the same letter in pos 1525 Is not importat since is not in a consecutive row. What I need to get is the dataframe in the right.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Solution using tidyr and dplyr:

  1. After pivoting to long, use dplyr::add_count() to find repeated values within each pos;
  2. Within each v, find consecutive rows with repeated values, defined as: >1 repeat and >1 repeat in either preceding or following row;
  3. Create a column containing pos for consecutive rows and NA otherwise;
  4. Take the minimum and maximum to get start and end for each v.
library(tidyr)
library(dplyr)

data %>%
  pivot_longer(!pos, names_to = "v") %>%
  add_count(pos, value) %>%
  group_by(v) %>%
  mutate(consec = ifelse(
    n > 1 & (lag(n) > 1 | lead(n) > 1),
    pos,
    NA
  )) %>%
  summarize(
    start = min(consec, na.rm = TRUE),
    end = max(consec, na.rm = TRUE)
  )
# A tibble: 4 × 3
  v     start end  
  <chr> <chr> <chr>
1 v_1   125   499  
2 v_2   158   785  
3 v_3   125   785  
4 v_4   235   785  

Note, not sure if/how you want to handle if there is more than one set of consecutive rows, so this solution doesn’t address that.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading