Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Compare one row of a column against all others in group

I am trying to calculate the number of days over which all objects in a group overlap with each member of the group. To do this I want to compare each row of one column in a group, to each other row in that column in the same group. However, I am unable to come up with a simple solution for this; most of my effort has been with the map variants from purrr. Aside from that I have gone down some nested loop (:-/), nested apply rabbit holes; but I suspect there is a very simple way to accomplish this comparison.

Essentially I want the sum of the intersect of each interval in a group to one row of the group.

Input data: (format with intervals)

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID Group year  interval_obs  
1   A   2020 2020-04-29 UTC--2020-05-19 UTC  
2   A   2020 2020-05-04 UTC--2020-05-29 UTC  
3   A   2020 2020-05-09 UTC--2020-05-24 UTC  
4   A   2020 2020-04-24 UTC--2020-04-28 UTC  
5   A   2020 2020-05-30 UTC--2020-06-03 UTC  
6   B   2020 2019-12-31 UTC--2020-01-20 UTC  
7   B   2020 2020-01-10 UTC--2020-01-30 UTC  
8   B   2020 2020-01-20 UTC--2020-02-09 UTC  
9   B   2020 2020-01-15 UTC--2020-02-04 UTC

Input data (more human readable?) – where each start/end is the Day of Year (doy)

ID Group Year start end
1   A   2020  120  140
2   A   2020  125  150
3   A   2020  130  145
4   A   2020  115  119
5   A   2020  151  155
6   B   2020    0   20
7   B   2020   10   30
8   B   2020   20   40
9   B   2020   15   35 

Desired Results:

ID  total_overlap  
  1   25  
  2   30  
  3   30  
  4    0  
  5    0  
  6   15  
  7   35  
  8   25  
  9   35  

note the desired total overlap is in days, the sum of all days the 4 other observations in group A overlap. Group B with 4 records to indicate variable lengths.

example data for problem

data <- structure(list(
  ID = 1:9,
  group = c("A", "A", "A", "A", "A", "B", "B", "B", "B"), 
  year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L,  2020L, 2020L, 2020L), 
  start = c(120L, 125L, 130L, 115L, 151L, 0L, 10L, 20L, 15L),
  end = c(140L, 150L, 145L, 119L, 155L, 20L,  30L, 40L, 35L)),
  class = "data.frame", 
  row.names = c(NA, -9L))

data <- data %>% 
  group_by(group, year) %>% # real dataset has several combos - both vars left as reminder
  mutate(interval_obs2 = interval_obs) %>%
  mutate(across(c(start, end), ~ as_date(., origin = paste0(year-1, "-12-31")))) %>%  #this year-1 term is due to leap years etc.
  mutate(interval_obs = interval(ymd(start), ymd(end))) %>% 
  dplyr::select(-start, -end)

output <- data %>% map(.x = .$interval_obs, # this code at least runs
              .f = ~{results = sum(as.numeric(intersect(.x, .y$interval_obs2)))})

The little chunk above is one of many types of way’s I have approached this (map2, map_df etc.), and while it does not work I imagine (…) a solution is in that ballpark. Note that my example output has two features: 1) units are converted to days, 2) the ‘self intersection’ is subtracted out. Do not worry about those features I have ways to do both of those, I just did not include those because they may obfuscate the problem. However if it helps…

mutate(self_intersection = as.numeric(intersect(interval_obs, interval_obs2))) %>% 
mutate(results = results - self_intersection) %>% #Not sure about this line but... will be similar 
mutate(total_overlap = as.numeric(results)/86400)) 

I have been trying to keep data in lubridate or another date format so that different temporal resolutions could be easily accommodated in the future (e.g. hours, minutes)

edit 2 – example of calculating overlap for Group A (data reproduced here)

ID Group Year start end
1   A   2020  120  140
2   A   2020  125  150
3   A   2020  130  145
4   A   2020  115  119
5   A   2020  151  155

for Group # 1, numbers after ‘comparision’ refer to ID

comparison 1 - 2. End1 - Start2 = 15 days  
comparison 1 - 3. End1 - Start2 = 10 days  
comparison 1 - 4. NO OVERLAP    =  0 days  
comparison 1 - 5. NO OVERLAP    =  0 days  
total_overlap                     25 days  

>Solution :

Is this what you are looking for?

The total overlap in the third line is off from your desired output, but that may be a typo?

library(tidyverse)
library(lubridate)

data |> 
  group_by(group) |> 
  mutate(total_overlap = map_dbl(interval_obs, 
                             \(x) x |> 
                               intersect(interval_obs) |> 
                               int_length() |> 
                               sum(na.rm = T) - int_length(x)
                               ) / 86400
         )
#> # A tibble: 9 × 5
#> # Groups:   group [2]
#>      ID group  year interval_obs                   total_overlap
#>   <int> <chr> <int> <Interval>                             <dbl>
#> 1     1 A      2020 2020-04-29 UTC--2020-05-19 UTC            25
#> 2     2 A      2020 2020-05-04 UTC--2020-05-29 UTC            30
#> 3     3 A      2020 2020-05-09 UTC--2020-05-24 UTC            25
#> 4     4 A      2020 2020-04-24 UTC--2020-04-28 UTC             0
#> 5     5 A      2020 2020-05-30 UTC--2020-06-03 UTC             0
#> 6     6 B      2020 2019-12-31 UTC--2020-01-20 UTC            15
#> 7     7 B      2020 2020-01-10 UTC--2020-01-30 UTC            35
#> 8     8 B      2020 2020-01-20 UTC--2020-02-09 UTC            25
#> 9     9 B      2020 2020-01-15 UTC--2020-02-04 UTC            35
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading