Aggregate rows only if dates range overlap

July 11, 2023

I am trying to build a complex sql query for hours but still didn’t find any way to do it as expected.

Here is my table and my dataset :

create table Skills
(
ID varchar(10),
StartDate date,
EndDate date,
Skill varchar(10)
);

Insert into Skills values
('1','2021-01-01','2021-12-31','A'),
('1','2022-01-01','2022-12-31','B'),
('2','2021-01-01','2021-12-31','A'),
('2','2021-11-30','2022-12-31','B'),
('3','2021-01-01','2021-12-31','A'),
('3','2021-11-30','2022-12-31','B'),
('3','2022-11-30','2023-12-31','C'),
('4','2021-01-01','2021-12-31','A'),
('4','2022-01-01','2022-12-31','B'),
('4','2022-11-30','2023-12-31','C');

I would like to aggregate rows by ID only when dates range (StartDate, EndDate) overlap.
Here is the expected result :

1, 2021-01-01, 2021-12-31, A
1, 2022-01-01, 2022-12-31, B
2, 2021-01-01, 2022-12-31, B
3, 2021-01-01, 2023-12-31, C
4, 2021-01-01, 2021-12-31, A
4, 2022-01-01, 2023-12-31, C

When rows with overlapping dates range are aggregated, we need to keep the oldest StartDate, the newest EndDate and the Skill associated to the newest EndDate.

I tried so many queries with partition by, lag, cte, etc.

Could you help me find the right solution please ?

Thanks,
Regards

>Solution :

This is a gaps and islands problem, to solve it you can use lag() to determine where the "islands" start, Then use a cumulative sum() to determine gaps :

select id, min(start_date) as start_date, max(end_date) as end_date, max(Skill) as skill
from (
      select d.*,
      sum(case when DATEDIFF(prev_end_date, start_date) > 0 then 0 else 1 end)  over (partition by id order by start_date) as grp
      from (
            select d.*,
            lag(end_date) over (partition by id order by start_date) as prev_end_date
            from Skills d
      ) d
    ) d
group by id, grp
order by id, min(start_date);

Demo here