I am trying to build a complex sql query for hours but still didn’t find any way to do it as expected.
Here is my table and my dataset :
create table Skills
(
ID varchar(10),
StartDate date,
EndDate date,
Skill varchar(10)
);
Insert into Skills values
('1','2021-01-01','2021-12-31','A'),
('1','2022-01-01','2022-12-31','B'),
('2','2021-01-01','2021-12-31','A'),
('2','2021-11-30','2022-12-31','B'),
('3','2021-01-01','2021-12-31','A'),
('3','2021-11-30','2022-12-31','B'),
('3','2022-11-30','2023-12-31','C'),
('4','2021-01-01','2021-12-31','A'),
('4','2022-01-01','2022-12-31','B'),
('4','2022-11-30','2023-12-31','C');
I would like to aggregate rows by ID only when dates range (StartDate, EndDate) overlap.
Here is the expected result :
1, 2021-01-01, 2021-12-31, A
1, 2022-01-01, 2022-12-31, B
2, 2021-01-01, 2022-12-31, B
3, 2021-01-01, 2023-12-31, C
4, 2021-01-01, 2021-12-31, A
4, 2022-01-01, 2023-12-31, C
When rows with overlapping dates range are aggregated, we need to keep the oldest StartDate, the newest EndDate and the Skill associated to the newest EndDate.
I tried so many queries with partition by, lag, cte, etc.
Could you help me find the right solution please ?
Thanks,
Regards
>Solution :
This is a gaps and islands problem, to solve it you can use lag() to determine where the "islands" start, Then use a cumulative sum() to determine gaps :
select id, min(start_date) as start_date, max(end_date) as end_date, max(Skill) as skill
from (
select d.*,
sum(case when DATEDIFF(prev_end_date, start_date) > 0 then 0 else 1 end) over (partition by id order by start_date) as grp
from (
select d.*,
lag(end_date) over (partition by id order by start_date) as prev_end_date
from Skills d
) d
) d
group by id, grp
order by id, min(start_date);