this is my table:
| Start | Stop | City |
|---|---|---|
| 2022-01-01 | 2022-02-15 | Rom |
| 2022-02-16 | 2022-03-31 | Rom |
| 2022-04-01 | 2022-05-10 | London |
| 2022-05-11 | 2022-06-11 | London |
| 2022-06-12 | 2022-07-10 | Paris |
| 2022-07-11 | 2022-08-10 | Rom |
I like to get this result:
| Start | Stop | City |
|---|---|---|
| 2022-01-01 | 2022-03-31 | Rom |
| 2022-04-01 | 2022-06-11 | London |
| 2022-06-12 | 2022-07-10 | Paris |
| 2022-07-11 | 2022-08-10 | Rom |
If i use:
SELECT City, MIN(Start) as STA, MAX(Stop) AS STO FROM living GROUP BY City
i get:
| Start | Stop | City |
|---|---|---|
| 2022-01-01 | 2022-08-10 | Rom |
| 2022-04-01 | 2022-06-11 | London |
| 2022-06-12 | 2022-07-10 | Paris |
>Solution :
WITH
islands AS
(
SELECT
*,
ROW_NUMBER() OVER ( ORDER BY start) AS seq,
ROW_NUMBER() OVER (PARTITION BY city ORDER BY start) AS city_seq
FROM
living
)
SELECT
city,
MIN(start),
MAX(stop)
FROM
islands
GROUP BY
city,
seq - city_seq
ORDER BY
MIN(start)
ROW_NUMBER() OVER (ORDER BY start) just creates a sequence of integers starting from 1, having ordered the data by start.
Adding PARTITION BY city does the same thing, but creates a separate sequence for each city.
The trick is then to use seq - city_seq. That tells you how many rows there have been so far that are NOT the current city.
| Start | Stop | City | seq | city_seq | s – cs |
|---|---|---|---|---|---|
| 2022-01-01 | 2022-02-15 | Rom | 1 | 1 | 0 |
| 2022-02-16 | 2022-03-31 | Rom | 2 | 2 | 0 |
| 2022-04-01 | 2022-05-10 | London | 3 | 1 | 2 |
| 2022-05-11 | 2022-06-11 | London | 4 | 2 | 2 |
| 2022-06-12 | 2022-07-10 | Paris | 5 | 1 | 4 |
| 2022-07-11 | 2022-08-10 | Rom | 6 | 3 | 3 |
Then, the combination of city and rows so far for other cities (seq - city_seq) is a unique identifier for each "island" and can be used in a normal GROUP BY
Edit:
The above assumes there are no gaps or overlaps in the data, and so ordering by start is sufficient.
If there are gaps or overlaps a slightly longer approach is more customisable…
- whenever the previous row isn’t ‘the same city, but 1 day later’, record a
1for the row to indicate a "new island". - cumulativel sum those values over time, creating unique identifiers for each "island"
GROUP BYas normal
WITH
islands AS
(
SELECT
*,
CASE
WHEN city = LAG(city) OVER (ORDER BY start)
AND start = LAG(stop) OVER (ORDER BY start) + INTERVAL '1 DAY'
THEN 0
ELSE 1
END
AS island_marker
FROM
living
),
island_ids AS
(
SELECT
*,
SUM(island_marker) OVER (ORDER BY start) AS island_id
FROM
islands
)
SELECT
MIN(city),
MIN(start),
MAX(stop)
FROM
island_ids
GROUP BY
island_id
ORDER BY
island_id
| Start | Stop | City | island_marker | island_id |
|---|---|---|---|---|
| 2022-01-01 | 2022-02-15 | Rom | 1 | 1 |
| 2022-02-16 | 2022-03-31 | Rom | 0 | 1 |
| 2022-04-01 | 2022-05-10 | London | 1 | 2 |
| 2022-05-11 | 2022-06-11 | London | 0 | 2 |
| 2022-06-12 | 2022-07-10 | Paris | 1 | 3 |
| 2022-07-11 | 2022-08-10 | Rom | 1 | 4 |