SQL: Need help to group by

March 2, 2023

this is my table:

Start	Stop	City
2022-01-01	2022-02-15	Rom
2022-02-16	2022-03-31	Rom
2022-04-01	2022-05-10	London
2022-05-11	2022-06-11	London
2022-06-12	2022-07-10	Paris
2022-07-11	2022-08-10	Rom

I like to get this result:

Start	Stop	City
2022-01-01	2022-03-31	Rom
2022-04-01	2022-06-11	London
2022-06-12	2022-07-10	Paris
2022-07-11	2022-08-10	Rom

If i use:

SELECT City, MIN(Start) as STA, MAX(Stop) AS STO FROM living GROUP BY City

i get:

Start	Stop	City
2022-01-01	2022-08-10	Rom
2022-04-01	2022-06-11	London
2022-06-12	2022-07-10	Paris

>Solution :

WITH
  islands AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (                  ORDER BY start) AS seq,
    ROW_NUMBER() OVER (PARTITION BY city ORDER BY start) AS city_seq
  FROM
    living
)
SELECT
  city,
  MIN(start),
  MAX(stop)
FROM
  islands
GROUP BY
  city,
  seq - city_seq
ORDER BY
  MIN(start)

ROW_NUMBER() OVER (ORDER BY start) just creates a sequence of integers starting from 1, having ordered the data by start.

Adding PARTITION BY city does the same thing, but creates a separate sequence for each city.

The trick is then to use seq - city_seq. That tells you how many rows there have been so far that are NOT the current city.

Start	Stop	City	seq	city_seq	s – cs
2022-01-01	2022-02-15	Rom	1	1	0
2022-02-16	2022-03-31	Rom	2	2	0
2022-04-01	2022-05-10	London	3	1	2
2022-05-11	2022-06-11	London	4	2	2
2022-06-12	2022-07-10	Paris	5	1	4
2022-07-11	2022-08-10	Rom	6	3	3

Then, the combination of city and rows so far for other cities (seq - city_seq) is a unique identifier for each "island" and can be used in a normal GROUP BY

Edit:

The above assumes there are no gaps or overlaps in the data, and so ordering by start is sufficient.

If there are gaps or overlaps a slightly longer approach is more customisable…

whenever the previous row isn’t ‘the same city, but 1 day later’, record a 1 for the row to indicate a "new island".
cumulativel sum those values over time, creating unique identifiers for each "island"
GROUP BY as normal

WITH
  islands AS
(
  SELECT
    *,
    CASE
      WHEN city  = LAG(city) OVER (ORDER BY start)
       AND start = LAG(stop) OVER (ORDER BY start) + INTERVAL '1 DAY'
      THEN 0
      ELSE 1
    END
      AS island_marker
  FROM
    living
),
  island_ids AS
(
  SELECT
    *,
    SUM(island_marker) OVER (ORDER BY start) AS island_id
  FROM
    islands
)
SELECT
  MIN(city),
  MIN(start),
  MAX(stop)
FROM
  island_ids
GROUP BY
  island_id
ORDER BY
  island_id

Start	Stop	City	island_marker	island_id
2022-01-01	2022-02-15	Rom	1	1
2022-02-16	2022-03-31	Rom	0	1
2022-04-01	2022-05-10	London	1	2
2022-05-11	2022-06-11	London	0	2
2022-06-12	2022-07-10	Paris	1	3
2022-07-11	2022-08-10	Rom	1	4