Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

SQL: Need help to group by

this is my table:

Start Stop City
2022-01-01 2022-02-15 Rom
2022-02-16 2022-03-31 Rom
2022-04-01 2022-05-10 London
2022-05-11 2022-06-11 London
2022-06-12 2022-07-10 Paris
2022-07-11 2022-08-10 Rom

I like to get this result:

Start Stop City
2022-01-01 2022-03-31 Rom
2022-04-01 2022-06-11 London
2022-06-12 2022-07-10 Paris
2022-07-11 2022-08-10 Rom

If i use:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

SELECT City, MIN(Start) as STA, MAX(Stop) AS STO FROM living GROUP BY City

i get:

Start Stop City
2022-01-01 2022-08-10 Rom
2022-04-01 2022-06-11 London
2022-06-12 2022-07-10 Paris

>Solution :

WITH
  islands AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (                  ORDER BY start) AS seq,
    ROW_NUMBER() OVER (PARTITION BY city ORDER BY start) AS city_seq
  FROM
    living
)
SELECT
  city,
  MIN(start),
  MAX(stop)
FROM
  islands
GROUP BY
  city,
  seq - city_seq
ORDER BY
  MIN(start)

ROW_NUMBER() OVER (ORDER BY start) just creates a sequence of integers starting from 1, having ordered the data by start.

Adding PARTITION BY city does the same thing, but creates a separate sequence for each city.

The trick is then to use seq - city_seq. That tells you how many rows there have been so far that are NOT the current city.

Start Stop City seq city_seq s – cs
2022-01-01 2022-02-15 Rom 1 1 0
2022-02-16 2022-03-31 Rom 2 2 0
2022-04-01 2022-05-10 London 3 1 2
2022-05-11 2022-06-11 London 4 2 2
2022-06-12 2022-07-10 Paris 5 1 4
2022-07-11 2022-08-10 Rom 6 3 3

Then, the combination of city and rows so far for other cities (seq - city_seq) is a unique identifier for each "island" and can be used in a normal GROUP BY

Edit:

The above assumes there are no gaps or overlaps in the data, and so ordering by start is sufficient.

If there are gaps or overlaps a slightly longer approach is more customisable…

  • whenever the previous row isn’t ‘the same city, but 1 day later’, record a 1 for the row to indicate a "new island".
  • cumulativel sum those values over time, creating unique identifiers for each "island"
  • GROUP BY as normal
WITH
  islands AS
(
  SELECT
    *,
    CASE
      WHEN city  = LAG(city) OVER (ORDER BY start)
       AND start = LAG(stop) OVER (ORDER BY start) + INTERVAL '1 DAY'
      THEN 0
      ELSE 1
    END
      AS island_marker
  FROM
    living
),
  island_ids AS
(
  SELECT
    *,
    SUM(island_marker) OVER (ORDER BY start) AS island_id
  FROM
    islands
)
SELECT
  MIN(city),
  MIN(start),
  MAX(stop)
FROM
  island_ids
GROUP BY
  island_id
ORDER BY
  island_id
Start Stop City island_marker island_id
2022-01-01 2022-02-15 Rom 1 1
2022-02-16 2022-03-31 Rom 0 1
2022-04-01 2022-05-10 London 1 2
2022-05-11 2022-06-11 London 0 2
2022-06-12 2022-07-10 Paris 1 3
2022-07-11 2022-08-10 Rom 1 4
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading