Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

SQL count distinct over partition by cumulatively

I am using AWS Athena (Presto based) and I have this table named base:

id category year month
1 a 2021 6
1 b 2022 8
1 a 2022 11
2 a 2022 1
2 a 2022 4
2 b 2022 6

I would like a query that counts the distinct values of the categories per id cumulatively per month and year but retaining the original columns, so I would like:

id category year month sumC
1 a 2021 6 1
1 b 2022 8 2
1 a 2022 11 2
2 a 2022 1 1
2 a 2022 4 1
2 b 2022 6 2

I’ve tried doing the following but they aren’t quite right:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

SELECT id, category, year, month, COUNT(category) 
OVER (PARTITION BY id, ORDER BY year, month) AS sumC FROM base;

This results in 1, 2, 3, 1, 2, 3 which is not what I want; what I really need is count(distinct) however, that’s not supported in window function; so I also tried the dense_rank trick:

DENSE_RANK() OVER (PARTITION BY id ORDER BY category) 
+ DENSE_RANK() OVER (PARTITION BY id ORDER BY category) 
- 1 as sumC

but because there is no ordering of year and month, then it just results in 2, 2, 2, 2, 2, 2

appreciate any help, thanks!!

>Solution :

One option is

  • creating a new column that will contain when each "category" is seen for the first time (partitioning on "id", "category" and ordering on "year", "month")
  • computing a running sum over this column, with the same partition
WITH cte AS (
    SELECT *, 
           CASE WHEN ROW_NUMBER() OVER(
                         PARTITION BY id, category
                         ORDER     BY year, month) = 1
                THEN 1 
                ELSE 0 
           END AS rn1
    FROM base
    ORDER BY id, 
             year_, 
             month_
)
SELECT id,
       category,
       year_,
       month_,
       SUM(rn1) OVER(
            PARTITION BY id
            ORDER     BY year, month 
       ) AS sumC
FROM cte

Does it work for you?

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading