Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Grouping using row number function

I have been using the row_number() function to only select the observations that I need.
In my scenario, whenever there is two different name for a particular <id, entity_id, period, element>, the National one should be left-out. In case there is only one, take the only one.

+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period |    element    |           name            |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    |   2021 | Overall value | National Compatible - XYZ |
| 12 | ABC123    |   2021 | Overall value | Overall Estimation        |
+----+-----------+--------+---------------+---------------------------+

With cases like above, the following did the trick:

SELECT *
FROM   (SELECT *,
               Row_number()
                 OVER (
                   partition BY id, entity_id, period, element
                   ORDER BY NAME DESC) AS rn
        FROM   mydata) table
WHERE  table.rn = 1 

Problem is that now there are other cases like the following:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period |    element    |           name            |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    |   2021 | Overall value | National Based - ZYX      |
| 12 | ABC123    |   2021 | Overall value | Base Estimation           |
+----+-----------+--------+---------------+---------------------------+

And with the current SQL this would not work as I would have to change the order by from descending to ascending.

Is there any possibility to de-prioritize the "National…" record and take the other one in case there are multiple ones?

I am running the query on Hive/Impala.

>Solution :

If you add another derived-table layer (or use a CTE) then you can add a CASE WHEN to check for "name" starting with 'National' and give it a simple integer "tag" value you can use to de-prioritize those rows.

…like so:

WITH q AS (
    SELECT
        "id",
        "entity_id",
        "period",
        "element",
        "name",
        CASE WHEN "name" LIKE 'National%' THEN 1 ELSE 2 END AS "tag"
    FROM
        mydata
),
filtered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY
                "id", "entity_id", "period", "element"
            ORDER BY
                "tag" DESC,
                "name" DESC
        ) AS rn
    FROM
        q
)
SELECT
    *
FROM
    filtered 
WHERE
    rn = 1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading