Grouping using row number function

July 15, 2022

I have been using the row_number() function to only select the observations that I need.
In my scenario, whenever there is two different name for a particular <id, entity_id, period, element>, the National one should be left-out. In case there is only one, take the only one.

+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period |    element    |           name            |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    |   2021 | Overall value | National Compatible - XYZ |
| 12 | ABC123    |   2021 | Overall value | Overall Estimation        |
+----+-----------+--------+---------------+---------------------------+

With cases like above, the following did the trick:

SELECT *
FROM   (SELECT *,
               Row_number()
                 OVER (
                   partition BY id, entity_id, period, element
                   ORDER BY NAME DESC) AS rn
        FROM   mydata) table
WHERE  table.rn = 1

Problem is that now there are other cases like the following:

+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period |    element    |           name            |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    |   2021 | Overall value | National Based - ZYX      |
| 12 | ABC123    |   2021 | Overall value | Base Estimation           |
+----+-----------+--------+---------------+---------------------------+

And with the current SQL this would not work as I would have to change the order by from descending to ascending.

Is there any possibility to de-prioritize the "National…" record and take the other one in case there are multiple ones?

I am running the query on Hive/Impala.

>Solution :

If you add another derived-table layer (or use a CTE) then you can add a CASE WHEN to check for "name" starting with 'National' and give it a simple integer "tag" value you can use to de-prioritize those rows.

…like so:

WITH q AS (
    SELECT
        "id",
        "entity_id",
        "period",
        "element",
        "name",
        CASE WHEN "name" LIKE 'National%' THEN 1 ELSE 2 END AS "tag"
    FROM
        mydata
),
filtered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY
                "id", "entity_id", "period", "element"
            ORDER BY
                "tag" DESC,
                "name" DESC
        ) AS rn
    FROM
        q
)
SELECT
    *
FROM
    filtered 
WHERE
    rn = 1