I have been using the row_number() function to only select the observations that I need.
In my scenario, whenever there is two different name for a particular <id, entity_id, period, element>, the National one should be left-out. In case there is only one, take the only one.
+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period | element | name |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123 | 2021 | Overall value | National Compatible - XYZ |
| 12 | ABC123 | 2021 | Overall value | Overall Estimation |
+----+-----------+--------+---------------+---------------------------+
With cases like above, the following did the trick:
SELECT *
FROM (SELECT *,
Row_number()
OVER (
partition BY id, entity_id, period, element
ORDER BY NAME DESC) AS rn
FROM mydata) table
WHERE table.rn = 1
Problem is that now there are other cases like the following:
+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period | element | name |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123 | 2021 | Overall value | National Based - ZYX |
| 12 | ABC123 | 2021 | Overall value | Base Estimation |
+----+-----------+--------+---------------+---------------------------+
And with the current SQL this would not work as I would have to change the order by from descending to ascending.
Is there any possibility to de-prioritize the "National…" record and take the other one in case there are multiple ones?
I am running the query on Hive/Impala.
>Solution :
If you add another derived-table layer (or use a CTE) then you can add a CASE WHEN to check for "name" starting with 'National' and give it a simple integer "tag" value you can use to de-prioritize those rows.
…like so:
WITH q AS (
SELECT
"id",
"entity_id",
"period",
"element",
"name",
CASE WHEN "name" LIKE 'National%' THEN 1 ELSE 2 END AS "tag"
FROM
mydata
),
filtered AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY
"id", "entity_id", "period", "element"
ORDER BY
"tag" DESC,
"name" DESC
) AS rn
FROM
q
)
SELECT
*
FROM
filtered
WHERE
rn = 1