First of all, I’m using BigQuery. In my table I have some duplicated rows due to a change in one of the columns (called fmid). When this happens, I have to get the latest record, which is defined by the time (called created_date). The only thing these duplicated records have in common is the id (called tid).
I have tried two different options to set which row is the first and which one is the latest, so then I could remove what I want.
First one:
SELECT *, row_number() OVER( PARTITION BY tid) as number
FROM (SELECT created_date,tid, fmid
FROM table
ORDER BY tid,created_date)
ORDER BY tid, created_date
Results:
Sometimes it’s correct, but the order of the id 296644550 is not right.
1 2023-02-03 22:11:07 UTC 233330088 4 1
2 2023-02-04 10:07:06 UTC 233330088 18 2
3 2023-02-11 11:06:42 UTC 284447799 4 1
4 2023-02-11 11:22:13 UTC 284447799 18 2
5 2023-02-10 10:39:23 UTC 296644550 4 2
6 2023-02-10 19:32:08 UTC 296644550 18 1
Second one:
SELECT *, row_number() OVER( PARTITION BY tid,created_date) as number
FROM (SELECT created_date,tid, fmid
FROM table
ORDER BY tid,created_date)
ORDER BY tid, created_date
Results:
The column "number" gets the same result.
Row created_date tid fmid number
1 2023-02-03 22:11:07 UTC 233330088 4 1
2 2023-02-04 10:07:06 UTC 233330088 18 1
3 2023-02-11 11:06:42 UTC 284447799 4 1
4 2023-02-11 11:22:13 UTC 284447799 18 1
5 2023-02-10 10:39:23 UTC 296644550 4 1
6 2023-02-10 19:32:08 UTC 296644550 18 1
Any idea of what’s wrong?
>Solution :
the sorting has to happen in the window of row_number, so that you get the correct number
SELECT created_date,tid, fmid
FROM (SELECT created_date,tid, fmid
, row_number() OVER( PARTITION BY tid ORDER BY created_date) as number
FROM table
)
WHERE number = 2
ORDER BY tid
or as CTE
WITH CTE AS (
SELECT created_date,tid, fmid
, row_number() OVER( PARTITION BY tid ORDER BY created_date) as number
FROM table )
SELECT created_date,tid, fmid
FROM CTE
WHERE number = 2
ORDER BY tid