how to get 1st column value using partition

March 10, 2023

I have a table looks like this

CREATE TABLE myTable
(
    userid      text,
    webid         text,
    "timestamp" timestamp
);

INSERT INTO myTable
    ("userid", "webid", "timestamp")
VALUES ('A', '34', '2023-01-31 17:34:49.000'),
       ('A', '73', '2023-01-31 17:34:50.000'),
       ('A', '97', '2023-01-31 17:34:58.000'),
       ('A', '17', '2023-01-31 17:35:02.000'),
       ('A', '17', '2023-01-31 17:35:07.000'),
       ('A', '17', '2023-01-31 17:35:18.000'),
       ('A', '17', '2023-01-31 17:35:30.000'),
       ('A', '1', '2023-01-31 17:35:37.000'),
       ('A', '1', '2023-01-31 17:35:38.000'),
       ('A', '77', '2023-01-31 17:35:41.000'),
       ('A', '77', '2023-01-31 17:35:42.000'),
       ('A', '1', '2023-01-31 17:37:10.000'),
       ('A', '1', '2023-01-31 17:37:12.000'),
       ('A', '77', '2023-01-31 17:37:14.000'),
       ('A', '77', '2023-01-31 17:37:15.000'),
       ('B', '33', '2023-01-31 06:37:15.000'),
       ('B', '56', '2023-01-31 06:40:15.000')
  ;

userid	webid	timestamp
A	34	2023-01-31 17:34:49
A	73	2023-01-31 17:34:50
A	97	2023-01-31 17:34:58
A	17	2023-01-31 17:35:02
A	17	2023-01-31 17:35:07
A	17	2023-01-31 17:35:18
A	17	2023-01-31 17:35:30
A	1	2023-01-31 17:35:37
A	1	2023-01-31 17:35:38
A	77	2023-01-31 17:35:41
A	77	2023-01-31 17:35:42
A	1	2023-01-31 17:37:10
A	1	2023-01-31 17:37:12
A	77	2023-01-31 17:37:14
A	77	2023-01-31 17:37:15
B	33	2023-01-31 06:37:15
B	56	2023-01-31 06:40:15

The correct output I want is this:

userid	webid	timestamp	first_ts
A	34	2023-01-31 17:34:49	2023-01-31 17:34:49
A	73	2023-01-31 17:34:50	2023-01-31 17:34:50
A	97	2023-01-31 17:34:58	2023-01-31 17:34:58
A	17	2023-01-31 17:35:02	2023-01-31 17:35:02
A	17	2023-01-31 17:35:07	2023-01-31 17:35:02
A	17	2023-01-31 17:35:18	2023-01-31 17:35:02
A	17	2023-01-31 17:35:30	2023-01-31 17:35:02
A	1	2023-01-31 17:35:37	2023-01-31 17:35:37
A	1	2023-01-31 17:35:38	2023-01-31 17:35:37
A	77	2023-01-31 17:35:41	2023-01-31 17:35:41
A	77	2023-01-31 17:35:42	2023-01-31 17:35:41
A	1	2023-01-31 17:37:10	2023-01-31 17:37:10
A	1	2023-01-31 17:37:12	2023-01-31 17:37:10
A	77	2023-01-31 17:37:14	2023-01-31 17:37:14
A	77	2023-01-31 17:37:15	2023-01-31 17:37:14
B	33	2023-01-31 06:37:15	2023-01-31 06:37:15
B	56	2023-01-31 06:40:15	2023-01-31 06:40:15

first_ts means first timestamp. If a userid visits consecutive webid, their first_ts should be the first timestamp of this webid. If the userid visits webid let’s say webid = 77 and jumps to another webid = 1 and then jumps back to webid = 77, the first_ts should be refreshed, it won’t be the very first one. I used following code to do that, but unable to get correct first_ts for userid who jumps back to previous webid.

select userid, webid, 
lag(webid,1) over(partition by userid order by timestamp desc) as next_webid,
lag(webid,1) over(partition by userid order by timestamp asc) as previous_webid,
timestamp,
case when webid = next_webid then first_value(timestamp) over(partition by userid, webid order by timestamp) 
     when webid = next_webid and webid = previous_webid then first_value(timestamp) over(partition by userid, webid order by timestamp) 
     when webid = previous_webid then first_value(timestamp) over(partition by userid, webid order by timestamp) 
     else timestamp
end as first_ts
from myTable 
order by userid, timestamp;

this wrong query will return output

userid	webid	timestamp	first_ts
A	34	2023-01-31 17:34:49	2023-01-31 17:34:49
A	73	2023-01-31 17:34:50	2023-01-31 17:34:50
A	97	2023-01-31 17:34:58	2023-01-31 17:34:58
A	17	2023-01-31 17:35:02	2023-01-31 17:35:02
A	17	2023-01-31 17:35:07	2023-01-31 17:35:02
A	17	2023-01-31 17:35:18	2023-01-31 17:35:02
A	17	2023-01-31 17:35:30	2023-01-31 17:35:02
A	1	2023-01-31 17:35:37	2023-01-31 17:35:37
A	1	2023-01-31 17:35:38	2023-01-31 17:35:37
A	77	2023-01-31 17:35:41	2023-01-31 17:35:41
A	77	2023-01-31 17:35:42	2023-01-31 17:35:41
A	1	2023-01-31 17:37:10	2023-01-31 17:35:37
A	1	2023-01-31 17:37:12	2023-01-31 17:35:37
A	77	2023-01-31 17:37:14	2023-01-31 17:35:41
A	77	2023-01-31 17:37:15	2023-01-31 17:35:41
B	33	2023-01-31 06:37:15	2023-01-31 06:37:15
B	56	2023-01-31 06:40:15	2023-01-31 06:40:15

in row 12, 13, 14, 15, the first_ts is wrong.
Anyone knows how to modify the query to get correct first_ts? Thanks!

>Solution :

This is a gaps and islands problem. One solution uses the difference in row numbers method:

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY timestamp) rn1,
              ROW_NUMBER() OVER (PARTITION BY userid, webid ORDER BY timestamp) rn2
    FROM myTable
)

SELECT userid, webid, timestamp,
       MIN(timestamp) OVER (PARTITION BY userid, webid, rn1 - rn2) AS first_ts
FROM cte
ORDER BY userid, timestamp;

The idea here is to use the row number difference to form a pseudo group, which is defined by the userid, webid, and rn1 - rn2. We then take the minimum timestamp for each such group/island to find the first_ts. To get a better understanding of how difference in row numbers works, just run SELECT * FROM cte and check the results.