I have 2 tables.
tab1:
| dates |
|---|
| 2021-09-30 |
| 2021-10-01 |
| 2021-10-02 |
| 2021-10-04 |
| 2021-11-15 |
buckets:
| bucket_dates |
|---|
| 2021-10-01 |
| 2021-10-02 |
| 2021-10-03 |
| 2021-11-03 |
I want to join these two tables to get the final result below. If there is a matching date, it should match directly; if there is no matching date, it should take the next highest date from the tab1 table (e.g. for the 2021-10-03 and 2021-11-03 bucket dates).
Result table:
| bucket_dates | final_dates |
|---|---|
| 2021-10-01 | 2021-10-01 |
| 2021-10-02 | 2021-10-02 |
| 2021-10-03 | 2021-10-04 |
| 2021-11-03 | 2021-11-15 |
I tried to do this with a join query:
    select a.bucket_dates,
           case when b.dates is null
                then (select min(c.dates) from tab1 c where c.dates > a.bucket_dates)
                else b.dates
           end as final_dates
    from buckets a
    left join tab1 b on a.bucket_dates = b.dates;
but this query gives the error below:

`Correlated column is not allowed in a non-equality predicate`
Any suggestion would be really helpful.
> Solution:
If pyspark won't allow `>` or `>=` in a correlated sub-query, just move the `MIN()` into the main query and put the inequality in the join condition instead:
    SELECT b.bucket_dates,
           MIN(t.dates) AS final_dates
    FROM buckets AS b
    LEFT JOIN tab1 AS t
        ON t.dates >= b.bucket_dates
    GROUP BY b.bucket_dates;
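As a quick way to sanity-check the join logic, here is a minimal sketch that runs the same query against an in-memory SQLite database (the question is about Spark SQL, but the inequality-join-plus-`MIN()` pattern works the same way here, since ISO-8601 date strings compare correctly as text):

```python
import sqlite3

# Build the two sample tables from the question in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab1 (dates TEXT);
    INSERT INTO tab1 VALUES
        ('2021-09-30'), ('2021-10-01'), ('2021-10-02'),
        ('2021-10-04'), ('2021-11-15');
    CREATE TABLE buckets (bucket_dates TEXT);
    INSERT INTO buckets VALUES
        ('2021-10-01'), ('2021-10-02'), ('2021-10-03'), ('2021-11-03');
""")

# Inequality join: each bucket date pairs with every tab1 date on or
# after it, then MIN() keeps the earliest such date per bucket.
rows = conn.execute("""
    SELECT b.bucket_dates, MIN(t.dates) AS final_dates
    FROM buckets AS b
    LEFT JOIN tab1 AS t ON t.dates >= b.bucket_dates
    GROUP BY b.bucket_dates
    ORDER BY b.bucket_dates
""").fetchall()

for bucket, final in rows:
    print(bucket, final)
# 2021-10-01 2021-10-01
# 2021-10-02 2021-10-02
# 2021-10-03 2021-10-04
# 2021-11-03 2021-11-15
```

One thing to keep in mind: because of the `LEFT JOIN`, a bucket date later than every date in tab1 would still appear in the result, with `final_dates` as NULL.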