I have 2 tables.
tab1:
| dates |
|---|
| 2021-09-30 |
| 2021-10-01 |
| 2021-10-02 |
| 2021-10-04 |
| 2021-11-15 |
buckets:
| bucket_dates |
|---|
| 2021-10-01 |
| 2021-10-02 |
| 2021-10-03 |
| 2021-11-03 |
I want to join these two tables to get the final result below. If there is a matching date, it should match directly; if there is no matching date, it should take the next highest date from the tab1 table (e.g. for the 2021-10-03 and 2021-11-03 bucket dates).
Result table:
| bucket_dates | final_dates |
|---|---|
| 2021-10-01 | 2021-10-01 |
| 2021-10-02 | 2021-10-02 |
| 2021-10-03 | 2021-10-04 |
| 2021-11-03 | 2021-11-15 |
I tried to do this with a join query:
    select a.bucket_dates,
           case when b.dates is null
                then (select min(c.dates) from tab1 c where c.dates > a.bucket_dates)
                else b.dates
           end as final_dates
    from buckets a
    left join tab1 b on a.bucket_dates = b.dates;
but this query gives the error below:

`Correlated column is not allowed in a non-equality predicate`
Any suggestion would be really helpful.
> Solution:
If pyspark won't allow `>` or `>=` in a correlated sub-query, just move the `MIN()` into the main query and put the inequality in the join condition instead:
    SELECT b.bucket_dates,
           MIN(t.dates) AS final_dates
    FROM buckets AS b
    LEFT JOIN tab1 AS t
        ON t.dates >= b.bucket_dates
    GROUP BY b.bucket_dates;
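As a quick way to sanity-check the join logic, here is a minimal sketch that runs the same query against an in-memory SQLite database (the question is about Spark SQL, but the inequality-join-plus-`MIN()` pattern works the same way here, since ISO-8601 date strings compare correctly as text):

```python
import sqlite3

# Build the two sample tables from the question in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab1 (dates TEXT);
    INSERT INTO tab1 VALUES
        ('2021-09-30'), ('2021-10-01'), ('2021-10-02'),
        ('2021-10-04'), ('2021-11-15');
    CREATE TABLE buckets (bucket_dates TEXT);
    INSERT INTO buckets VALUES
        ('2021-10-01'), ('2021-10-02'), ('2021-10-03'), ('2021-11-03');
""")

# Inequality join: each bucket date pairs with every tab1 date on or
# after it, then MIN() keeps the earliest such date per bucket.
rows = conn.execute("""
    SELECT b.bucket_dates, MIN(t.dates) AS final_dates
    FROM buckets AS b
    LEFT JOIN tab1 AS t ON t.dates >= b.bucket_dates
    GROUP BY b.bucket_dates
    ORDER BY b.bucket_dates
""").fetchall()

for bucket, final in rows:
    print(bucket, final)
# 2021-10-01 2021-10-01
# 2021-10-02 2021-10-02
# 2021-10-03 2021-10-04
# 2021-11-03 2021-11-15
```

One thing to keep in mind: because of the `LEFT JOIN`, a bucket date later than every date in tab1 would still appear in the result, with `final_dates` as NULL.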