I have a table in a database with a column "id", and each id has multiple rows.
I want to randomly pick 10 ids and then load all the rows with those ids.
The following works and gives me a table with 10 random ids. That's fine.
distinct_ids = spark.sql(f"""SELECT DISTINCT id FROM {database_name}.{orig_table_cl_name} ORDER BY RAND() LIMIT 10""")
But now I can't figure out how to do an inner join with the original table to get all the rows containing the right ids.
I tried:
distinct_ids = spark.sql(
f""" (SELECT DISTINCT vehicle_id
FROM {database_name}.{orig_table_name}
ORDER BY RAND() LIMIT 10)
AS table
INNER JOIN table
ON {database_name}.{orig_table_name}.id = table.id""")
This gives me the following error:
ParseException:
mismatched input 'AS' expecting {<EOF>, ';'}(line 1, pos 99)
== SQL ==
(SELECT DISTINCT vehicle_id FROM pnds12v_dev_core.t_cycle_log_car_v2_0 ORDER BY RAND() LIMIT 10) AS table
I tried a bunch of other orderings of the clauses, but I figure I'm missing a general understanding of how this works. And since I thought there are no semicolons when using spark.sql, the error message doesn't give me any hints.
Can someone explain to me how to do such a query?
>Solution :
Use a common table expression (CTE) to define a temporary result set, then join it back to the original table.
distinct_ids = spark.sql(
    f"""WITH sampled AS
        (SELECT DISTINCT vehicle_id AS id
         FROM {database_name}.{orig_table_name}
         ORDER BY RAND() LIMIT 10)
        SELECT t.*
        FROM {database_name}.{orig_table_name} AS t
        INNER JOIN sampled
        ON t.id = sampled.id""")
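To see the CTE-plus-join pattern in isolation, here is a minimal standalone sketch using Python's built-in sqlite3 instead of Spark (the table name, column names, and sample data are made up for illustration; SQLite uses RANDOM() where Spark SQL uses RAND(), but the query shape is the same):

```python
import sqlite3

# In-memory toy table: several rows per id, mimicking the original setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cycle_log (id INTEGER, value TEXT);
INSERT INTO cycle_log VALUES
  (1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (3, 'e'), (4, 'f');
""")

# CTE picks 2 random distinct ids; the join then fetches every row
# belonging to those ids from the original table.
rows = conn.execute("""
WITH sampled AS (
    SELECT DISTINCT id FROM cycle_log ORDER BY RANDOM() LIMIT 2
)
SELECT t.*
FROM cycle_log AS t
INNER JOIN sampled ON t.id = sampled.id
""").fetchall()

print(rows)
```

Note that the CTE is named `sampled` rather than `table`: `table` is a reserved keyword in many SQL dialects, so avoiding it keeps the query portable.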