I have a table in a database with a column "id", and each id has multiple rows.
I want to randomly pick 10 ids and then load all the rows with those ids.
The following works and gives me a table with 10 random ids. That's fine.
distinct_ids = spark.sql(f"""SELECT DISTINCT id FROM {database_name}.{orig_table_cl_name} ORDER BY RAND() LIMIT 10""")
But now I can't figure out how to do an inner join with the original table to get all the rows containing the right ids.
I tried:
distinct_ids = spark.sql(
f""" (SELECT DISTINCT vehicle_id
FROM {database_name}.{orig_table_name}
ORDER BY RAND() LIMIT 10)
AS table
INNER JOIN table
ON {database_name}.{orig_table_name}.id = table.id""")
This gives me the following error:
ParseException:
mismatched input 'AS' expecting {<EOF>, ';'}(line 1, pos 99)
== SQL ==
(SELECT DISTINCT vehicle_id FROM pnds12v_dev_core.t_cycle_log_car_v2_0 ORDER BY RAND() LIMIT 10) AS table
I tried a bunch of other orderings of the clauses, but I figure I'm missing a general understanding of how this works. And since I thought there are no semicolons when using spark.sql, the error message doesn't give me any hints.
Can someone explain to me how to do such a query?
>Solution :
Use a common table expression (CTE) to define a temporary result set, then join it back to the original table.
distinct_ids = spark.sql(
    f"""WITH sampled AS
        (SELECT DISTINCT vehicle_id AS id
         FROM {database_name}.{orig_table_name}
         ORDER BY RAND() LIMIT 10)
        SELECT t.*
        FROM {database_name}.{orig_table_name} AS t
        INNER JOIN sampled
        ON t.id = sampled.id""")
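To see the CTE-plus-join pattern in isolation, here is a minimal standalone sketch using Python's built-in sqlite3 instead of Spark (the table name, column names, and sample data are made up for illustration; SQLite uses RANDOM() where Spark SQL uses RAND(), but the query shape is the same):

```python
import sqlite3

# In-memory toy table: several rows per id, mimicking the original setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cycle_log (id INTEGER, value TEXT);
INSERT INTO cycle_log VALUES
  (1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (3, 'e'), (4, 'f');
""")

# CTE picks 2 random distinct ids; the join then fetches every row
# belonging to those ids from the original table.
rows = conn.execute("""
WITH sampled AS (
    SELECT DISTINCT id FROM cycle_log ORDER BY RANDOM() LIMIT 2
)
SELECT t.*
FROM cycle_log AS t
INNER JOIN sampled ON t.id = sampled.id
""").fetchall()

print(rows)
```

Note that the CTE is named `sampled` rather than `table`: `table` is a reserved keyword in many SQL dialects, so avoiding it keeps the query portable.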