Databricks Spark SQL: fast query, slow at showing data

August 23, 2023

Using the following example on a large table:

pages = spark.sql('select * from table xx')

I found that the query runs in seconds, but as soon as I want to see the data with pages.show(n=10), it takes minutes to return even that small sample. What is happening under the hood to make it so slow?

The SQL command (spark.sql) takes < 1 second, but pages.show(n=10) takes minutes.

>Solution :

Spark uses lazy evaluation, so it won’t actually start executing the command (e.g. select * from table xx) until an ‘action’ is called (e.g. .show(), a write such as .write.save(), or display() in Databricks).

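The part that runs in <1 sec is only the planning step: Spark parses the query and checks that it can be executed (for example, that the table exists), but it does not read any data until an action is called. The minutes spent in pages.show(n=10) are the query actually running.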
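To make the split between building a query and running it concrete, here is a small plain-Python sketch. It is only an analogy, not Spark's implementation: the names LazyQuery, slow_table and reads are hypothetical, and take() stands in for an action such as show().

```python
from itertools import islice


class LazyQuery:
    """Toy stand-in for a DataFrame (hypothetical, not Spark's real classes)."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that yields rows when consumed
        self._steps = tuple(steps)   # recorded transformations, i.e. the "plan"

    def filter(self, predicate):
        # Transformation: only extends the plan; no row is read here.
        return LazyQuery(self._source, self._steps + (("filter", predicate),))

    def select(self, fn):
        # Transformation: same idea, records a per-row mapping.
        return LazyQuery(self._source, self._steps + (("select", fn),))

    def take(self, n):
        # Action: the source is finally read and the recorded plan is applied.
        rows = self._source()
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(islice(rows, n))


reads = []  # records when the "table" is actually scanned


def slow_table():
    # Hypothetical stand-in for a large table scan.
    reads.append("scan started")
    for i in range(1_000_000):
        yield {"id": i, "url": f"/page/{i}"}


# Building the query returns immediately, like spark.sql(...).
pages = LazyQuery(slow_table).filter(lambda r: r["id"] % 2 == 0).select(lambda r: r["url"])
scans_before_action = len(reads)   # 0: nothing has been read yet

# The action triggers the work, like pages.show(n=10).
first = pages.take(3)
scans_after_action = len(reads)    # 1: the scan ran only now

print(scans_before_action, scans_after_action, first)
```

In PySpark itself, calling pages.explain() prints the plan Spark has built for the DataFrame without running the query, which is a quick way to see what an action will have to do.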
