Using the following example on a large table:
pages = spark.sql('select * from table xx')
I found that the query runs in seconds, but as soon as I try to look at the data with pages.show(n=10), it takes minutes to return a sample of that data. What is happening under the hood that makes it so slow?
The SQL command (spark.sql) takes < 1 second, but pages.show(n=10) takes minutes.
>Solution :
Spark uses lazy evaluation, so it won’t actually start executing the command (e.g. select * from table xx) until an ‘action’ is called (e.g. .show(), a write via .write, or display() in Databricks).
The part that runs in < 1 sec is only the validation step: Spark checks that the command can be executed and builds a plan for it, but does not actually execute it until an action is called.
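To make the split between defining a query and running it concrete, here is a toy sketch in plain Python. It is not Spark's implementation: LazyTable, slow_scan, and scan_calls are hypothetical names made up for illustration, and the short sleep merely stands in for an expensive table scan.

```python
import time
from itertools import islice


class LazyTable:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, source, plan=None):
        self.source = source      # callable that performs the (expensive) scan
        self.plan = plan or []    # recorded filters, not yet executed

    def where(self, predicate):
        # "Transformation": returns a new plan immediately and touches no data.
        return LazyTable(self.source, self.plan + [predicate])

    def show(self, n=10):
        # "Action": only now is the source scanned and the recorded plan applied.
        rows = self.source()
        for predicate in self.plan:
            rows = filter(predicate, rows)
        return list(islice(rows, n))


scan_calls = []  # hypothetical counter: one entry per scan of the source


def slow_scan():
    # Hypothetical expensive read; the sleep is an assumption for illustration.
    scan_calls.append(1)
    time.sleep(0.2)
    return range(1_000)


# Defining the query is instant: nothing has been scanned yet.
pages = LazyTable(slow_scan).where(lambda r: r % 2 == 0)
scans_before_show = len(scan_calls)

# Calling the action is what pays for the scan.
sample = pages.show(n=10)
scans_after_show = len(scan_calls)
```

In this sketch, building pages costs nothing because it only records the filter, while show(n=10) is the call that triggers the scan. That mirrors the timings in the question: the fast step defines the work, and the slow step performs it.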
Related reads on transformations vs. actions in Spark:
- https://www.linkedin.com/pulse/spark-transformations-actions-lazy-evaluation-mohammad-younus-jameel/
- Spark Transformation – Why is it lazy and what is the advantage?