Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Spark partitioning of related data into row groups

With Apache Spark we can partition a dataframe into separate files when saving into Parquet format.

In the way Parquet files are written, each partition contains multiple row groups each of include column statistics pertaining to each group (e.g., min/max values, as well as number of NULL values).

Now, it would seem ideal in some situations to organize the Parquet file such that related data appears together in one or more row groups. This would be a secondary level of partitioning within each partition file (which constitutes the first level).

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

This is possible using for example pyarrow, but how can we do this with a distributed SQL engine such as Spark?

>Solution :

Besides partitioning you can order your data to group related data together in a limited set of partitions. Statement from Databricks:

Z-Ordering is a technique to colocate related information in the same
set of files

(
    df
    .write.option("header", True)
    .orderBy(df.col_1.desc())
    .partitionBy("col_2")
)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading