Spark partitioning of related data into row groups

January 14, 2022

With Apache Spark we can partition a dataframe into separate files when saving into Parquet format.

In the way Parquet files are written, each partition contains multiple row groups each of include column statistics pertaining to each group (e.g., min/max values, as well as number of NULL values).

Now, it would seem ideal in some situations to organize the Parquet file such that related data appears together in one or more row groups. This would be a secondary level of partitioning within each partition file (which constitutes the first level).

This is possible using for example pyarrow, but how can we do this with a distributed SQL engine such as Spark?

>Solution :

Besides partitioning you can order your data to group related data together in a limited set of partitions. Statement from Databricks:

Z-Ordering is a technique to colocate related information in the same
set of files

(
    df
    .write.option("header", True)
    .orderBy(df.col_1.desc())
    .partitionBy("col_2")
)

parquet

byMR

Published January 14, 2022

Add a comment

Hide/show toggle inside password input

byMR

January 14, 2022

Questions

Calculate how many items within a range

byMR

January 14, 2022

Questions

Missing return type on function useAppDispatch

byMR

January 14, 2022

Questions

select from table of varchar2

byMR

January 14, 2022

Questions

Python : How to use lambda function to return a specific value?

byMR

January 14, 2022

Questions

Flutter – How to handle Non-Nullable Instance of widget properties

byMR

January 14, 2022

Spark partitioning of related data into row groups