
Why is metadata written at the end of the file in Apache Parquet?

Why does Apache Parquet write metadata at the end of the file instead of at the beginning?

In the official Apache Parquet documentation, I found the statement "Metadata is written after the data to allow for single pass writing." Is the metadata written at the end to ensure the integrity of the file? I don’t understand what this sentence really means. If someone could explain it to me, I’d appreciate it.


Solution:

I think the main reason is so that you can write larger-than-memory data to a single file in one pass.

The metadata contains information about the schema of the data (the types of the columns) and its shape (the number of row groups and the size of each row group).

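So in order to generate the metadata, you need to know what the data is made of, which means the writer must already have seen all of it. If the metadata came first, the writer would have to hold or scan the whole dataset before writing anything, which is a problem if your data doesn’t fit in memory.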

With the metadata at the end, you can split your data into manageable row groups (each of which fits in memory), append them to the file one by one while keeping track of the metadata, and write the metadata once the last row group is done.
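To make the idea concrete, here is a minimal sketch of a toy footer-based format using only the standard library. It is not Parquet's actual encoding: the `TOY1` magic bytes, the JSON footer, and the function names are all made up for illustration. What it borrows from Parquet is the overall layout, where the file ends with the metadata, then a 4-byte little-endian metadata length, then the magic bytes.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical magic bytes for this toy format


def write_single_pass(sink, row_groups):
    """Write row groups one at a time, then a footer describing them."""
    sink.write(MAGIC)
    offset = len(MAGIC)
    groups_meta = []
    for rows in row_groups:  # may be a generator: only one group is in memory
        payload = json.dumps(rows).encode("utf-8")
        sink.write(payload)
        groups_meta.append(
            {"offset": offset, "length": len(payload), "num_rows": len(rows)}
        )
        offset += len(payload)
    # Only now do we know the number and size of all row groups.
    footer = json.dumps({"row_groups": groups_meta}).encode("utf-8")
    sink.write(footer)
    sink.write(struct.pack("<I", len(footer)))  # footer length, little-endian
    sink.write(MAGIC)


def read_footer(source):
    """Locate the footer by reading the fixed-size tail of the file."""
    source.seek(-8, io.SEEK_END)
    footer_len, magic = struct.unpack("<I4s", source.read(8))
    if magic != MAGIC:
        raise ValueError("not a TOY1 file")
    source.seek(-(8 + footer_len), io.SEEK_END)
    return json.loads(source.read(footer_len))


def read_row_group(source, meta, index):
    """Jump straight to one row group using the offsets in the footer."""
    group = meta["row_groups"][index]
    source.seek(group["offset"])
    return json.loads(source.read(group["length"]))


buf = io.BytesIO()
write_single_pass(buf, ([i] * 10 for i in range(3)))
meta = read_footer(buf)
last_group = read_row_group(buf, meta, 2)
```

The writer never seeks backwards, so it works on append-only sinks. The reader pays for that with a seek to the end of the file before it can find anything else.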

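import pyarrow as pa
import pyarrow.parquet as pq

# The schema must be known up front; the row count and row group sizes need not be.
schema = pa.schema([pa.field("col1", pa.int32())])

# Each write appends a row group; the metadata footer is written when the writer closes.
with pq.ParquetWriter("table.parquet", schema=schema) as writer:
    for i in range(0, 10):
        # Only this small table has to fit in memory at any one time.
        writer.write(pa.table({"col1": [i] * 10}, schema=schema))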

If you’re looking for an alternative where the data can be streamed, with the schema written at the beginning, you should look at the Arrow IPC streaming format.
