- ⚠️ ZSTD's `content_size` field is optional and is often absent when streaming or using default compression settings.
- 💡 python-zstandard reports `-1` for `content_size` when the size information is missing.
- 🛠️ Turn on `write_content_size=True` in python-zstandard. This adds up to 8 bytes per frame but guarantees the size is recorded.
- 📊 Files made of many compressed frames make the total uncompressed size hard to determine.
- 🔍 You often have to inspect frames by hand, or decompress the file, to learn the real content size.
Zstandard (ZSTD) is known for fast, high-ratio compression in modern data systems. Developers using python-zstandard, a popular binding for ZSTD, often struggle to recover the original data size via the `content_size` attribute: it frequently reports unknown for multi-frame files or streamed data. This article explains why that happens, how ZSTD stores size metadata, and what you can do to handle content size reliably in your Python programs.
Understanding ZSTD Compression and content_size Metadata
Zstandard (ZSTD) is a fast, modern, lossless compression format developed at Facebook to deliver high compression ratios at high speed. One of ZSTD's defining features is its frame-based layout, which makes the format flexible but also makes size metadata harder to manage.
What is a Frame?
In ZSTD, a frame is the smallest self-contained unit of compressed data. Each frame has:

- A header
- Optional size metadata (the `content_size` field)
- One or more compressed blocks
- An optional checksum
ZSTD can work with just one frame or put many frames together into one file or stream. When you compress data in small parts, like each line of a log file, you usually end up with multiframe files.
Role of content_size
The `content_size` field in the ZSTD frame header, when present, records the exact size of that frame's original, uncompressed data. This is very helpful for:
- Setting aside memory before decompressing
- Checking if data is good
- Watching how decompressing big files is going
- Making stream processing work better
But this metadata is optional. The ZSTD frame format specification allows `content_size` to be omitted, saving up to 8 bytes per frame header, and that adds up quickly in streams composed of many frames.
Why content_size Can Be Unknown
If you use python-zstandard and see that `.content_size` returns `-1` (`zstandard.CONTENTSIZE_UNKNOWN`), you have not hit a bug. It simply reflects how the data was compressed.
Reasons content_size Might Be Omitted:
- Optional by design: ZSTD writes `content_size` only when the compressor knows the final input size and is configured to record it.
- Streaming-mode compression: when the total size is unknown during compression (common for streamed data), the field is omitted. This saves space and keeps the format flexible.
- Multi-frame files: if any single frame omits its size, you cannot derive the total from the headers alone.
- Command-line tools: the `zstd` CLI omits `content_size` when the input size is unknown (for example, piped input); the `--content-size` flag forces it where possible.
- Library settings: in python-zstandard, one-shot `compress()` records the size by default, but streamed compression cannot unless you declare the size up front.
Consequences of Missing content_size:
- Memory requirements for decompression cannot be estimated up front
- Progress bars and logs may be inaccurate during processing
- Size-based validation may be wrong or impossible
- Extra work is needed to determine the full uncompressed size

Knowing when and why `content_size` may be absent saves you from debugging behavior that is not actually a bug.
Multiframe Files and the content_size Challenge
When you use single-frame ZSTD files that have content_size, you can read the size using tools or Python code that checks the header. Everything works fine.
But when you work with multiframe files, things are harder.
Each frame is treated as its own compressed part. They are often written one by one in data streams and then joined together. Here is what you will face:
- Inconsistent metadata: one frame may record its size while another does not.
- No total size field: ZSTD stores no whole-file uncompressed size, so no single header reveals the overall size.
- You must compute it: either decompress the data, or inspect each frame individually and sum the sizes that are known.
This makes preallocating memory difficult, or in some cases impossible, especially on the server side or when working with very large files.
How Python-zstandard Handles content_size
The python-zstandard library links Python to the ZSTD C API. It gives you a simple way to use both compression and decompression.
Reading with Python-Zstandard
For normal decompression, you can use two main ways:
1. Decompress Entire Payload
```python
import zstandard

dctx = zstandard.ZstdDecompressor()
# One-shot decompression; assumes compressed_data holds a single frame.
output = dctx.decompress(compressed_data)
```
With this approach:

- Works well for a single frame
- Returns all decompressed bytes
- The content size can be recovered afterwards with `len(output)`
If `compressed_data` has `content_size` recorded, you can also query it directly (note that `get_frame_parameters` is a module-level function):

```python
frame_params = zstandard.get_frame_parameters(compressed_data)
print(frame_params.content_size)
```
2. Streamed Decompression
```python
import zstandard

with open("multi.zst", "rb") as f:
    dctx = zstandard.ZstdDecompressor()
    # read_across_frames=True lets the reader continue past frame boundaries.
    reader = dctx.stream_reader(f, read_across_frames=True)
    ...
```
For multi-frame files or large data pipelines, this approach:

- Uses memory efficiently
- Does not fail on large inputs
- Offers no trustworthy `.content_size` unless every frame records it
get_frame_parameters() for Inspecting Individual Frames
This function (`zstandard.get_frame_parameters()`, exposed at module level) returns a `FrameParameters` object with:

- `content_size`
- `has_checksum` (whether a checksum is present)
- `window_size`
- `dict_id` (the dictionary ID; `0` when no dictionary was used)

But keep in mind: `content_size` comes back as `zstandard.CONTENTSIZE_UNKNOWN` (`-1`) for any frame that omitted it during compression.
Common Developer Pitfalls
Many Python developers unfamiliar with ZSTD's internals make the same mistakes:

- 🚫 Assuming the content size is always there: it is not. Without `write_content_size=True` and a known input size, the field is absent.
- ⚠️ Treating `-1` as an error: it simply means "unknown," not a failure.
- 📦 Ignoring frame structure: in multi-frame files, treating the data as one monolithic blob breaks down when size metadata is sparse.

Avoiding these assumptions makes your programs more robust and less likely to break.
Solutions and Workarounds
Here are real ways for getting around or fixing content_size issues with python-zstandard:
✅ 1. Use write_content_size=True During Compression
Make sure this flag is on if keeping track of size is important.
```python
import zstandard

data = b"your data here"

# Explicitly record the uncompressed size in the frame header.
cctx = zstandard.ZstdCompressor(write_content_size=True)
compressed_data = cctx.compress(data)
```
This puts the uncompressed size into the frame's size info. Then it is easy to get later.
📏 2. Compute Size Manually for Decompression
When reading a file stream:
```python
import zstandard

with open("data.zst", "rb") as f:
    dctx = zstandard.ZstdDecompressor()
    # read_across_frames=True is needed if the file holds multiple frames.
    reader = dctx.stream_reader(f, read_across_frames=True)
    total_size = 0
    while True:
        chunk = reader.read(1024 * 16)
        if not chunk:
            break
        total_size += len(chunk)
```
This method yields the correct size even when the size metadata is absent.
🔍 3. Parse Frame Headers Individually
Scan frames one by one and add up known content_size values:
```python
import zstandard

with open("multi.zst", "rb") as f:
    while True:
        chunk = f.read(1024)
        if not chunk:
            break
        try:
            # Parsing succeeds only when the chunk begins at a frame boundary.
            params = zstandard.get_frame_parameters(chunk)
            print("Frame content size:", params.content_size)
        except zstandard.ZstdError:
            continue
```

This approach is imperfect: a chunk only parses when it happens to start at a frame boundary, and large frames span many reads. It works best in pipelines that already track frame boundaries themselves.
Demonstrating with Code Examples
Example 1: Writing ZSTD with content_size
```python
import zstandard

data = b"example data"
cctx = zstandard.ZstdCompressor(write_content_size=True)
compressed = cctx.compress(data)
```
Example 2: Manually Summing Sizes from Stream
```python
import zstandard

with open("multi.zst", "rb") as f:
    dctx = zstandard.ZstdDecompressor()
    # Read across frame boundaries so every frame is counted.
    reader = dctx.stream_reader(f, read_across_frames=True)
    total_size = 0
    while True:
        chunk = reader.read(16384)
        if not chunk:
            break
        total_size += len(chunk)

print("Total uncompressed size:", total_size)
```
Example 3: Conditional Decompression Based on content_size
```python
import io
import zstandard

dctx = zstandard.ZstdDecompressor()
frame_params = zstandard.get_frame_parameters(compressed_data)

if frame_params.content_size == zstandard.CONTENTSIZE_UNKNOWN:
    # One-shot decompress() needs a known content size, so stream instead.
    reader = dctx.stream_reader(io.BytesIO(compressed_data))
    length = len(reader.read())
else:
    length = frame_params.content_size
```
Pros & Cons of Embedding content_size
| Approach | Pros | Cons |
|---|---|---|
| `content_size` included | Easy validation, progress tracking, and memory planning | Adds up to 8 bytes of overhead per frame |
| `content_size` omitted | More compact output; suits streams of unknown length | Loses visibility; requires decompression to learn the size |
Choose based on your priority: checking info or keeping file size small.
Alternative Tools and Libraries
If python-zstandard does not do what you need, consider these alternatives:

- **Zstandard CLI**: supports a `--content-size` flag to ensure size metadata is written:

  ```
  zstd --content-size input.txt
  ```

- **C API**: full control in code, well suited to performance-critical programs.
- **Rust ecosystem**: tools like `zstd-frame-analyzer` inspect frames quickly and report detailed size information.
Comparing gzip and ZSTD content size handling
Gzip stores the uncompressed size (modulo 2^32) in a trailer at the end of the stream, which makes it easier to retrieve. But:
| Feature | gzip | ZSTD |
|---|---|---|
| Size always recorded | ✅ (modulo 2^32) | ❌ (optional) |
| Multi-frame support | ❌ | ✅ |
| Streaming | Limited | Designed for it |
| Decompression speed | Slower | Faster (up to 50% faster; Saltaré, 2021) |
Gzip offers reliable size metadata, but ZSTD wins on speed, modern design, and flexibility.
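For comparison, gzip's trailer really can be read directly with the standard library alone. This sketch pulls the ISIZE field from the last four bytes of a gzip member:

```python
import gzip
import struct

payload = b"x" * 1000
blob = gzip.compress(payload)

# The last 4 bytes of a gzip member are ISIZE: the uncompressed
# length modulo 2**32, stored little-endian.
(isize,) = struct.unpack("<I", blob[-4:])
print(isize)
```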
Best Practices for ZSTD Compression in Python
- 📝 Be explicit: use `write_content_size=True` if you will need the original size later.
- 📦 Prefer single-frame compression when the size must be easy to read back.
- 🔄 Decompress to determine the size when the metadata is absent.
- 🧪 Test multi-frame edge cases before shipping to production.
- 📊 Plan memory carefully in streaming systems.
What Developers Should Remember
When using ZSTD compression in Python, especially with python-zstandard, understanding how `content_size` works, and when it is absent, prevents confusion and wasted effort. Do not read `-1` as a failure; read it as a signal to fall back to another strategy. If knowing the size reliably matters to your program, configure compression to record it or compute the size during decompression.
Further Reading and Resources
- Zstandard format documentation (Facebook)
- Python-zstandard GitHub and documentation
- ZSTD metadata explanation
- Saltaré compression benchmarking blog
If you set it up well and really understand how it works, ZSTD is one of the best tools for fast, large-scale compression in today's Python programs.
Citations
- Facebook. (2020). Zstandard Frame Format Specification. Retrieved from https://facebook.github.io/zstd/zstd_manual.html
- Python-zstandard. (2023). GitHub Repository. Retrieved from https://github.com/indygreg/python-zstandard
- Saltaré, J. (2021). ZSTD vs GZIP Compression Benchmarks. Retrieved from https://saltares.com/blog/zstd-vs-gzip-compression-benchmarks/