How to prevent filling up memory when hashing large files with xxhash?

I’m trying to calculate xxhash of video files using the following code:

import xxhash

def get_hash(file):
    with open(file, 'rb') as input_file:
        # Reads the entire file into memory before hashing
        return xxhash.xxh3_64(input_file.read()).hexdigest()

Some of the files are larger than the amount of RAM on the machine. When hashing those files, the memory fills up, followed by swap filling up, at which point the process gets killed by the OS (I assume).
What is the correct way to handle these situations? Thank you!

>Solution:

Instead of hashing the entire file in one go, read it in chunks and update the hash as you go. Once a chunk has been fed to the hasher, it can be discarded.

import xxhash
from functools import partial

def get_hash(file):
    CHUNK_SIZE = 2 ** 20  # 1 MiB per read; adjust to whatever comfortably fits in memory
    with open(file, 'rb') as input_file:
        x = xxhash.xxh3_64()
        # iter() keeps calling read(CHUNK_SIZE) until it returns b'' at end of file
        for chunk in iter(partial(input_file.read, CHUNK_SIZE), b''):
            x.update(chunk)
        return x.hexdigest()
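If you're on Python 3.8+, the same streaming loop can also be written without functools.partial by using the walrus operator. This is just a minimal sketch of that variant; the chunk size and the 'movie.mkv' path are placeholders, not anything from your setup:

import xxhash

def get_hash(file, chunk_size=2 ** 20):
    # Stream the file through the hasher, holding at most one chunk in memory at a time
    with open(file, 'rb') as input_file:
        x = xxhash.xxh3_64()
        while chunk := input_file.read(chunk_size):
            x.update(chunk)
        return x.hexdigest()

print(get_hash('movie.mkv'))  # placeholder path

Either way, memory usage stays bounded by the chunk size rather than the file size.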
 
