How to write a large file in Python efficiently

I need to convert a large (~100GB) file from XML format to jsonl format. I’m doing this in Python. I know I can read the XML file line by line by iterating over the file –

with open('file.xml', 'r') as file:
    for line in file:
        process_line(line)

Now that I can do this, how do I write the created jsonl data line by line without overwhelming my RAM? The structure of the XML file is such that there is one tag per line, each of which is converted into one JSON object. Will doing something like this work?

with open('input.xml', 'r') as ip, open('output.jsonl', 'w') as op:
    for line in ip:
        op.write(process_line(line))

I’m guessing this will perform a write to the disk for every line? If so, that would take forever. Is there any way I can write to disk in batches?

Solution:

Python’s file handling already takes care of I/O buffering: writes are accumulated in memory and flushed to disk in blocks, not once per op.write() call.

If you want larger blocks, set the buffering parameter on your open() calls.
See https://docs.python.org/3/library/functions.html#open
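For example, here is a minimal sketch along the lines of your second snippet, using a 1 MiB write buffer; process_line here is a hypothetical stand-in for your real XML-to-JSON conversion:

import json

def process_line(line):
    # Hypothetical converter: turn one XML line into one JSON object per line.
    # Replace with your actual XML parsing logic.
    return json.dumps({'raw': line.strip()}) + '\n'

# A buffering value > 1 sets the size in bytes of the underlying write buffer,
# so data is flushed to disk roughly once per megabyte rather than once per line.
with open('input.xml', 'r') as ip, open('output.jsonl', 'w', buffering=1024 * 1024) as op:
    for line in ip:
        op.write(process_line(line))

Even with the default buffer size (typically 8 KiB), each op.write() lands in memory first and the file object flushes to disk in blocks, so your loop will not perform one disk write per line.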

There are of course plenty of other ways to speed up processing of huge files, depending on your actual processing needs.
