Can we force Python/Pandas to flush to disk immediately?

I have a setup where a Python script (let’s call it test1.py) spawns a subprocess that executes test2.py. In test2.py, I have some pandas operations that ultimately build a dataframe, test. The final step in test2.py saves the dataframe to CSV (test.to_csv('my_path')). When test2.py completes, test1.py continues, and its next step is to load the CSV file that was just created (i.e., test = pd.read_csv('my_path')).

Now, the issue is that Python is not flushing the buffer to disk, so when test1.py goes to read the CSV file, I get a FileNotFoundError. Of course, if I stop the script, the file is saved to disk. Is there a way to force pandas to flush to disk immediately? I’ve read about using file.flush() and os.fsync(fd), but these don’t seem to apply to my case since I’m not dealing with any file descriptors.

EDIT: Added a (significantly) simplified example

test1.py looks something like:

import subprocess

import pandas as pd


def main():
    cmd = ['python3', 'test2.py']
    output_bytes = subprocess.check_output(cmd, stderr=subprocess.STDOUT, timeout=900)
    output = output_bytes.decode('utf-8')
    # test2.py finished, so I want to read the csv
    df = pd.read_csv('my_path')


if __name__ == '__main__':
    main()

test2.py looks something like:

import pandas as pd
import numpy as np

def main():
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    df.to_csv('my_path')

if __name__ == '__main__':
    main()

>Solution :

but these don’t seem to apply to my case since I’m not dealing with any file descriptors.

You do not have to pass a filename as the first argument to .to_csv. As the pandas.DataFrame.to_csv docs say, you can instead pass a

file-like object implementing a write() function.

Therefore, you can open the file yourself, hand it to to_csv, and flush it explicitly:

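import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
# newline="" stops Python from translating the line endings pandas writes
with open("file.csv", "w", newline="") as f:
    df.to_csv(f)
    f.flush()  # push Python's buffer to the operating system
# leaving the with block closes the file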

Note that if you open the file in text (non-binary) mode, you need to pass newline="" to disable universal-newline translation.
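
If flush() alone turns out not to be enough, os.fsync() can be added on top of it: flush() only moves data from Python's buffer to the operating system, while os.fsync() asks the operating system to commit it to the storage device. The sketch below wraps both steps. The helper name write_durably and its write callback parameter are hypothetical, not part of pandas; the assumption is that you would pass df.to_csv as the callback.

```python
import os
from typing import Callable, TextIO


def write_durably(path: str, write: Callable[[TextIO], object]) -> None:
    """Open path, let `write` fill it, then flush and fsync before closing.

    `write_durably` is a hypothetical helper, not a pandas API.
    """
    # newline="" leaves the writer's own line endings untouched
    with open(path, "w", newline="") as f:
        write(f)              # e.g. df.to_csv, which accepts an open file
        f.flush()             # Python's buffer -> operating system
        os.fsync(f.fileno())  # operating system -> storage device


# Assumed usage with the DataFrame from the question:
#     write_durably("my_path", df.to_csv)
```

Taking a callback keeps the helper independent of pandas, so the same wrapper should work for any writer that accepts an open text file.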
