Memory Error when parsing a large number of files

I am parsing 6k CSV files to merge them into one. I need this so I can analyze them together and train an ML model. There are too many files, and my computer ran out of memory when I simply concatenated them.


s = ''

for f in csv_files:
    # read the csv file
    #df = df.append(pd.read_csv(f))
    s = s + open(f, mode='r').read()[32:]
    print(f)

file = open('bigdata.csv', mode='w')
file.write(s)
file.close()


I need a way to create a single dataset from all the files (60 GB) to train my ML model.

Solution:

I believe this may help:

file = open('bigdata.csv', mode='w')

for f in csv_files:
    # write each file's contents (minus the first 32 characters) straight to
    # the output, so only one input file is held in memory at a time
    s = open(f, mode='r').read()[32:]
    file.write(s)

file.close()

In contrast, your original code needs at least as much memory as the size of the output file, which is 60 GB and may well be larger than your computer's memory.

However, if a single input file is large enough to exhaust your memory on its own, this method will also fail. In that case you may need to read each CSV file line by line and write the lines to the output file. I didn't write that version because I'm not sure what your magic number 32 is skipping.
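For completeness, here is a minimal sketch of that line-by-line variant, assuming the magic number 32 is meant to skip a fixed-length header at the start of each file (my guess, not something the question confirms):

with open('bigdata.csv', mode='w') as out:
    for f in csv_files:
        with open(f, mode='r') as src:
            # skip the first 32 characters, mirroring the [32:] slice above
            # (assumed to be a fixed-length header)
            src.read(32)
            # stream the rest one line at a time, so peak memory stays
            # bounded by a single line rather than a whole file
            for line in src:
                out.write(line)

This keeps peak memory at roughly the size of one line, regardless of how large any individual file is.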
