Merging CSVs into one results in an exponentially bigger file

I have 600 CSV files of ~1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db, so my first step would be to merge them into one big CSV (of ~600 MB, right?) before importing it into a SQL db.

However, when I run the following bash commands (to merge all files while keeping only one header):

cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done

The resulting file.csv reaches a size of 38 GB, at which point the process stops because there is no space left on the device.

So my question is: why is the merged file more than 50 times bigger than expected? And what can I do to put these files into a sqlite3 db at a reasonable size?

>Solution:

I guess my first question is: if you know how to do a for loop, why do you need to merge all the files into a single CSV file? Can’t you just load them one after the other?
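
For instance, the sqlite3 command-line shell can import each chunk straight into the database. Here is a minimal sketch, assuming sqlite3 3.32 or newer (for the --csv and --skip options of .import), a hypothetical database file data.db and table name chunks, and that every file shares the same header:

# The table does not exist yet, so .import creates it and uses the first
# file's header row as the column names.
sqlite3 data.db ".import --csv file-chunk0001.csv chunks"

# Import the remaining chunks, skipping each file's own header row.
for f in file-chunk*.csv; do
  [ "$f" = "file-chunk0001.csv" ] && continue
  sqlite3 data.db ".import --csv --skip 1 $f chunks"
done

The resulting database should be on the order of the original 600 MB of CSV data, not tens of gigabytes.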

But your actual problem is effectively an infinite loop: your wildcard (*.csv) includes the file you’re writing to, so when the loop reaches file.csv, tail reads from it while the same command appends to it, and the file keeps growing until the disk fills up. You could put your output file in a different directory or make sure your file glob does not include the output file (for f in file-*.csv, for example).
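
If you still want a single merged file, one way is to write the output somewhere the glob cannot reach (the paths here are just illustrative):

head -n1 file-chunk0001.csv > /tmp/merged.csv
for f in file-chunk*.csv; do
  tail -n +2 "$f" >> /tmp/merged.csv   # the output lives outside the *.csv glob, so it is never re-read
done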
