Merging CSVs into one results in an exponentially bigger file

I have 600 CSV files of ~1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db, so my first step would be to merge them into one big CSV (of ~600 MB, right?) before importing it into a SQL db.

However, when I run the following bash commands (to merge all files while keeping only one header):

cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done

The resulting file.csv reaches a size of 38 GB, at which point the process stops because there is no space left on the device.

So my question is: why is the merged file more than 50 times bigger than expected? And what can I do to put these files into a sqlite3 db at a reasonable size?

>Solution:

I guess my first question is: if you know how to do a for loop, why do you need to merge all the files into a single CSV file? Can’t you just load them one after the other?
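
For instance, the sqlite3 command-line shell can import each chunk straight into the database. Here is a minimal sketch, assuming sqlite3 3.32 or newer (for the --csv and --skip options of .import), a hypothetical database file data.db and table name chunks, and that every file shares the same header:

# The table does not exist yet, so .import creates it and uses the first
# file's header row as the column names.
sqlite3 data.db ".import --csv file-chunk0001.csv chunks"

# Import the remaining chunks, skipping each file's own header row.
for f in file-chunk*.csv; do
  [ "$f" = "file-chunk0001.csv" ] && continue
  sqlite3 data.db ".import --csv --skip 1 $f chunks"
done

The resulting database should be on the order of the original 600 MB of CSV data, not tens of gigabytes.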

But your actual problem is effectively an infinite loop: your wildcard (*.csv) includes the file you’re writing to, so when the loop reaches file.csv, tail reads from it while the same command appends to it, and the file keeps growing until the disk fills up. You could put your output file in a different directory or make sure your file glob does not include the output file (for f in file-*.csv, for example).
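
If you still want a single merged file, one way is to write the output somewhere the glob cannot reach (the paths here are just illustrative):

head -n1 file-chunk0001.csv > /tmp/merged.csv
for f in file-chunk*.csv; do
  tail -n +2 "$f" >> /tmp/merged.csv   # the output lives outside the *.csv glob, so it is never re-read
done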
