I have a 4 GB CSV I am trying to search to get a subset of it. I have a CSV file that contains the keywords I'm searching for (these keywords will be in the first column of the big CSV).
I tried the line below, but it ended up taking over an hour to finish. I needed to use tr to get rid of the Windows carriage-return characters.
LC_ALL=C grep -F -i -f <(tr -d '\r' < keywords.csv) big_csv.csv > output.csv
Is there any way I could optimize this? Anything I'm missing? Would it be better to use awk or another tool for this? I even thought about sorting and then splitting the big CSV by the first column, so that when I search I could just find the keyword by file name and then append that file to a new one. Is there a best practice for this? I'm trying to keep this as POSIX as possible.
As requested here is some of the sample data.
ADLV,-1.741774,0.961072,-0.751392,-0.935572,-2.269994,1.081103,-0.831244,1.540083,0.474326,-1.322924,2.199037,-0.919939,0.641496,-0.584152,0.729028,0.608351,-0.522026,0.966026,-0.793949,-1.623368,1.16177,-0.642438,-0.675811,-0.214964,-2.263053,2.188642,0.302449,0.770106
There will be multiple rows sharing the same first-column value.
There's more data in each row, but it's too long to post here.
The keywords file will look like this:
ADLV
ADVG
At most, keywords.csv will have 1,000 keywords, each 4 letters long.
Here’s a gist with sample data https://gist.github.com/fishnibble/9d95658c352a1acab3cec3e965defb3f
>Solution :
Using any awk, it sounds like all you need is:
awk -F, 'NR==FNR{sub(/\r$/,""); keys[$1]; next} $1 in keys' keywords.csv big_csv.csv
The sub(/\r$/,"") is there because you had tr -d '\r' < keywords.csv in your code – if you don’t have DOS line endings in your keywords file then you don’t need that.
I see you also have -i in your grep command, though, which implies you need the matching to be case-insensitive. If so, then this is what you need, still using any awk (note that tolower() must be applied after the \r is stripped, and to both sides of the comparison):
awk -F, 'NR==FNR{sub(/\r$/,""); keys[tolower($1)]; next} tolower($1) in keys' keywords.csv big_csv.csv
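A small demo of the case-insensitive version (the data rows here are invented for illustration):

```shell
# One upper-case keyword with a DOS line ending.
printf 'ADLV\r\n' > keywords.csv

# Mixed-case first fields; the first two should match, the third should not.
printf 'adlv,1.0\nAdLv,2.0\nXXXX,3.0\n' > big_csv.csv

# Lower-case both the stored keys and each first field before comparing.
awk -F, 'NR==FNR{sub(/\r$/,""); keys[tolower($1)]; next} tolower($1) in keys' keywords.csv big_csv.csv > output.csv

cat output.csv
```

Both the adlv and AdLv rows are printed, regardless of the keyword's case.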
The grep line you tried would not only take a long time to finish, it could also produce incorrect output. It searches across the whole line of big_csv.csv, not just the first field, so it would report a false match if a keyword appeared in some other position in a line. It would also report a false match if the keyword you wanted happened to be a substring of some longer value.
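If you do want to stick with grep, you can avoid both false-match problems by turning each keyword into a pattern anchored to the start of the line and followed by a comma. This is a sketch; pat.txt is a temporary file name I've made up, and you can add -i back if you need case-insensitive matching:

```shell
printf 'ADLV\r\n' > keywords.csv

# Only the first row should match: the other two contain ADLV as a
# substring or in a later field.
printf 'ADLV,1.0\nXADLV,2.0\nBBBB,ADLV\n' > big_csv.csv

# Strip CRs and turn each keyword KEY into the anchored regex ^KEY,
tr -d '\r' < keywords.csv | sed 's/.*/^&,/' > pat.txt

# No -F now, since the patterns are regexes; each line is only tested
# against its first field.
LC_ALL=C grep -f pat.txt big_csv.csv > output.csv

cat output.csv
```

Only the row whose first field is exactly ADLV survives.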