How do you grep/awk from a column in a file?

December 17, 2021

I have a file of IDs called IDs_list.txt that I want to use to extract lines from a second file. That second file has hundreds of IDs, many of which are not in IDs_list.txt.

I’ve tried combinations of if and grep but my results keep coming up empty.

Here is an example of what I’m trying to do and what I’ve done.

cat IDs_list.txt | head -n 4
24
43
56
69

cat sample1.txt | head -n 4
NODE_1_length_148512_cov_24.5066,gi|573017271|gb|CP006568.1|,148512,4513140,8,7289,86.545,0.0,13461,24,madeup species 1
NODE_2_length_122550_cov_25.719,gi|84778498|dbj|AP008232.1|,122550,4171146,13,12690,93.693,0.0,23435,244,madeup species 2
NODE_3_length_103385_cov_25.9802,gi|84778498|dbj|AP008232.1|,103385,4171146,6,4243,88.782,0.0,7836,43,madeup species 3
NODE_4_length_101672_cov_25.6536,gi|84778498|dbj|AP008232.1|,101672,4171146,7,4139,86.799,0.0,7644,955,long name here

The IDs are in the 10th column.

I need to pull out all lines whose ID appears in IDs_list.txt.

So my output should be:

NODE_1_length_148512_cov_24.5066,gi|573017271|gb|CP006568.1|,148512,4513140,8,7289,86.545,0.0,13461,24,madeup species 1
NODE_3_length_103385_cov_25.9802,gi|84778498|dbj|AP008232.1|,103385,4171146,6,4243,88.782,0.0,7836,43,madeup species 3

I’ve tried:

for file in sample?.txt; do awk 'FNR==NR{arr[$0];next} ($10 in arr)' IDs_list.txt $file; done

Nothing comes out. I took this example from another Stack Overflow question.

for i in $(cat IDs_list.txt); do awk -F"," '$10 == $i' sample1.txt; done

But this prints the same output line many times because I am iterating over IDs_list.txt line by line, which is not what I want. I may get the first output line hundreds of times because IDs_list.txt has hundreds of IDs.

Then I tried grep, but that didn’t work either. My syntax is off.

for file in sample?.txt; do for i in $(cat IDs_list.txt); do grep -w '$i' $file; done; done

Nothing is output here. My logic is that, for each sample file, I want to grep the lines that contain an ID found in IDs_list.txt. However, I would prefer to match only on the 10th column, because the IDs can sometimes show up in other columns that are not actually IDs.

Is there an elegant way of doing this in a for loop with grep or awk, or both?

Solution:

You may use this awk:

awk -F, 'NR==FNR {ids[$1]; next} $10 in ids' IDs_list.txt sample1.txt

NODE_1_length_148512_cov_24.5066,gi|573017271|gb|CP006568.1|,148512,4513140,8,7289,86.545,0.0,13461,24,madeup species 1
NODE_3_length_103385_cov_25.9802,gi|84778498|dbj|AP008232.1|,103385,4171146,6,4243,88.782,0.0,7836,43,madeup species 3
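
To run the same filter over every sample file, awk can be given all of them at once rather than being wrapped in a for loop. The sketch below assumes the files are comma-separated with the ID in field 10, as in the question; the function name filter_by_id is made up for illustration. The first attempt in the question most likely printed nothing because it omits -F, so awk splits on whitespace and $10 is not the ID column.

```shell
# Hypothetical helper: print the rows of one or more comma-separated files
# whose 10th field appears in the ID list given as the first argument.
filter_by_id() {
    ids_file=$1
    shift
    # NR==FNR holds only while awk reads the first file (the ID list),
    # so each ID is stored as an array key and the line is skipped.
    # For every later file, a line prints when its 10th field is a stored key.
    awk -F, 'NR==FNR { ids[$1]; next } $10 in ids' "$ids_file" "$@"
}
```

Calling filter_by_id IDs_list.txt sample?.txt prints the matching lines from all sample files. Because the comparison is an exact key lookup on field 10, an ID such as 24 does not match 244, and IDs that appear in other columns are ignored. The NR==FNR idiom assumes the ID list is not empty. In the other two attempts, $i sits inside single quotes, so the shell never expands it; a shell value would normally be passed to awk with -v instead.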