Awk in do-while loop, read files with multiple columns

December 31, 2022

I’m writing a script which reads variables from the columns of one file, pattern matches with a column of another file, and prints whole lines from this other file to an output file. I think I’ve come close to a solution, but only column names in the file outputs are produced (NR==1 is written, nothing else).

while read -r value1 value2
do
        echo "$value2"
        awk -F "\t" 'NR==1; $2 == "$value1" { print $0 }' data.txt > "${value2}".out
done < "var_table.txt"

Here’s an example head of my data:

data.txt

V1  V2  V3  V4
1   dat-131.2   2071.49 3.11E-302
2   dat-219.1   744.48  3.11E-302
3   dat-120.2   13.29   3.11E-302
4   dat-577.1   1.71    3.11E-302
5   dat-535.2   16.07   3.11E-302
6   dat-535.2   12.99   3.11E-302
7   dat-535.2   2147.31 3.11E-302
8   dat-535.2   61.21   3.11E-302
…

var_table.txt

dat-501.16  VarA
dat-151.5   VarB
dat-518.9   VarC
dat-535.2   VarD
dat-49.0    VarE

The expected output would be to create a tsv file called VarD.out ("${value2}.out), populated with a header (NR==1), plus whole rows from data.txt corresponding to the value "dat-535.2" in column 2 (print $0 if $2 == "$value1").

VarD.out _|>

V1  V2  V3  V4
5   dat-535.2   16.07   3.11E-302
6   dat-535.2   12.99   3.11E-302
7   dat-535.2   2147.31 3.11E-302
8   dat-535.2   61.21   3.11E-302

I hope this makes sense and is somewhat replicable. I’ve looked at other answers but am not able to understand how to incorporate multiple columns of data. My guess is the problem is variable assignment because of multiple files, as $2 and $0 correspond to data from data.txt, and value1 and value2 refer to columns of var_table.txt? Any help is welcome.

Thanks

>Solution :

Fixing OP’s code to properly pass a bash variable to awk:

while read -r value1 value2
do
        echo "$value2"
        awk -F "\t" -v value="$value1" 'NR==1 || $2==value' data.txt > "${value2}".out
done < "var_table.txt"

Where:

-v value="$value1" – assigns the value of the bash variable value1 to the awk varaible value
NR==1 || $2==value – if first record of input file, or 2nd field equals the awk variable value (aka bash variable value1) then write to output file

This generates:

$ head Var*out
==> VarA.out <==
V1      V2      V3      V4

==> VarB.out <==
V1      V2      V3      V4

==> VarC.out <==
V1      V2      V3      V4

==> VarD.out <==
V1      V2      V3      V4
5       dat-535.2       16.07   3.11E-302
6       dat-535.2       12.99   3.11E-302
7       dat-535.2       2147.31 3.11E-302
8       dat-535.2       61.21   3.11E-302

==> VarE.out <==
V1      V2      V3      V4

A more efficient method where we pull the loop into awk:

awk '
BEGIN      { FS=OFS="\t" }
FNR==NR    { vars[$1]=$2; next }                 # 1st file: save contents to array vars[] using 1st field as the index
FNR==1     { for (i in vars)                     # 2nd file: write header row to each of our output files (ie, for each index from the vars[] array)
                 print $0 > (vars[i] ".out")
             next
           }
$2 in vars { print $0 > (vars[$2] ".out") }      # 2nd file: if 2nd field is an index in array vars[] then print the current line to the associated output file
' var_table.txt data.txt

This also generates:

==> VarA.out <==
V1      V2      V3      V4

==> VarB.out <==
V1      V2      V3      V4

==> VarC.out <==
V1      V2      V3      V4

==> VarD.out <==
V1      V2      V3      V4
5       dat-535.2       16.07   3.11E-302
6       dat-535.2       12.99   3.11E-302
7       dat-535.2       2147.31 3.11E-302
8       dat-535.2       61.21   3.11E-302

==> VarE.out <==
V1      V2      V3      V4