I’m writing a script which reads variables from the columns of one file, pattern matches with a column of another file, and prints whole lines from this other file to an output file. I think I’ve come close to a solution, but only column names in the file outputs are produced (NR==1 is written, nothing else).
while read -r value1 value2
do
echo "$value2"
awk -F "\t" 'NR==1; $2 == "$value1" { print $0 }' data.txt > "${value2}".out
done < "var_table.txt"
Here’s an example head of my data:
data.txt
V1 V2 V3 V4
1 dat-131.2 2071.49 3.11E-302
2 dat-219.1 744.48 3.11E-302
3 dat-120.2 13.29 3.11E-302
4 dat-577.1 1.71 3.11E-302
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
…
var_table.txt
dat-501.16 VarA
dat-151.5 VarB
dat-518.9 VarC
dat-535.2 VarD
dat-49.0 VarE
The expected output would be to create a tsv file called VarD.out ("${value2}.out), populated with a header (NR==1), plus whole rows from data.txt corresponding to the value "dat-535.2" in column 2 (print $0 if $2 == "$value1").
VarD.out _|>
V1 V2 V3 V4
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
I hope this makes sense and is somewhat replicable. I’ve looked at other answers but am not able to understand how to incorporate multiple columns of data. My guess is the problem is variable assignment because of multiple files, as $2 and $0 correspond to data from data.txt, and value1 and value2 refer to columns of var_table.txt? Any help is welcome.
Thanks
>Solution :
Fixing OP’s code to properly pass a bash variable to awk:
while read -r value1 value2
do
echo "$value2"
awk -F "\t" -v value="$value1" 'NR==1 || $2==value' data.txt > "${value2}".out
done < "var_table.txt"
Where:
-v value="$value1"– assigns the value of thebashvariablevalue1to theawkvaraiblevalueNR==1 || $2==value– if first record of input file, or 2nd field equals theawkvariablevalue(akabashvariablevalue1) then write to output file
This generates:
$ head Var*out
==> VarA.out <==
V1 V2 V3 V4
==> VarB.out <==
V1 V2 V3 V4
==> VarC.out <==
V1 V2 V3 V4
==> VarD.out <==
V1 V2 V3 V4
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
==> VarE.out <==
V1 V2 V3 V4
A more efficient method where we pull the loop into awk:
awk '
BEGIN { FS=OFS="\t" }
FNR==NR { vars[$1]=$2; next } # 1st file: save contents to array vars[] using 1st field as the index
FNR==1 { for (i in vars) # 2nd file: write header row to each of our output files (ie, for each index from the vars[] array)
print $0 > (vars[i] ".out")
next
}
$2 in vars { print $0 > (vars[$2] ".out") } # 2nd file: if 2nd field is an index in array vars[] then print the current line to the associated output file
' var_table.txt data.txt
This also generates:
==> VarA.out <==
V1 V2 V3 V4
==> VarB.out <==
V1 V2 V3 V4
==> VarC.out <==
V1 V2 V3 V4
==> VarD.out <==
V1 V2 V3 V4
5 dat-535.2 16.07 3.11E-302
6 dat-535.2 12.99 3.11E-302
7 dat-535.2 2147.31 3.11E-302
8 dat-535.2 61.21 3.11E-302
==> VarE.out <==
V1 V2 V3 V4