I am creating an AWK program that I named join. It will be used to join two files via a composite key.
The program is run by passing it two files, file1 and file2, and two variables, f1cols ("file1 columns") and f2cols ("file2 columns"). Each variable holds a comma-separated list of numbers. The numbers identify fields, e.g., f1cols='1,2,3,4' means fields $1, $2, $3, $4 in file1. Here are a couple of examples of invoking the program:
join -v f1cols='1,2,3,4' -v f2cols='2,3,4,5' file2 file1
join -v f1cols='1,3,5' -v f2cols='1,2,3' file2 file1
I want to store the content of file2 in an array named a. The subscripts of a are to be the concatenation of the values of the fields identified by f2cols. So, if the program is invoked like this:
join -v f1cols='1,2,3,4' -v f2cols='2,3,4,5' file2 file1
then the subscript should be:
a[$2,$3,$4,$5]
If the program is invoked like this:
join -v f1cols='1,3,5' -v f2cols='1,2,3' file2 file1
then the subscript should be:
a[$1,$2,$3]
To generalize the problem statement:
Given this command-line argument:
f2cols='x1,x2,...,xn'
where each xi is a non-negative integer, create this subscript in the AWK program:
a[$x1,$x2,...,$xn]
The subscript is the string resulting from concatenating the values of fields $x1,$x2,…,$xn.
I have no idea how to create such subscripts. A little help please.
Solution:
Sample inputs:
$ head file1 file2
==> file1 <==
as,df,as,df,sd,f
1,a,2,b,3,4,5,6,7,8,9
x,xxx,y,yyy,z,a,b,c
==> file2 <==
a,b,c,d,e,f
g,h,j,k,l,m
1,2,3,4,5,6
x,y,z,a,b,c
One awk approach:
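awk -v f1cols='1,3,5' -v f2cols='1,2,3' '
BEGIN   { FS = OFS = ","
          m = split(f1cols, f1, ",")     # field numbers to use from file1
          n = split(f2cols, f2, ",")     # field numbers to use from file2
        }
# 1st file on the command line (file2): store each line under its composite key
FNR==NR { idx = $(f2[1])
          for (i = 2; i <= n; i++)
              idx = idx FS $(f2[i])
          arr[idx] = $0
          next
        }
# 2nd file (file1): build the key the same way and look it up
        { idx = $(f1[1])
          for (i = 2; i <= m; i++)
              idx = idx FS $(f1[i])
          if (idx in arr)
              print "found index: " idx
        }
' file2 file1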
This generates:
found index: 1,2,3
found index: x,y,z
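If you want the joined lines rather than just the matching keys, here is a minimal sketch along the same lines. The names join_keys and make_key are hypothetical, made up for this example. It separates the key parts with SUBSEP instead of FS, which is the separator awk itself uses for a[$1,$2,$3]. The output layout (the file1 line followed by the matching file2 line) is an assumption, since the question does not say what the joined record should look like.

```shell
# Hypothetical wrapper: join_keys F1COLS F2COLS FILE2 FILE1
join_keys() {
    awk -v f1cols="$1" -v f2cols="$2" '
    # Build a composite key from the fields whose numbers are in cols[1..n].
    function make_key(cols, n,    i, key) {
        key = $(cols[1])
        for (i = 2; i <= n; i++)
            key = key SUBSEP $(cols[i])
        return key
    }
    BEGIN {
        FS = OFS = ","
        m = split(f1cols, f1, ",")
        n = split(f2cols, f2, ",")
    }
    # First file named on the command line (file2): remember each line by key.
    FNR == NR { arr[make_key(f2, n)] = $0; next }
    # Second file (file1): print both lines when the key was seen in file2.
    {
        key = make_key(f1, m)
        if (key in arr)
            print $0, arr[key]
    }
    ' "$3" "$4"
}
```

With the sample files above, join_keys '1,3,5' '1,2,3' file2 file1 should print:

1,a,2,b,3,4,5,6,7,8,9,1,2,3,4,5,6
x,xxx,y,yyy,z,a,b,c,x,y,z,a,b,c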