Using awk command to compare values in separate rows and multiple columns?

October 1, 2024

I have a file with four columns:

text1  a1  a2   5
text2  b2  b8   10
text3  b9  b4   15
text3  b9  b4   25
text3  b9  b4   20
text4  h1  g8   50
text4  g1  k5   70
text4  g1  k5   80
text4  g1  k5   50
text5  y5  p3   25

I wanted the following result:

text1  a1  a2   5
text2  b2  b8   10
text3  b9  b4   25
text4  h1  g8   50
text4  g1  k5   80
text5  y5  p3   25

That is, remove the duplicate rows: when the first, second and third columns are the same, keep only the row with the highest value in the fourth column.

I tried it as follows:

awk '!x[$1]++' file.txt

>Solution :

You are only indexing on $1 but your question requires the key to be $1..$3, and obviously your attempt does nothing to pick the maximum value instead of the first value for that key.
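
To see both problems separately, here is a small sketch. The file name sample.txt and its four rows are made up for illustration. Widening the key to all three columns fixes the grouping, but the filter still keeps the first row for each key rather than the row with the largest value.

```shell
# Hypothetical sample data, a subset of the rows in the question.
cat > sample.txt <<'EOF'
text3  b9  b4   15
text3  b9  b4   25
text4  h1  g8   50
text4  g1  k5   70
EOF

# Keyed on $1 only: the text4/g1/k5 row is lost because text4 was already seen.
by_first=$(awk '!x[$1]++' sample.txt)

# Keyed on $1, $2 and $3: grouping is right, but the first row (15) wins, not the maximum (25).
by_three=$(awk '!x[$1,$2,$3]++' sample.txt)

printf '%s\n---\n%s\n' "$by_first" "$by_three"
```

With the key fixed, what remains is choosing the maximum for each key.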

If the values for a key are always adjacent, you can collect them until you reach the next key, and then print that with the maximum value.

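awk '{ key = $1 "_" $2 "_" $3 }
  key != k {                        # first line of a new key
    if (NR > 1) print v             # flush the best line of the previous key
    k = key; s = $4; v = $0; next }
  $4 > s { s = $4; v = $0 }         # larger value for the current key
  END { if (NR) print v }' file.txt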

We collect the first three columns in k and the maximum value for this key in s. The entire line which contained the maximum value is v so that we don’t have to assemble the key and the value back for printing it. The script generally prints a line for the previous key when it finds a new key, but then of course we also need to do that when we fall off the end of the file, so we do that in the END block.

If the rows for a key can’t be guaranteed to be adjacent, sorting the file and piping to Awk is probably easier than writing a more elaborate script, especially if you haven’t learned any Awk at all yet. (Though do spend an hour or two; it’s a good use of your time.)
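
A minimal sketch of the sort-and-pipe idea, assuming whitespace-separated columns and a numeric fourth column. The file name unsorted.txt and its contents are made up for illustration. Sorting the fourth column in descending order within each key puts the maximum first, so a plain first-seen filter on the three-column key is enough.

```shell
# Hypothetical input in which the rows for one key are not adjacent.
cat > unsorted.txt <<'EOF'
text3  b9  b4   15
text1  a1  a2   5
text3  b9  b4   25
text4  g1  k5   70
text3  b9  b4   20
text4  g1  k5   80
EOF

# Sort by columns 1-3 as text (-b ignores leading blanks in each key),
# then by column 4 numerically in descending order, so the first row of
# each key carries its maximum. Awk then keeps only the first row seen
# for each ($1, $2, $3) key.
result=$(sort -b -k1,1 -k2,2 -k3,3 -k4,4nr unsorted.txt | awk '!seen[$1,$2,$3]++')
printf '%s\n' "$result"
```

Note that the output comes out in sorted key order rather than in the original file order, which may or may not matter for your use.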