Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Sort according to two columns and extract top two based on last column

I have a file with three columns. I would like to extract rows with top two values in column 3 for each unique value in column 2.

cat file.list
run1/xx2/x2c1.txt 21 -190
run1/xx2/x2c2.txt 19 -180
run1/xx2/x2c3.txt 18 -179
run1/xx2/x2c4.txt 19 -162
run1/xx2/x2c5.txt 21 -172
run2/xx2/x2c1.txt 21 -162
run2/xx2/x2c2.txt 18 -192
run2/xx2/x2c3.txt 19 -191
run2/xx2/x2c4.txt 19 -184
run2/xx2/x2c5.txt 21 -179
run3/xx2/x2c1.txt 19 -162
run3/xx2/x2c2.txt 19 -192
run3/xx2/x2c3.txt 21 -191
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c5.txt 19 -179

expected output

run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190

I feel like some combination of sort, uniq and awk might accomplish but I can’t properly execute it. I can sort by columns

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

sort -nk2 -nk3 file.list 

which gives me an output sorted by -k2 and -k3 as follows,

run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run1/xx2/x2c3.txt 18 -179
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run2/xx2/x2c4.txt 19 -184
run1/xx2/x2c2.txt 19 -180
run3/xx2/x2c5.txt 19 -179
run1/xx2/x2c4.txt 19 -162
run3/xx2/x2c1.txt 19 -162
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190
run2/xx2/x2c5.txt 21 -179
run1/xx2/x2c5.txt 21 -172
run2/xx2/x2c1.txt 21 -162

but then I get stuck on how to extract only the rows with best two scores in the last column for 18, 19 and 20.

I would really appreciate any bash solutions.

>Solution :

Piping the current sort results to awk:

$ sort -nk2 -nk3 file.list | awk 'a[$2]++ < 2'
run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190

Where:

  • field #2 ($2) is used as the index for array a[]
  • if the value stored in the array is less than 2 then print the current input line
  • then increment the counter (++)
  • 1st time we see a[18] the count is 0, we print the line, and increment the count by 1
  • 2nd time we see a[18] the count is 1, we print the line, and increment the count by 1
  • 3rd (to nth) time we see a[18] the count is greater than or equal to 2, we do not print the line, and increment the count

An alternative where we increment the count first:

$ sort -nk2 -nk3 file.list | awk '++a[$2] <= 2'
run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading