Grep a string and return the line with the highest value, unless there is a tie

I have a list of ASVs (used like keys) taken from a different file. For each one, I want to return the matching line with the highest value. Sometimes several lines tie for the highest value, which is a problem, so in that case I want to return a message instead.

I am able to return the line with the highest value after sorting, but I don’t know how to account for ties properly.

Here is a snippet of my file called file.txt:

ASV,Kingdom,Phylum,Class,Order,Family,Genus,Species,Hits
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5
29ec61e470705074f483368a70ad18a7,Bacteria,Chloroflexota,Anaerolineae,???,???,???,uncultured Anaerolineae,2
29ec61e470705074f483368a70ad18a7,Bacteria,Chloroflexota,???,???,???,???,uncultured Chloroflexota,1
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,unidentified marine,1
29ec61e470705074f483368a70ad18a7,Bacteria,Chloroflexota,Chloroflexia,Chloroflexales,Chloroflexaceae,Chloroflexus,uncultured Chloroflexus,1
74627d6dc445e8b5f46a787cf81c4294,Bacteria,Pseudomonadota,Gammaproteobacteria,Legionellales,Legionellaceae,???,uncultured Legionellaceae,2
74627d6dc445e8b5f46a787cf81c4294,Bacteria,???,???,???,???,???,uncultured bacterium,5
74627d6dc445e8b5f46a787cf81c4294,Bacteria,Pseudomonadota,Gammaproteobacteria,Legionellales,Legionellaceae,Legionella,Legionella sp.,3
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma utsteinense,2
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,???,???,???,???,???,uncultured bacterium,2
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma sp.,2
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma rigui,6
8964b7d833654ceedbdb6f6f25fb7d6a,Bacteria,???,???,???,???,???,uncultured bacterium,8
8964b7d833654ceedbdb6f6f25fb7d6a,Bacteria,Bacillota,Tissierellia,Tissierellales,Peptoniphilaceae,Finegoldia,Finegoldia magna,1
8964b7d833654ceedbdb6f6f25fb7d6a,???,???,???,???,???,???,uncultured organism,1
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Sphingobacteriia,Sphingobacteriales,???,???,uncultured Cytophagales,2
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma jeollabukense,2
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma migulaei,2
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma sp.,1

As an example, searching for the ASV 29ec61e470705074f483368a70ad18a7 and returning the match with the highest value (the very last column) is easy:

Code:

> grep 29ec61e470705074f483368a70ad18a7 file.txt | sort -t, -nr -k9 | head -n1

# Output
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5

But for an ASV such as 9966f0e6e452c31de46d030bab01fdd9, three lines share the top value of 2, so I would need the command to detect the tie and output a message instead:

Ideal output:

  > grep 9966f0e6e452c31de46d030bab01fdd9 file.txt | does something

  # Output
  CHECK: There are 3 lines tied for top.

Solution:

I would opt for an awk solution:

  • eliminates need for sorting
  • eliminates the extra processes of a grep | sort | head pipeline
  • easy to add desired logic

One awk idea:

awk -F, -v asv="29ec61e470705074f483368a70ad18a7" '
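$1 == asv { if ($NF >= max_val) {          # ties and new maxima both qualify
               if ($NF > max_val) {        # strictly larger: discard earlier matches
                  delete matches
                  cnt = 0
                  max_val = $NF
               }
               matches[++cnt] = $0         # remember every row at the current maximum
            }
          }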
END       { if (cnt == 0)
               print "WARNING: No lines found."
            else if (cnt == 1)
               print matches[cnt]
            else
               printf "CHECK: There are %s lines tied for top [max value = %s].\n", cnt, max_val
          }
' file.txt

For asv="123" this generates:

WARNING: No lines found.

For asv="29ec61e470705074f483368a70ad18a7" this generates:

29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5

For asv="9966f0e6e452c31de46d030bab01fdd9" this generates:

CHECK: There are 3 lines tied for top [max value = 2].

Wrapping this in a bash function:

max_asv() {
awk -F, -v asv="$1" '
$1 == asv { if ($NF >= max_val) {
               if ($NF > max_val) {
                  delete matches
                  cnt = 0
                  max_val = $NF
               }
               matches[++cnt] = $0
            }
          }
END       { if (cnt == 0)
               print "WARNING: No lines found."
            else if (cnt == 1)
               print matches[cnt]
            else
               printf "CHECK: There are %s lines tied for top [max value = %s].\n", cnt, max_val
          }
' "$2"
}
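
One caveat: max_val starts out unset, and awk treats an unset variable as 0 in a numeric comparison, so max_asv assumes the values in the last column are never negative. If that cannot be guaranteed, a minimal sketch of a variant seeds the maximum from the first matching row instead. The name max_asv_seeded is made up for illustration; the rest mirrors the logic above.

```shell
# Hypothetical variant of max_asv: the running maximum is seeded from the
# first matching row rather than from an unset variable.
# Usage: max_asv_seeded ASV file.txt
max_asv_seeded() {
awk -F, -v asv="$1" '
$1 == asv { val = $NF + 0                      # force a numeric value
            if (cnt == 0 || val > max_val) {   # first match, or a new maximum
               delete matches
               cnt = 0
               max_val = val
            }
            if (val == max_val)                # keep every row at the maximum
               matches[++cnt] = $0
          }
END       { if (cnt == 0)
               print "WARNING: No lines found."
            else if (cnt == 1)
               print matches[1]
            else
               printf "CHECK: There are %d lines tied for top [max value = %s].\n", cnt, max_val
          }
' "$2"
}
```

For the sample file.txt, where every value is positive, this should behave the same as max_asv.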

Taking the function for a test drive:

$ max_asv 123 file.txt
WARNING: No lines found.

$ max_asv 29ec61e470705074f483368a70ad18a7 file.txt
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5

$ max_asv 9966f0e6e452c31de46d030bab01fdd9 file.txt
CHECK: There are 3 lines tied for top [max value = 2].
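
Since the question mentions that the ASVs come from a different file, the same idea can probably be stretched to cover every key in one pass over the data. This is only a sketch under assumptions: the key file is taken to hold one ASV per line, and both the function name max_asv_all and the file name asvs.txt are hypothetical. The messages also include the ASV so that each output line can be traced back to its key.

```shell
# Hypothetical helper: report the top row (or a tie message) for every ASV
# listed in a key file, reading the data file only once.
# Usage: max_asv_all asvs.txt file.txt
max_asv_all() {
awk -F, '
NR == FNR  { if (!($1 in want)) {          # first file: remember each key once,
                want[$1]                   # in the order it was listed
                order[++n] = $1
             }
             next
           }
$1 in want { val = $NF + 0
             if (!($1 in cnt) || val > max_val[$1]) {   # first match or new maximum
                cnt[$1] = 0
                max_val[$1] = val
             }
             if (val == max_val[$1]) {
                cnt[$1]++
                best[$1] = $0              # only printed when the count stays at 1
             }
           }
END        { for (i = 1; i <= n; i++) {
                k = order[i]
                if (!(k in cnt))
                   printf "WARNING: No lines found for %s.\n", k
                else if (cnt[k] == 1)
                   print best[k]
                else
                   printf "CHECK: %s has %d lines tied for top [max value = %s].\n", k, cnt[k], max_val[k]
             }
           }
' "$1" "$2"
}
```

Note that the NR == FNR test assumes the key file is not empty; with an empty key file awk would treat the data file as the key list and print nothing.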