How to Extract Clusters with Multiple Rows Using Bash Commands?

I’m trying to extract clusters from a text file using Bash commands. Each cluster is delineated by a line starting with >Cluster. I want to extract only those clusters with more than one data row within them. Here’s a simplified example of my input file:

>Cluster 199
0       2599aa, >CAD5117741.1... *
>Cluster 200
0       2579aa, >CAD5112262.1... *
>Cluster 201
0       2578aa, >CAD5116287.1... *
>Cluster 202
0       2578aa, >CAD5122864.1... *
1       1867aa, >CAD5122865.1... at 100.00%
2       2369aa, >CAD5122866.1... at 100.00%
>Cluster 203
0       2573aa, >CAD5110750.1... *
>Cluster 204
0       2571aa, >CAD5116249.1... *
>Cluster 205
0       2558aa, >CAD5122682.1... *
>Cluster 206
0       2553aa, >CAD5126525.1... *
>Cluster 207
0       2551aa, >CAD5115834.1... *

In this example, I want to extract only Cluster 202 because it has more than one row of data within it. The desired output would be:

>Cluster 202
0       2578aa, >CAD5122864.1... *
1       1867aa, >CAD5122865.1... at 100.00%
2       2369aa, >CAD5122866.1... at 100.00%

I’m currently using awk to process the file but struggling to figure out how to extract these clusters properly. Can someone guide me in accomplishing this task efficiently using Bash commands?

I attempted to use the following awk command:

awk '/^>Cluster/ {cluster=$0; count=0; next} {count++} count > 1 {print cluster; print} count == 0 {print cluster}'

When applied to the provided data, it produced the following output:

>Cluster 202 2 2369aa, >CAD5122866.1... at 100.00%

This output is incomplete, as it should include all lines within Cluster 202.

Solution:

Here is a simple working script (just tested and without awk):

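#!/bin/bash

# Print every cluster that contains more than one data row.
process_cluster() {
    local input_file=$1
    local current_cluster=""
    local data_rows=0
    local line

    # IFS= and -r keep each line's whitespace and backslashes intact;
    # the || test also handles a final line with no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == ">Cluster "* ]]; then
            # A new header closes the previous cluster: print it if it qualifies.
            if (( data_rows > 1 )); then
                printf '%s' "$current_cluster"
            fi

            current_cluster="$line"$'\n'
            data_rows=0
        else
            data_rows=$((data_rows + 1))
            current_cluster+="$line"$'\n'
        fi
    done < "$input_file"

    # The last cluster has no header after it, so check it here.
    if (( data_rows > 1 )); then
        printf '%s' "$current_cluster"
    fi
}

process_cluster clusters.txt > output.txt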

Remember to save your input data in a file named clusters.txt, or change the file name in the script above.
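
Since the question asked about awk, the same buffering idea can also be sketched there. The attempt in the question prints rows only once count exceeds 1, by which point the first row of the cluster has already passed without being stored, so it can never be printed. The sketch below instead collects the header and its rows in a buffer and prints the buffer only when a cluster is complete. The function name extract_multi_row_clusters and the helper emit are illustrative names chosen here, not part of the original answer.

```shell
#!/bin/bash

# Print every cluster that contains more than one data row.
# Usage: extract_multi_row_clusters FILE
extract_multi_row_clusters() {
    awk '
        # Print the buffered cluster if it holds more than one data row.
        function emit() { if (rows > 1) printf "%s", block }

        /^>Cluster/ {            # header line: finish the previous cluster
            emit()
            block = $0 ORS       # start a new buffer with this header
            rows = 0
            next
        }
        {                        # data row: append it to the buffer and count it
            block = block $0 ORS
            rows++
        }
        END { emit() }           # the last cluster has no header after it
    ' "$1"
}
```

Calling extract_multi_row_clusters clusters.txt > output.txt should then give the same result as the Bash script, assuming every cluster starts with a line beginning with >Cluster.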