How to Extract Clusters with Multiple Rows Using Bash Commands?

April 2, 2024

I’m trying to extract clusters from a text file using Bash commands. Each cluster is delineated by a line starting with >Cluster. I want to extract only those clusters with more than one data row within them. Here’s a simplified example of my input file:

>Cluster 199
0       2599aa, >CAD5117741.1... *
>Cluster 200
0       2579aa, >CAD5112262.1... *
>Cluster 201
0       2578aa, >CAD5116287.1... *
>Cluster 202
0       2578aa, >CAD5122864.1... *
1       1867aa, >CAD5122865.1... at 100.00%
2       2369aa, >CAD5122866.1... at 100.00%
>Cluster 203
0       2573aa, >CAD5110750.1... *
>Cluster 204
0       2571aa, >CAD5116249.1... *
>Cluster 205
0       2558aa, >CAD5122682.1... *
>Cluster 206
0       2553aa, >CAD5126525.1... *
>Cluster 207
0       2551aa, >CAD5115834.1... *

In this example, I want to extract only Cluster 202 because it has more than one row of data within it. The desired output would be:

>Cluster 202
0       2578aa, >CAD5122864.1... *
1       1867aa, >CAD5122865.1... at 100.00%
2       2369aa, >CAD5122866.1... at 100.00%

I’m currently using awk to process the file but struggling to figure out how to extract these clusters properly. Can someone guide me in accomplishing this task efficiently using Bash commands?

I attempted to use the following awk command:

awk '/^>Cluster/ {cluster=$0; count=0; next} {count++} count > 1 {print cluster; print} count == 0 {print cluster}'

When applied to the provided data, it produced the following output:

>Cluster 202 2 2369aa, >CAD5122866.1... at 100.00%

This output is incomplete, as it should include all lines within Cluster 202.

>Solution :

Here is a simple working script (tested, and it does not use awk):

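#!/bin/bash

# Print every cluster that contains more than one data row.
process_cluster() {
    local input_file=$1
    local current_cluster=""
    local data_rows=0
    local line

    # IFS= and -r keep each line exactly as it appears in the file.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == ">Cluster "* ]]; then
            # A new header closes the previous cluster; print it if it qualifies.
            if (( data_rows > 1 )); then
                printf '%s\n' "$current_cluster"
            fi

            current_cluster=$line
            data_rows=0
        else
            data_rows=$((data_rows + 1))
            current_cluster+=$'\n'$line
        fi
    done < "$input_file"

    # The last cluster has no header after it, so check it separately.
    if (( data_rows > 1 )); then
        printf '%s\n' "$current_cluster"
    fi
}

process_cluster clusters.txt > output.txt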

Remember to save your input data in a file named clusters.txt, or change the filename in the script above.
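
Since the question asked about awk, the same buffering idea can also be sketched there. The attempt in the question only starts printing once count is already greater than 1, so the first row of the cluster has been passed over by then; buffering the whole cluster and deciding at the next header avoids that. This is a minimal sketch, not a tuned solution: the function name extract_multi_row_clusters and the awk variable names (block, rows, seen) are made up for illustration, and it assumes every cluster starts with a line beginning with >Cluster.

```shell
#!/bin/bash

# Hypothetical helper: print only the clusters in file "$1" that
# contain more than one data row.
extract_multi_row_clusters() {
    awk '
        # Print the buffered cluster if it holds more than one data row.
        function emit() { if (rows > 1) printf "%s", block }

        # A header line closes the previous cluster and starts a new buffer.
        /^>Cluster/ { emit(); block = $0 "\n"; rows = 0; seen = 1; next }

        # Any other line is a data row of the current cluster.
        # Lines before the first header are ignored (an assumption).
        seen { block = block $0 "\n"; rows++ }

        # The last cluster has no header after it, so emit it at the end.
        END { emit() }
    ' "$1"
}
```

It could be called the same way as the script above, for example: extract_multi_row_clusters clusters.txt > output.txt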