Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Count reads in fasta file (loop)

I would like to count the number of reads in a fasta file using awk. A read in a fasta file starts with "> NAME" followed with a line and "DNA code". In my case, fasta files contain at least 1 read (thus not empty). I also have a lot of fasta files and I would like to loop this. Therefore I would also like to copy the name of the file into the output and paste the number of reads next to the name of the file in the output file.

Fasta file example (File 1.fasta)

>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg

Fasta file example (File 2.fasta)

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg
>sequence C
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg

I’ve tried multiple scripts already

Script 1

#!/bin/bash
for file1 in ~/Test/*.fasta
do
outfile1=${file1}_readcount.txt
awk -F ' ' -v out1=$outfile1 '{
if(NR==1) {lines[0]=lines[0] OFS FILENAME  > out1;
}
if(FNR==NR) {grep -c "^>" $file1 > out1;
}
}'  $file1
done

It didn’t give an error but also no output

Script 2

awk '
BEGIN   { OFS="\t" } #output field delimiter is tab

FNR==1  { lines[0]=lines[0] OFS FILENAME } #append filename to header record

FNR==NR {grep -c "^>" FILENAME } # counts number of ">" at the beginning of lines

END     { for (i=0;i<=FNR;i++) #loop through the line numbers
              print lines[i]
        } #printing each line 
' *fasta > countreads.txt

Here I only got the header in the file and thousands of empty rows.

Expected ouput that I would like to get

File1   2
File2   3

>Solution :

If you mean to count each /^>/ as a sequence and then summarize sequence counts by filenames, you can do this in awk:

awk 'FNR==1{if (name) print name, cnt; name=FILENAME; cnt=0}
/^>/{cnt++}
END{print name, cnt}' *.fasta

This also works but the file names may be in a different order than awk read them:

awk '/^>/{file_cnt[FILENAME]++}
END{for (fn in file_cnt) print fn, file_cnt[fn]}' *.fasta

You can also just use grep directly to count the matches:

grep -c "^>" *.fasta # that will be ':' delimited. 
                     # You can use sed, tr, awk to 
                     # change the ':' to a '\t` or whatever.

In summary, use awk if you want to customize what is counted or printed. Use grep with a glob if you just want to count the total regex matches and sumarize by file name.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading