Append modified ID to FASTA file ID

I have a file that looks like this:

>1_CCACT_1/1
CCATCATTGGCGTCTACA
>2_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC

And I want to make it look like this:

>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC

Here the original header is kept as is, followed by a #, then the second number, i.e. the number that comes after the barcode in the ID:

5_ATATC_1 (the trailing 1)

That is followed by another #, and then by the barcode, the middle field of the ID:

5_ATATC_1 (the ATATC in the middle)

I’m using the last entry just as an example. I have some messy sed scripts that can produce the desired header (sort of) but I can’t figure out how to append them back to the original headers. You can’t assume that the second number will always be a 1, but you can assume that the order of the file won’t change. Open to solutions in any programming language, though I’ve only tried in bash.

Solution:

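One sed command using capture groups, where \1 is the leading index, \2 the barcode, \3 the number after the barcode, and \4 the rest of the header: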

$ sed -E 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat
>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
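
Since the question is open to other tools, the same transformation can also be sketched in awk by splitting each header on underscores instead of matching a regex. This is only an illustration under the assumption that every header has the form >index_barcode_number followed by optional text; the function name annotate_headers is made up for this example and is not part of the original answer.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustrative name): append "#<number>#<barcode>"
# to every FASTA header read from stdin or from the files given as arguments.
annotate_headers() {
  awk '
    /^>/ {
      # Split on "_": parts[1]=">1", parts[2]="CCACT", parts[3]="1/1"
      n = split($0, parts, "_")
      if (n >= 3) {
        num = parts[3]
        sub(/[^0-9].*$/, "", num)   # keep only the leading digits: "1/1" -> "1"
        print $0 "#" num "#" parts[2]
        next
      }
    }
    { print }                       # sequence lines and unmatched headers pass through
  ' "$@"
}

# Demo on two records from the question
printf '%s\n' '>1_CCACT_1/1' 'CCATCATTGGCGTCTACA' \
              '>5_ATATC_1/1' 'ATATGAAGGCTGTGAAGCAAAGCGTC' | annotate_headers
```

On the two sample records this prints the headers as >1_CCACT_1/1#1#CCACT and >5_ATATC_1/1#1#ATATC, with the sequence lines unchanged. Headers with fewer than three underscore-separated fields are left as they are rather than guessed at.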

Once satisfied with the result, add the -i flag to edit the input file in place (written as -i.bak, as here, it also keeps a backup of the original as fasta.dat.bak):

$ sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat
$ cat fasta.dat
>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
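
Because -i.bak leaves a backup behind, the in-place edit can be sanity-checked afterwards. The sketch below is one possible check, not part of the original answer: check_rewrite is a hypothetical helper, and it assumes the original headers never end in a #number#text suffix of their own. It only confirms that every header gained a suffix of that shape and that nothing else changed; it does not verify that the number and barcode in the suffix are the right ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustrative name): compare an edited FASTA file
# with the backup that "sed -i.bak" left behind.
check_rewrite() {
  local edited=$1 backup=$2
  # 1) Every header must end in "#<digits>#<barcode>".
  if grep '^>' "$edited" | grep -Evq '#[0-9]+#[^#]+$'; then
    return 1
  fi
  # 2) Stripping that suffix must reproduce the backup byte for byte.
  sed -E 's/^(>.*)#[0-9]+#[^#]+$/\1/' "$edited" | cmp -s - "$backup"
}

# Demo in a throwaway directory so no real data is touched
tmp="$(mktemp -d)"
printf '%s\n' '>5_ATATC_1/1' 'ATATGAAGGC' > "$tmp/fasta.dat"
sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' "$tmp/fasta.dat"
if check_rewrite "$tmp/fasta.dat" "$tmp/fasta.dat.bak"; then
  echo "headers extended, sequences untouched"
else
  echo "unexpected differences"
fi
rm -rf "$tmp"
```

If the check passes, the .bak file can be deleted; if it fails, the backup is still there to restore from.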