Remove specific line breaks in Python

I have a long FASTA file and I need to reformat its lines. I have tried many things, but since I’m not very familiar with Python I couldn’t solve it. The file looks like this:

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I want them to look like:

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I’ve tried this:

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)

But the result does not include the ">" lines, and it also merges all the other lines into a single line. I hope you can help me with this. Thank you.

Solution:

A common approach is to strip the newline from each sequence line, and then add a single newline back when you see the next record's header.

# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()

A common beginner bug is forgetting to add the final newline after falling off the end of the loop; the last "if written_lines" check above takes care of that.

This simply prints one line at a time and doesn’t return anything. A better design would probably be to collect and yield one FASTA record (header plus sequence) at a time, perhaps as an object, and let the caller decide what to do with it. At that point, though, you probably want to use an existing library which already does that; Biopython seems to be the go-to solution for bioinformatics.
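
The "collect and yield one record at a time" idea could look roughly like the sketch below. It is only an illustration, not a replacement for a real parser: the function name read_fasta is made up for this example, records are returned as plain (header, sequence) tuples rather than objects, and the sketch assumes that any sequence lines appearing before the first ">" header can be ignored.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    read_fasta is a hypothetical helper written for this answer.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line and header is not None:
            # Assumption: blank lines and lines before the first header are skipped.
            chunks.append(line)
    # Emit the last record after falling off the end of the loop.
    if header is not None:
        yield header, "".join(chunks)


# Small in-memory demo; an open file works the same way,
# because iterating over a file object yields its lines.
sample = [">seq1\n", "AAAA\n", "CCCC\n", ">seq2\n", "GGGG\n"]
for header, sequence in read_fasta(sample):
    print(header)
    print(sequence)
```

To process the file from the question, pass the open file instead of the sample list, for example inside "with open("file.fasta") as a_file:". The caller then decides whether to print each record, write it to another file, or filter it.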
