remove specific endline breaks in Python

December 20, 2021

I have a long fasta file and I need to format the lines. I tried many things but since I’m not much familiar python I couldn’t solve exactly.

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I want them to look like:

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I’ve tried this:

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)

But the result not showing ">" lines and also merging all other lines. Hope you can help me about it. Thank you

>Solution :

A common arrangement is to remove the newline, and then add it back when you see the next record.

# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()

A common beginner bug is forgetting to add the final newline after falling off the end of the loop.

This simply prints one line at a time and doesn’t return anything. Probably a better design would be to collect and yield one FASTA record (header + sequence) at a time, probably as an object. and have the caller decide what to do with it; but then, you probably want to use an existing library which does that – BioPython seems to be the go-to solution for bioinformatics.