I have a python code and I’m working with a list of sequences
seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
nucleotides = ['G','A','C','T', 'U']
if any(x not in nucleotides for x in NTs):
print("ERROR: non-nucleotide characters present")
so this works so far and it does tell me if there are non-nucleotide characters present, but I also need the code to remove those non-nucleotide characters so I can further process the sequences. I was wondering how I would do this
>Solution :
Line 3 and 5 has two different way of doing it. Line 3: gather all the characters that are also present in nucleotides irrespective of their case and then join them together. Line 5: similar as before but only match if they are upper case.
In [1]: seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgcc
...: attatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCC
...: CCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCC
...: GCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCG
...: CTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
...:
...: NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
...:
...: nucleotides = ['G','A','C','T', 'U']
In [2]: # if lower case gatcu allowed
In [3]: [''.join(i for i in x if i.upper() in nucleotides) for x in NTs]
Out[3]:
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
'attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG',
'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']
In [4]: # if lower case is not allowed
In [5]: [''.join(i for i in x if i in nucleotides) for x in NTs]
Out[5]:
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
'ACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAGTCAGCCCCGCAGGGG',
'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']