Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Removing characters from a list of strings if they don't follow a list

I have a python code and I’m working with a list of sequences

seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'

NTs = [seq0,seq1,seq2,seq3,seq4,seq5]

nucleotides = ['G','A','C','T', 'U']

if any(x not in nucleotides for x in NTs):
    print("ERROR: non-nucleotide characters present")

so this works so far and it does tell me if there are non-nucleotide characters present, but I also need the code to remove those non-nucleotide characters so I can further process the sequences. I was wondering how I would do this

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Line 3 and 5 has two different way of doing it. Line 3: gather all the characters that are also present in nucleotides irrespective of their case and then join them together. Line 5: similar as before but only match if they are upper case.

In [1]: seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgcc
   ...: attatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCC
   ...: CCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCC
   ...: GCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCG
   ...: CTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
   ...: 
   ...: NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
   ...: 
   ...: nucleotides = ['G','A','C','T', 'U']

In [2]: # if lower case gatcu allowed

In [3]: [''.join(i for i in x if i.upper() in nucleotides) for x in NTs]
Out[3]: 
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
 'attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
 'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
 'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG',
 'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
 'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']

In [4]: # if lower case is not allowed

In [5]: [''.join(i for i in x if i in nucleotides) for x in NTs]
Out[5]: 
['CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC',
 'ACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG',
 'TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC',
 'CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAGTCAGCCCCGCAGGGG',
 'ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA',
 'AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA']
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading