Efficient way to transform a dictionary into a dataframe in pandas

I have a dictionary such as:

  mydict = {'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]), 'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[])}

Does anyone know an efficient way to process this dictionary and build a dataframe from it with three columns:

  • Scaffolds: the keys of the dictionary
  • Seq_length: the length of the Seq string
  • GC%: the number of G and C letters in Seq divided by Seq_length (for example, len(Seq) of scaffold1 is 42 and it contains 18 G and C letters, so GC% = 18/42)
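
The arithmetic for scaffold1 can be checked with plain string counting. This is only a sketch of the definition above: the sequence is copied from the dictionary as a plain string, and the variable names are illustrative.

```python
# Sequence copied from the scaffold1 entry of the dictionary above.
seq = "AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT"

seq_length = len(seq)                        # number of bases
gc_count = seq.count("G") + seq.count("C")   # G and C letters only
gc_ratio = gc_count / seq_length             # fraction between 0 and 1

print(seq_length, gc_count, gc_ratio)
```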

I should then get:

Scaffolds  Seq_length  GC%
scaffold1  42          0.428
scaffold2  53          0.453

I’m looking for an efficient way to do this because my real dictionary is huge (1,046,544 keys).

Thanks a lot for your help.

>Solution :

You can build one row per dictionary item with a list comprehension and pass the rows to the DataFrame constructor:

import pandas as pd
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Note: newer Biopython releases replace GC with gc_fraction (returns a 0-1 fraction).
from Bio.SeqUtils import GC

mydict = {'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]), 'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[])}

# One row per scaffold; GC() returns a percentage (0-100), not a fraction.
df = pd.DataFrame([{'Scaffolds': k,
                    'Seq_length': len(s.seq),
                    'GC%': GC(s.seq)}
                   for k, s in mydict.items()])

Output (GC% is a percentage here; divide by 100 to get the fraction requested in the question):

   Scaffolds  Seq_length        GC%
0  scaffold1          42  42.857143
1  scaffold2          53  45.283019
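
The question asks for GC% as a fraction, while the output above is a percentage. A minimal sketch of the same computation with plain string counting follows. It assumes the sequences are available as plain strings (stand-ins for the SeqRecord objects), counts only unambiguous G and C letters, and uses a hypothetical helper named `gc_rows`.

```python
def gc_rows(sequences):
    """Yield (name, length, GC fraction) for each name -> sequence string pair."""
    for name, seq in sequences.items():
        seq = seq.upper()  # treat lowercase bases the same as uppercase
        length = len(seq)
        gc = seq.count("G") + seq.count("C")
        # Guard against empty sequences to avoid division by zero.
        yield name, length, (gc / length if length else 0.0)


# Stand-in data: plain strings instead of SeqRecord objects (assumption).
sequences = {
    "scaffold1": "AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT",
    "scaffold2": "GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT",
}

rows = list(gc_rows(sequences))
for row in rows:
    print(row)
```

With pandas available, `pd.DataFrame(rows, columns=['Scaffolds', 'Seq_length', 'GC%'])` would turn these tuples into the requested frame. For SeqRecord values, `str(record.seq)` would supply the string to count over.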