I have a huge dataframe to work with, and I want to write only the exon rows to an output file.
Not all exons, though: only the exons that belong to an mRNA inside a gene block.
I need to write the script in Python.
INPUT:
NC_048323.1 Gnomon gene 25044 78977 . + . ID=gene-LOC117420859;Dbxref=GeneID:117420859;Name=LOC117420859;gbkey=Gene;gene=LOC117420859;gene_biotype=protein_coding
NC_048323.1 Gnomon mRNA 25044 78977 . + . ID=rna-XM_034926345.1;Parent=gene-LOC117420859;Dbxref=GeneID:117420859,Genbank:XM_034926345.1;Name=XM_034926345.1;gbkey=mRNA;gene=LOC117420859;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 93%25 coverage of the annotated genomic feature by RNAseq alignments;product=coiled-coil domain-containing protein 171-like;transcript_id=XM_034926345.1
NC_048323.1 Gnomon exon 25044 25136 . + . ID=exon-XM_034926345.1-1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing protein 171-like;transcript_id=XM_034926345.1
NC_048323.1 Gnomon exon 25929 26031 . + . ID=exon-XM_034926345.1-2;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing protein 171-like;transcript_id=XM_034926345.1
....
NC_048323.1 Gnomon CDS 76336 76521 . + 0 ID=cds-XP_034782236.1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XP_034782236.1;Name=XP_034782236.1;gbkey=CDS;gene=LOC117420859;product=coiled-coil domain-containing protein 171-like;protein_id=XP_034782236.1
NC_048323.1 Gnomon CDS 78960 78977 . + 0 ID=cds-XP_034782236.1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XP_034782236.1;Name=XP_034782236.1;gbkey=CDS;gene=LOC117420859;product=coiled-coil domain-containing protein 171-like;protein_id=XP_034782236.1
NC_048323.1 Gnomon gene 111664 172479 . - . ID=gene-LOC117421266;Dbxref=GeneID:117421266;Name=LOC117421266;gbkey=Gene;gene=LOC117421266;gene_biotype=protein_coding
NC_048323.1 Gnomon mRNA 111664 172479 . - . ID=rna-XM_034035429.2;Parent=gene-LOC117421266;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;Name=XM_034035429.2;gbkey=mRNA;gene=LOC117421266;model_evidence=Supporting evidence includes similarity to: 13 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 7 samples with support for all annotated introns;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 172022 172479 . - . ID=exon-XM_034035429.2-1;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 157760 157889 . - . ID=exon-XM_034035429.2-2;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 131303 131497 . - . ID=exon-XM_034035429.2-3;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 125107 125237 . - . ID=exon-XM_034035429.2-4;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 124379 124607 . - . ID=exon-XM_034035429.2-5;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 123710 123872 . - . ID=exon-XM_034035429.2-6;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon CDS 114179 114352 . - 0 ID=cds-XP_033891320.1;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XP_033891320.1;Name=XP_033891320.1;gbkey=CDS;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like isoform X2;protein_id=XP_033891320.1
NC_048323.1 Gnomon mRNA 111664 172479 . - . ID=rna-XM_034035428.2;Parent=gene-LOC117421266;Dbxref=GeneID:117421266,Genbank:XM_034035428.2;Name=XM_034035428.2;gbkey=mRNA;gene=LOC117421266;model_evidence=Supporting evidence includes similarity to: 13 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 10 samples with support for all annotated introns;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X1;transcript_id=XM_034035428.2
NC_048323.1 Gnomon exon 172022 172479 . - . ID=exon-XM_034035428.2-1;Parent=rna-XM_034035428.2;Dbxref=GeneID:117421266,Genbank:XM_034035428.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X1;transcript_id=XM_034035428.2
NC_048323.1 Gnomon exon 157760 157889 . - . ID=exon-XM_034035428.2-2;Parent=rna-XM_034035428.2;Dbxref=GeneID:117421266,Genbank:XM_034035428.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X1;transcript_id=XM_034035428.2
NC_048323.1 Gnomon gene 111664 172479 . - . ID=gene-LOC117421266;Dbxref=GeneID:117421266;Name=LOC117421266;gbkey=Gene;gene=LOC117421266;gene_biotype=protein_coding
OUTPUT: only the exon rows that belong to an mRNA.
How can I do that?
df.loc[df['column_name'] == 'exon'] does not work for me, because it returns every exon. My data frame also contains blocks like this, where the exons sit under a tRNA:
NC_048323.1 Gnomon gene 111664 172479 . - . ID=gene-LOC117421266;Dbxref=GeneID:117421266;Name=LOC117421266;gbkey=Gene;gene=LOC117421266;gene_biotype=protein_coding
NC_048323.1 Gnomon tRNA 111664 172479 . - . ID=rna-XM_034035429.2;Parent=gene-LOC117421266;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;Name=XM_034035429.2;gbkey=mRNA;gene=LOC117421266;model_evidence=Supporting evidence includes similarity to: 13 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 7 samples with support for all annotated introns;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 172022 172479 . - . ID=exon-XM_034035429.2-1;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
NC_048323.1 Gnomon exon 157760 157889 . - . ID=exon-XM_034035429.2-2;Parent=rna-XM_034035429.2;Dbxref=GeneID:117421266,Genbank:XM_034035429.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiation factor 2-alpha kinase 3-like%2C transcript variant X2;transcript_id=XM_034035429.2
I only need the exons that come after an mRNA row, not the ones under other features such as tRNA.
>Solution :
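IIUC, you can combine two boolean masks: the row is an exon, and the row directly above it is an mRNA. Passing fill_value=False to shift() keeps the shifted mask boolean instead of leaving a NaN in the first row:
out = df[(df['column_name'] == 'exon') & (df['column_name'] == 'mRNA').shift(fill_value=False)]
Note that this keeps only the exon immediately below each mRNA row; later exons of the same transcript are not selected.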
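Because shift() only looks one row up, the one-liner above picks just the first exon after each mRNA. If every exon of an mRNA is wanted, here is a minimal sketch of two possible ways to get them. It is not a definitive implementation: the column names, the shortened sample rows, and the IDs in them are made up for illustration, and the first variant assumes the file is sorted so that exon and CDS rows directly follow their transcript row.

```python
import io
import pandas as pd

# Hypothetical, shortened sample in the same 9-column layout as the question.
# The attribute strings and IDs (gene-A, rna-M1, ...) are invented for brevity.
sample = (
    "NC_048323.1\tGnomon\tgene\t25044\t78977\t.\t+\t.\tID=gene-A\n"
    "NC_048323.1\tGnomon\tmRNA\t25044\t78977\t.\t+\t.\tID=rna-M1;Parent=gene-A\n"
    "NC_048323.1\tGnomon\texon\t25044\t25136\t.\t+\t.\tID=exon-M1-1;Parent=rna-M1\n"
    "NC_048323.1\tGnomon\texon\t25929\t26031\t.\t+\t.\tID=exon-M1-2;Parent=rna-M1\n"
    "NC_048323.1\tGnomon\tCDS\t76336\t76521\t.\t+\t0\tID=cds-P1;Parent=rna-M1\n"
    "NC_048323.1\tGnomon\tgene\t111664\t172479\t.\t-\t.\tID=gene-B\n"
    "NC_048323.1\tGnomon\ttRNA\t111664\t172479\t.\t-\t.\tID=rna-T1;Parent=gene-B\n"
    "NC_048323.1\tGnomon\texon\t172022\t172479\t.\t-\t.\tID=exon-T1-1;Parent=rna-T1\n"
    "NC_048323.1\tGnomon\texon\t157760\t157889\t.\t-\t.\tID=exon-T1-2;Parent=rna-T1\n"
)

# Column names are an assumption; use whatever your data frame already has.
cols = ["seqid", "source", "type", "start", "end",
        "score", "strand", "phase", "attributes"]
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=cols)

# Variant 1: rely on row order.
# Treat every row that is not an exon/CDS as a block header (gene, mRNA, tRNA, ...)
# and carry the most recent header type down onto the rows below it.
is_child = df["type"].isin(["exon", "CDS"])
block_type = df["type"].where(~is_child).ffill()
out = df[(df["type"] == "exon") & (block_type == "mRNA")]

# Variant 2: rely on the Parent attribute instead of row order.
# Collect the IDs of all mRNA rows, then keep exons whose Parent is one of them.
mrna_ids = df.loc[df["type"] == "mRNA", "attributes"].str.extract(
    r"(?:^|;)ID=([^;]+)")[0]
parents = df["attributes"].str.extract(r"(?:^|;)Parent=([^;]+)")[0]
out_by_parent = df[(df["type"] == "exon") & parents.isin(mrna_ids)]

# Write the selected rows back out in the same tab-separated layout.
# "exons_of_mrna.tsv" is a placeholder file name.
# out.to_csv("exons_of_mrna.tsv", sep="\t", header=False, index=False)
```

The second variant is probably the safer choice on real annotation files, since it does not depend on the rows being in parent-then-children order or on exon and CDS being the only child feature types.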