Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas groupby: add suffix to elements which are identical across groups

I have a dataframe like this:

                                peakID cytoband      start        end    length  10.388_116  10.193_156  10.401_184  10.214_385
0  Amp_2q37.3_chr2:237990001-242193529   2q37.3  237990001  242193529   4203528           1           0           0           0
1  Del_2q37.3_chr2:226990001-242193529   2q37.3  226990001  242193529  15203528          -1           0           0           0

Notice how peakID is different, but cytoband is not. I need to unpivot this table (using a function from pyjanitor) without keeping peakID. Currently I do:

import pandas as pd
import pyjanitor
from natsort import natsort_keygen

table = (
        table
        .drop(columns="peakID")
       .pivot_longer(index=["cytoband", "start", "end", "length"],
                     names_to="sample", values_to="state")
            .sort_values(["cytoband", "sample"], key=natsort_keygen())
            .remove_columns(["length", "start", "end"])
            .set_index("cytoband")
        )

And the end result looks like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

table.loc["2q37.3", :]
Out[36]: 
              sample  state
cytoband                   
2q37.3    10.193_156      0
2q37.3    10.193_156      0
2q37.3    10.214_385      0
2q37.3    10.214_385      0
2q37.3    10.388_116      1
2q37.3    10.388_116     -1
2q37.3    10.401_184      0
2q37.3    10.401_184      0

The problem lies in the fact that if cytoband is duplicated in different peakIDs, the resulting table will have the two records (state) for each sample mixed up (as they don’t have the relevant unique ID anymore).

The idea would be to suffix the duplicate records across distinct peakIDs (e.g. "2q37.3_A", "2q37.3_B", but I’m not sure on how to do that with groupby or pandas in general as I need information from more than one group.

What’s the cleanest solution to do this? Existing solutions (or this one) don’t really fit.

>Solution :

Use a groupby.cumcount and a mapper:

from string import ascii_uppercase
letters = dict(enumerate(ascii_uppercase))

table['sample'] += '_' + table.groupby(['cytoband', 'sample']).cumcount().map(letters)

Output:

                sample  state
cytoband                     
2q37.3    10.193_156_A      0
2q37.3    10.193_156_B      0
2q37.3    10.214_385_A      0
2q37.3    10.214_385_B      0
2q37.3    10.388_116_A      1
2q37.3    10.388_116_B     -1
2q37.3    10.401_184_A      0
2q37.3    10.401_184_B      0

Or, keep peakID:

out = (
        table
        .pivot_longer(index=["peakID", "cytoband", "start", "end", "length"],
                     names_to="sample", values_to="state")
            .sort_values(["cytoband", "sample"], key=natsort_keygen())
            .remove_columns(["length", "start", "end"])
            .set_index(["cytoband", "peakID"])
        )

Output:

                                                  sample  state
cytoband peakID                                                
2q37.3   Amp_2q37.3_chr2:237990001-242193529  10.193_156      0
         Del_2q37.3_chr2:226990001-242193529  10.193_156      0
         Amp_2q37.3_chr2:237990001-242193529  10.214_385      0
         Del_2q37.3_chr2:226990001-242193529  10.214_385      0
         Amp_2q37.3_chr2:237990001-242193529  10.388_116      1
         Del_2q37.3_chr2:226990001-242193529  10.388_116     -1
         Amp_2q37.3_chr2:237990001-242193529  10.401_184      0
         Del_2q37.3_chr2:226990001-242193529  10.401_184      0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading