Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

merging data frames by conditions and saving common values belonging to the same column efficiently

I have two data frames that I want to merge by the following condition

  1. A common value between one column between the 2 data frame
    .query("CHROM == Chromosome_hg38")
  2. Region of one value between the regions of the other value
    .query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')

These are my data frames:

data = [['chr1', 1,10,'Gene1','ID1'], ['chr2',15,20,'Gene2','ID2'],['chr2',21,40,'Gene3','ID3']]
LOUEF = pd.DataFrame(data, columns = ['Chromosome_hg38', 'Gene_start_hg38', 'Gene_stop_hg38','Gene_name','geneID'])
LOUEF

Chromosome_hg38 Gene_start_hg38 Gene_stop_hg38  Gene_name   geneID
0   chr1        1               10              Gene1        ID1
1   chr2        15              20              Gene2        ID2
2   chr2        21              40              Gene3        ID3

data2 = [['chr1', 3,11,'location1'], ['chr1',11,17,'location2'],['chr2',20,30,'location3'],['chr2',15,17,'location3'],['chr5',1,19,'location4']]
cnvs = pd.DataFrame(data2, columns = ['CHROM', 'START', 'END','locations'])


CHROM   START   END locations
0   chr1    3   11  location1
1   chr1    11  17  location2
2   chr2    20  30  location3
3   chr2    15  17  location3
4   chr5    1   19  location4

The code I am using

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

cnvs['Gene_name'] = (
    cnvs.loc[cnvs['CHROM'].isin(LOUEF['Chromosome_hg38'])]   
       .reset_index().merge(LOUEF, how='cross', suffixes=('', '_'))
       .query("CHROM == Chromosome_hg38")
       .query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')
       .groupby('index')['Gene_name'].apply(list).str.join(';')
)

With this code I got this

CHROM   START   END locations   Gene_name
0   chr1    3   11  location1   Gene1
1   chr1    11  17  location2   NaN
2   chr2    20  30  location3   Gene2;Gene2
3   chr2    15  17  location3   Gene2
4   chr5    1   19  location4   NaN

As I have done with gene_name, how could I do the same with geneID without the need of repeat the same code again?

CHROM   START   END locations   Gene_name    geneID
0   chr1    3   11  location1   Gene1        ID1
1   chr1    11  17  location2   NaN          NaN
2   chr2    20  30  location3   Gene2;Gene2  ID2;ID2
3   chr2    15  17  location3   Gene2        ID2
4   chr5    1   19  location4   NaN          NaN

>Solution :

This way?

for key in ("Gene_name", "geneID"):
    cnvs[key] = (
        cnvs.loc[cnvs['CHROM'].isin(LOUEF['Chromosome_hg38'])]   
           .reset_index().merge(LOUEF, how='cross', suffixes=('', '_'))
           .query("CHROM == Chromosome_hg38")
           .query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')
           .groupby('index')[key].apply(list).str.join(';')
    )
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading