Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to remove parts of a string for multiple columns in a pandas dataframe

I have a dataframe that with columns of organisms that contain lists of gene names that are associated with a particular orthogroup. I also have a list of candidate genes that I would like to interrogate this dataframe with.

What I am trying to do is iterate through the dataframe and only keep the part of the string that matches any of the candidate genes in the gene list. If there are no matches for any of the strings in a row, I’d like to just drop that row entirely.

What I have tried:
I tried using the solution linked here, which works perfectly, but only for one column – I want to do this for all columns.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I have also tried to use some combinations of np.where() without any yield.

Reproducible data:

Dataframe example:

{'Orthogroup': {0: 'OG0000000',
  1: 'OG0000001',
  2: 'OG0000002',
  3: 'OG0000003',
  4: 'OG0000004'},
 'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1, WP_113526628.1',
  1: 'WP_009560941.1, WP_009567366.1, WP_012536141.1, WP_012536296.1, WP_012536581.1, WP_012537464.1, WP_012537498.1, WP_041645259.1, WP_229129755.1',
  2: nan,
  3: 'WP_009562369.1',
  4: 'WP_009564938.1, WP_012535757.1, WP_012535999.1, WP_012536379.1, WP_012536404.1, WP_012536894.1, WP_012537015.1, WP_012537128.1, WP_232027572.1'},
 'acidithiobacillus_thiooxidans': {0: 'WP_010639275.1, WP_010639739.1, WP_010640967.1, WP_029316246.1, WP_157999017.1, WP_157999033.1, WP_157999054.1',
  1: 'WP_010637203.1, WP_010637257.1, WP_010639175.1, WP_010639301.1, WP_010639864.1, WP_010639966.1, WP_010639971.1, WP_010640895.1, WP_010641185.1, WP_050816103.1, WP_226825859.1, WP_226826127.1',
  2: 'WP_235179950.1',
  3: 'WP_010639365.1, WP_010642423.1',
  4: 'WP_010637024.1, WP_010637323.1, WP_010637554.1, WP_010637935.1, WP_010639011.1, WP_010639453.1, WP_010640567.1, WP_010642114.1, WP_024893638.1'},
 'burkholderia_pseudomallei': {0: 'WP_004185903.1, WP_004190283.1, WP_004528082.1, WP_004530921.1, WP_004534726.1, WP_011204972.1',
  1: 'WP_004187662.1, WP_004187694.1, WP_004188321.1, WP_004189039.1, WP_004192300.1, WP_004198029.1, WP_004526108.1, WP_004538115.1, WP_004551242.1, WP_004553906.1, WP_004554106.1, WP_009931394.1',
  2: 'WP_004187329.1, WP_004191606.1, WP_004195354.1, WP_004524029.1, WP_004524162.1, WP_004524990.1, WP_004525719.1, WP_004526156.1, WP_004529761.1, WP_004534748.1, WP_004552066.1, WP_004552880.1, WP_004553008.1, WP_004553113.1, WP_004554092.1, WP_014696855.1, WP_024431032.1, WP_024431563.1, WP_038742977.1, WP_038794545.1, WP_038803065.1',
  3: 'WP_004185837.1, WP_004187646.1, WP_004188318.1, WP_004189230.1, WP_004189405.1, WP_004192049.1, WP_004192063.1, WP_004194581.1, WP_004197652.1, WP_004198340.1, WP_004266649.1, WP_004521296.1, WP_004522591.1, WP_004522925.1, WP_004523005.1, WP_004524197.1, WP_004525408.1, WP_004525453.1, WP_004528137.1, WP_004529948.1, WP_004531197.1, WP_004536377.1, WP_004541514.1, WP_004543954.1, WP_004553290.1, WP_004553756.1, WP_004553993.1, WP_011204859.1',
  4: 'WP_004193481.1, WP_004195334.1, WP_004195802.1, WP_004196705.1, WP_004196791.1, WP_004200681.1, WP_004521379.1, WP_004524701.1, WP_004526731.1, WP_004529133.1, WP_004535483.1, WP_004537932.1, WP_004552076.1, WP_038729337.1'},
 'e_coli': {0: 'WP_000043761.1, WP_000859945.1, WP_000969032.1, WP_001266293.1, WP_001301264.1, WP_001310896.1, WP_001360132.1',
  1: 'WP_000125282.1, WP_000148503.1, WP_000417791.1, WP_000941041.1, WP_001188777.1, WP_001251544.1, WP_001295369.1, WP_001301108.1, WP_001301332.1',
  2: 'WP_000483239.1, WP_000919536.1, WP_001098559.1, WP_001297437.1',
  3: 'WP_000164036.1, WP_000357790.1, WP_000440317.1, WP_000648572.1, WP_000817708.1, WP_000935206.1, WP_001061575.1, WP_001242684.1',
  4: 'WP_000365791.1, WP_000423261.1, WP_000442949.1, WP_000548294.1, WP_000952503.1, WP_001011462.1, WP_001019525.1, WP_001109794.1, WP_001300638.1, WP_001350493.1'}}

Candidate genes example:

["WP_012536716.1","WP_004196791.1","WP_001297437.1"]

Ideal output:
Ideally, this should look like:

{'Orthogroup': {0: 'OG0000000',
  2: 'OG0000002',
  4: 'OG0000004'},
 'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1',
  2: nan,
  4: nan},
 'acidithiobacillus_thiooxidans': {0: nan,
  2: nan,
  4: nan},
 'burkholderia_pseudomallei': {0: nan,
  2: nan,
  4: 'WP_004196791.1'},
 'e_coli': {0: nan,
  2: 'WP_001297437.1',
  4: nan}}

Any help is greatly appreciated!

Thanks.

>Solution :

Use custom lambda function with DataFrame.applymap for filtering splitted values by candidates, if no match get empty strings, so last add DataFrame.replace for replace empty strings to missing values:

candidates = set(["WP_012536716.1","WP_004196791.1","WP_001297437.1"])


f = lambda x: ', '.join(y for y in x.split(', ') if y in candidates)
df = (df.set_index('Orthogroup')
        .fillna('')
        .applymap(f)
        .replace('', np.nan)
        .dropna(how='all')
        .reset_index())
print (df)
  Orthogroup acidithiobacillus_ferrooxidans  acidithiobacillus_thiooxidans  \
0  OG0000000                 WP_012536716.1                            NaN   
1  OG0000002                            NaN                            NaN   
2  OG0000004                            NaN                            NaN   

  burkholderia_pseudomallei          e_coli  
0                       NaN             NaN  
1                       NaN  WP_001297437.1  
2            WP_004196791.1             NaN  
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading