How to remove parts of a string for multiple columns in a pandas dataframe

April 21, 2022

I have a dataframe that with columns of organisms that contain lists of gene names that are associated with a particular orthogroup. I also have a list of candidate genes that I would like to interrogate this dataframe with.

What I am trying to do is iterate through the dataframe and only keep the part of the string that matches any of the candidate genes in the gene list. If there are no matches for any of the strings in a row, I’d like to just drop that row entirely.

What I have tried:
I tried using the solution linked here, which works perfectly, but only for one column – I want to do this for all columns.

I have also tried to use some combinations of np.where() without any yield.

Reproducible data:

Dataframe example:

{'Orthogroup': {0: 'OG0000000',
  1: 'OG0000001',
  2: 'OG0000002',
  3: 'OG0000003',
  4: 'OG0000004'},
 'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1, WP_113526628.1',
  1: 'WP_009560941.1, WP_009567366.1, WP_012536141.1, WP_012536296.1, WP_012536581.1, WP_012537464.1, WP_012537498.1, WP_041645259.1, WP_229129755.1',
  2: nan,
  3: 'WP_009562369.1',
  4: 'WP_009564938.1, WP_012535757.1, WP_012535999.1, WP_012536379.1, WP_012536404.1, WP_012536894.1, WP_012537015.1, WP_012537128.1, WP_232027572.1'},
 'acidithiobacillus_thiooxidans': {0: 'WP_010639275.1, WP_010639739.1, WP_010640967.1, WP_029316246.1, WP_157999017.1, WP_157999033.1, WP_157999054.1',
  1: 'WP_010637203.1, WP_010637257.1, WP_010639175.1, WP_010639301.1, WP_010639864.1, WP_010639966.1, WP_010639971.1, WP_010640895.1, WP_010641185.1, WP_050816103.1, WP_226825859.1, WP_226826127.1',
  2: 'WP_235179950.1',
  3: 'WP_010639365.1, WP_010642423.1',
  4: 'WP_010637024.1, WP_010637323.1, WP_010637554.1, WP_010637935.1, WP_010639011.1, WP_010639453.1, WP_010640567.1, WP_010642114.1, WP_024893638.1'},
 'burkholderia_pseudomallei': {0: 'WP_004185903.1, WP_004190283.1, WP_004528082.1, WP_004530921.1, WP_004534726.1, WP_011204972.1',
  1: 'WP_004187662.1, WP_004187694.1, WP_004188321.1, WP_004189039.1, WP_004192300.1, WP_004198029.1, WP_004526108.1, WP_004538115.1, WP_004551242.1, WP_004553906.1, WP_004554106.1, WP_009931394.1',
  2: 'WP_004187329.1, WP_004191606.1, WP_004195354.1, WP_004524029.1, WP_004524162.1, WP_004524990.1, WP_004525719.1, WP_004526156.1, WP_004529761.1, WP_004534748.1, WP_004552066.1, WP_004552880.1, WP_004553008.1, WP_004553113.1, WP_004554092.1, WP_014696855.1, WP_024431032.1, WP_024431563.1, WP_038742977.1, WP_038794545.1, WP_038803065.1',
  3: 'WP_004185837.1, WP_004187646.1, WP_004188318.1, WP_004189230.1, WP_004189405.1, WP_004192049.1, WP_004192063.1, WP_004194581.1, WP_004197652.1, WP_004198340.1, WP_004266649.1, WP_004521296.1, WP_004522591.1, WP_004522925.1, WP_004523005.1, WP_004524197.1, WP_004525408.1, WP_004525453.1, WP_004528137.1, WP_004529948.1, WP_004531197.1, WP_004536377.1, WP_004541514.1, WP_004543954.1, WP_004553290.1, WP_004553756.1, WP_004553993.1, WP_011204859.1',
  4: 'WP_004193481.1, WP_004195334.1, WP_004195802.1, WP_004196705.1, WP_004196791.1, WP_004200681.1, WP_004521379.1, WP_004524701.1, WP_004526731.1, WP_004529133.1, WP_004535483.1, WP_004537932.1, WP_004552076.1, WP_038729337.1'},
 'e_coli': {0: 'WP_000043761.1, WP_000859945.1, WP_000969032.1, WP_001266293.1, WP_001301264.1, WP_001310896.1, WP_001360132.1',
  1: 'WP_000125282.1, WP_000148503.1, WP_000417791.1, WP_000941041.1, WP_001188777.1, WP_001251544.1, WP_001295369.1, WP_001301108.1, WP_001301332.1',
  2: 'WP_000483239.1, WP_000919536.1, WP_001098559.1, WP_001297437.1',
  3: 'WP_000164036.1, WP_000357790.1, WP_000440317.1, WP_000648572.1, WP_000817708.1, WP_000935206.1, WP_001061575.1, WP_001242684.1',
  4: 'WP_000365791.1, WP_000423261.1, WP_000442949.1, WP_000548294.1, WP_000952503.1, WP_001011462.1, WP_001019525.1, WP_001109794.1, WP_001300638.1, WP_001350493.1'}}

Candidate genes example:

["WP_012536716.1","WP_004196791.1","WP_001297437.1"]

Ideal output:
Ideally, this should look like:

{'Orthogroup': {0: 'OG0000000',
  2: 'OG0000002',
  4: 'OG0000004'},
 'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1',
  2: nan,
  4: nan},
 'acidithiobacillus_thiooxidans': {0: nan,
  2: nan,
  4: nan},
 'burkholderia_pseudomallei': {0: nan,
  2: nan,
  4: 'WP_004196791.1'},
 'e_coli': {0: nan,
  2: 'WP_001297437.1',
  4: nan}}

Any help is greatly appreciated!

Thanks.

>Solution :

Use custom lambda function with DataFrame.applymap for filtering splitted values by candidates, if no match get empty strings, so last add DataFrame.replace for replace empty strings to missing values:

candidates = set(["WP_012536716.1","WP_004196791.1","WP_001297437.1"])


f = lambda x: ', '.join(y for y in x.split(', ') if y in candidates)
df = (df.set_index('Orthogroup')
        .fillna('')
        .applymap(f)
        .replace('', np.nan)
        .dropna(how='all')
        .reset_index())
print (df)
  Orthogroup acidithiobacillus_ferrooxidans  acidithiobacillus_thiooxidans  \
0  OG0000000                 WP_012536716.1                            NaN   
1  OG0000002                            NaN                            NaN   
2  OG0000004                            NaN                            NaN   

  burkholderia_pseudomallei          e_coli  
0                       NaN             NaN  
1                       NaN  WP_001297437.1  
2            WP_004196791.1             NaN