Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas – Remove part of string in column that is already in another column

I have this dataframe :

dfA = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
         })
dfA

And I want to get this dataframe :

dfB = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu'],
            'C': ['def', 'jkl', 'pqr', 'vwx'],
         })
dfB

The column ‘C’ must contains the substrings of the column ‘B’ that is not in the strings in column ‘A’.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I tried to copy column ‘B’ into ‘C’ and then use df.replace() as shown below, but it doesn’t work :

dfA = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
         })
dfA.loc[:,'C'] = dfA['B']

dfA['C'].replace(dfA['B'], '', regex=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16112\1271772080.py in <cell line: 7>()
      5 dfA.loc[:,'C'] = dfA['B']
      6 
----> 7 dfA['C'].replace(dfA['B'], '', regex=True)

~\Anaconda3\envs\py310\lib\site-packages\pandas\core\series.py in replace(self, to_replace, value, inplace, limit, regex, method)
   4958         method: str | lib.NoDefault = lib.no_default,
   4959     ):
-> 4960         return super().replace(
   4961             to_replace=to_replace,
   4962             value=value,

~\Anaconda3\envs\py310\lib\site-packages\pandas\core\generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
   6677                     # Operate column-wise
   6678                     if self.ndim == 1:
-> 6679                         raise ValueError(
   6680                             "Series.replace cannot use dict-like to_replace "
   6681                             "and non-None value"

ValueError: Series.replace cannot use dict-like to_replace and non-None value

Also, the strings in ‘A’ must be the pre/sufix of the ‘B’ column, so the ‘C’ column will be the su/prefix of ‘B’ strings. So, 'B' = 'A'+'C' | 'C'+'A', I also tried to use - as a "decatenation" operator, but it doesn’t work.

Do you have any idea of how I should do this instead ?

>Solution :

You need to loop here.

You can use re.sub:

import re
dfA['C'] = [re.sub(a, '', b) for a,b in zip(dfA['A'], dfA['B'])]

or str.replace:

dfA['C'] = [b.replace(a, '') for a,b in zip(dfA['A'], dfA['B'])]

output:

     A       B    C
0  abc  abcdef  def
1  ghi  jklghi  jkl
2  mno  mnopqr  pqr
3  stu  vwxstu  vwx
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading