Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to merge multiple columns of a dataframe using regex?

I have a df which as following

import pandas as pd

df = pd.DataFrame(
    {'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_C2_E2': [None, None, '3', None, None, None, None, None],
     'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
     'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
     'number_C3_E1': [None, None, '3', '4', None, None, None, None],
     'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

Here what I want to do is combine those columns and we have two rules:

  1. If a column removes _C{0~9} or _C{0~9}{0~9} or _C{0~9}{0~9}{0~9} is equal to another column, these two columns can be combined.

Let’s take number_C1_E1 number_C2_E2 number_C3_E1 as an example, here number_C1_E1 and number_C3_E1 can be combined because they are both number_E1 after removing _C{0~9}.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  1. The two combined columns should get rid of the None values.

The desired result is

  number_C1_1_E1 fruit_C11_1_E1 name_C111_1_E1 number_C2_1_E2 fruit_C22_1_E2 name_C222_1_E2
0              1          apple            tom           None           None           None
1              2         banana          jerry           None           None           None
2              3      blueberry        anthony              3      blueberry        anthony
3              4     strawberry          terry           None           None           None
4              5     watermelon           paul           None           None           None
5              6          peach         edward           None           None           None
6              7         orange         reggie           None           None           None
7              8          lemon       nicholas           None           None           None

Anyone has a good solution?

>Solution :

Use the same as your previous question, but also compute a renamer for your columns:

group = df.columns.str.replace(r'_C\d+', '', regex=True)

names = df.columns.to_series().groupby(group).first()

out = (df.groupby(group, axis=1, sort=False).first()
         .rename(columns=names)
       )

Output:

  number_C1_E1 fruit_C11_E1 name_C111_E1 number_C2_E2 fruit_C22_E2 name_C222_E2
0            1        apple          tom         None         None         None
1            2       banana        jerry         None         None         None
2            3    blueberry      anthony            3    blueberry      anthony
3            4   strawberry        terry         None         None         None
4            5   watermelon         paul         None         None         None
5            6        peach       edward         None         None         None
6            7       orange       reggie         None         None         None
7            8        lemon     nicholas         None         None         None
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading