Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

efficient way of removing leading and trailing whitepaces around hashes in text strings and then splitting the text into multiple columns of pandas

Suppose the pandas dataframe contains the following:

 import pandas as pd
 df = pd.DataFrame({'text': ['ABC - XYZ- Some Text', 'DEF- XYZ -sometext', 'GHI -XYZ - sometext', 'JKL-XYZ- sometext', 'MNO1- XYZ- some text', 'MNO2 - XYZ - some text', 'MNO3 - XYZ-some text', 'MNO4-XYZ -some text', 'MNO5- XYZ-sometext -someother text', 'MNO6 -XYZ -sometext-someother text']})

All I want to do is to remove leading and training white spaces around ‘only’ the dashes and then split the data into new (multiple) columns of a new dataframe. So that the new dataframe should look like this:

 Col1    Col2    Col3           Col4                Col5    Col6 ....
 ABC     XYZ     Some Text     none                 none    none
 DEF     XYZ     sometext     none                  none    none
 GHI     XYZ     sometext     none                  none    none
 JKL     XYZ     sometext     none                  none    none
 .
 .
 MNO6    XYZ     sometext       someother text      none    none

Basically depending upon the highest number of dashes, there will be the columns in the new dataframe (e.g. if it is 6 dashes then there will be 6 columns) and where ever there are no values for a column after split, there will be none values.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Now, what I am trying to do something like this:

df1 = df['text'].str.split(' - ', n=2, expand=True)
df1.columns = ['Col_1_{}'.format(x+1) for x in df1.columns]

and then 

df2 = df1['Col_1_1'].str.split('- ', n=1, expand=True)
df2.columns = ['Col_1_1_{}'.format(x+1) for x in df2.columns]

and so on so that later I can merge all these columns and do renaming of these.

But this seems not to be efficient, sorry as I am not a pro python :'(

Is there an efficient way of achieving the result the way I want? any suggestions would be appreciated.

>Solution :

You can use

df['text'].str.split(r'\s*[—–-]\s*', expand=True)

The \s*[—–-]\s* matches any hyphen, en- or em-dash and any zero or more whitespace chars on its left and right.

The expand=True will make it output a dataframe.

See the Pandas test:

import pandas as pd
df = pd.DataFrame({'text': ['ABC - XYZ- Some Text', 'DEF- XYZ -sometext', 'GHI -XYZ - sometext', 'JKL-XYZ- sometext', 'MNO1- XYZ- some text', 'MNO2 - XYZ - some text', 'MNO3 - XYZ-some text', 'MNO4-XYZ -some text', 'MNO5- XYZ-sometext -someother text', 'MNO6 -XYZ -sometext-someother text']})
df1 = df['text'].str.split(r'\s*[—–-]\s*', expand=True)
df1.columns = ['Col{}'.format(x+1) for x in df1.columns]

>>> df1
   Col1 Col2       Col3            Col4
0   ABC  XYZ  Some Text            None
1   DEF  XYZ   sometext            None
2   GHI  XYZ   sometext            None
3   JKL  XYZ   sometext            None
4  MNO1  XYZ  some text            None
5  MNO2  XYZ  some text            None
6  MNO3  XYZ  some text            None
7  MNO4  XYZ  some text            None
8  MNO5  XYZ   sometext  someother text
9  MNO6  XYZ   sometext  someother text
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading