Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Merge consecutive rows in pandas and leave some rows untouched

I have tried looking at other merge rows in pandas solutions here and here and especially the solution here.

I want to combine the individual sentences scrapped from bullet points into one paragraph between the empty blank rows. BUT keep the blank rows as they are. I want to keep the first sentence’s paragraph id as the new id. (Paragraph ids are not necessarily continuous as there was some pre-cleaning done.)

df = pd.DataFrame(data = [[1, "A Heading"],
                          [2, "I need to be with above."],
                          [3, ""],
                          [8, "I stand alone."],
                          [9, ""]],columns=['para_id','text'])

df   
# The data
#    para_id                      text
# 0        1                 A Heading
# 1        2  I need to be with above.
# 2        3                          
# 3        8            I stand alone.
# 4        9                         

My required output is:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

#    para_id                                    text
# 0        1  A Heading. I need to be with above
# 1        3  
# 2        8  I stand alone.
# 3        9   

With the help of the answers I am close and just need a little more guidance.

Attempted solution

df['t'] =  df['text'].str.len().values
s = df['t'].eq(0).cumsum()
out = df.groupby(s).agg({'para_id': 'first',
                         'text': lambda x: '. '.join(x),
                         't': 'last'})
out.drop('t', inplace=True, axis=1)
   
out
# Incorrect output
#    para_id                                 text
# t                                              
# 0        1  A Heading. I need to be with above.
# 1        3                     . I stand alone.
# 2        9

I almost have it working but my blank lines are getting glued to some text. And so I am missing my first blank row.

  1. Please can someone help me formulate s better to get the desired output.

  2. I also need the joining . to only occur if there is not a fullstop at the end of last sentence. (This is not critical. I guess I could first search for missing fullstops at end of non-empty text sentences, and then join sentences but I was wondering if there was a mutate if kind of structure in pandas.)

>Solution :

You’re almost there, just groupby on both the non-zero lengths and cumsum:

s = df['text'].eq('')

(df.groupby([s.cumsum(),s], sort=False)
   .agg({'para_id':'first', 'text': '. '.join})
   .reset_index(drop=True)
)

Output:

   para_id                                 text
0        1  A Heading. I need to be with above.
1        3                                     
2        8                       I stand alone.
3        9                                     
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading