Regex to split on new lines with a pattern

I am trying to split a string into multiple strings (like observations).

For example, a sample text with 3 "bidder id" observations is:

       BID RANK       BID TOTAL   BIDDER ID         BIDDER INFORMATION  (NAME/ADDRESS/LOCATION)
       --------      -----------  ---------         -------------------------------------------------
           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489

The ultimate goal is to create a dataset that mimics this text document. The first step is to split this big string into multiple small strings. For example, the three small strings would look as follows:

Split string 1

           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

Split string 2

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

Split string 3

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489

I started with the split pattern [\r\n]+\s+, but it splits at every newline, not only at lines that contain no other characters or text.

Code:

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

# a multi-line string literal needs triple quotes; the trailing backslash skips the first newline
txt = """\
               1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                              00814766
                                                        P O BOX 883                       FAX 909 386-1288
                                                        COLTON CA  92324

               2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                              00688659
                                                        2230 LEMON AVENUE                 FAX 562 591-7485
                                                        LONG BEACH CA  90806

               3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                              00116307
                                                        P O BOX 1489                      FAX 818 767-3169
                                                        SUN VALLEY CA  91353-1489"""

p = re.split("[\r\n]+",txt)

But it splits text by all the possible new lines. Is there a way to separate text by a newline with no other character in it? Thank you so much!!

P.S. if you think I’m doing something wildly wrong or if there’s a much simpler way to create a dataset – please let me know. Any help is appreciated. Thanks!!

Solution:

You can try re.findall with this pattern:

(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)
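To make the pieces of that pattern easier to read, here is the same expression spelled out with re.VERBOSE and one comment per part. Treat it as a sketch: the name RECORD and the shortened sample text are made up for illustration, and {0,20} is only the explicit spelling of {,20}.

```python
import re

# Same idea as the one-line pattern, with flags passed explicitly instead of (?ms).
RECORD = re.compile(
    r"""
    ^\s{0,20}\d           # start of a line, at most 20 whitespace chars, then a digit (the bid rank)
    .*?                   # lazily take everything that follows, newlines included (DOTALL)
    (?=^\s{0,20}\d|\Z)    # stop just before the next record start, or at the end of the text
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Shortened, hypothetical sample in the same layout as the question's text.
sample = (
    "   BID RANK   BID TOTAL   BIDDER ID\n"
    "       1      1,486,399.87    5     ORTIZ ASPHALT PAVING INC\n"
    "                                    P O BOX 883\n"
    "\n"
    "       2      1,534,243.00    3     EXCEL PAVING COMPANY\n"
    "                                    2230 LEMON AVENUE\n"
)

records = RECORD.findall(sample)
print(len(records))
```

The 20-character limit is what keeps deeply indented lines such as the street address (which may also start with a digit) from being mistaken for the start of a new record. Because \s also matches a newline, a match can begin on the blank line in front of a record, so a record may carry a leading newline.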

import re

text = """\
       BID RANK       BID TOTAL   BIDDER ID         BIDDER INFORMATION  (NAME/ADDRESS/LOCATION)
       --------      -----------  ---------         -------------------------------------------------
           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489"""

groups = re.findall(r"(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)", text)

for group in groups:
    print(group)
    print('-' * 80)

Prints:

           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

--------------------------------------------------------------------------------

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

--------------------------------------------------------------------------------

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489
--------------------------------------------------------------------------------
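If the text has no header rows (as in the question's own txt variable), splitting directly on blank lines may be simpler. The sketch below assumes records are separated by lines that are empty or contain only whitespace. The helper split_records, the HEAD pattern and the column names are hypothetical, and the sample is shortened from the question's data.

```python
import re

# Shortened sample without the header rows.
txt = """\
       1      1,486,399.87    5     ORTIZ ASPHALT PAVING INC     909 386-1200
                                                                 00814766
                                    P O BOX 883
                                    COLTON CA  92324

       2      1,534,243.00    3     EXCEL PAVING COMPANY         562 599-5841
                                                                 00688659
                                    2230 LEMON AVENUE
                                    LONG BEACH CA  90806
"""


def split_records(text):
    """Split on lines that are empty or hold only whitespace."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk for chunk in chunks if chunk.strip()]


# First line of a record: rank, bid total, bidder id, then the name,
# which ends where a run of two or more whitespace characters begins.
HEAD = re.compile(r"\s*(\d+)\s+([\d,]+\.\d{2})\s+(\d+)\s+(.+?)\s{2,}")

records = split_records(txt)
rows = []
for record in records:
    match = HEAD.match(record)
    if match:
        rank, total, bidder_id, name = match.groups()
        rows.append(
            {
                "bid_rank": int(rank),
                "bid_total": float(total.replace(",", "")),
                "bidder_id": int(bidder_id),
                "name": name,
            }
        )

for row in rows:
    print(row)
```

Since rows is a list of dictionaries, it can be passed to pandas.DataFrame(rows) to get the dataset the question is after. If the header rows are present, they would end up inside the first chunk and would need to be dropped first.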