Capturing the last group: everything from the first non-digit character onward

I am trying to capture everything after and including the first non-digit character in the following text:

1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                  00814766
                                                            P O BOX 883                       FAX 909 386-1288
                                                            COLTON CA  92324

For example, I want the regex to capture these groups: 1, 1,486,399.87, 5, and ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324.

The code I have right now is:

# imports (only re is used by this snippet)
import re

# text
t = """    1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                  00814766
                                                            P O BOX 883                       FAX 909 386-1288
                                                            COLTON CA  92324"""

tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S*)", t)

ttgroup = len(tt.groups())

print(tt[ttgroup])

It returns only ORTIZ. I suppose the (\S*) group at the end of the pattern needs to be improved. Is there a way to capture the entire ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324 in the last group? Thank you so much!

Solution:

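I’d replace the last group, currently (\S*), with (\S.*), since you want to capture the rest of the string. (\S*) matches only a single run of non-whitespace characters, which is why it stops after ORTIZ. Also pass the re.DOTALL flag: the text spans several lines, and without the flag . does not match newline characters, so the match would stop at the end of the first line: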

tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S.*)", t, re.DOTALL)
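With re.DOTALL the last group still contains the original newlines and column padding, whereas the output asked for in the question has single spaces. The following is a minimal sketch of one way to get there, not the only one. It assumes the inner groups can be made non-capturing (so only four groups remain) and that collapsing whitespace with split/join is acceptable; the variable names first, amount, third and rest are hypothetical and chosen only for illustration.

```python
import re

# Sample text copied from the question (a multiline, column-aligned record).
t = """    1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                  00814766
                                                            P O BOX 883                       FAX 909 386-1288
                                                            COLTON CA  92324"""

# Same pattern as above, except that the two inner groups are written as
# non-capturing (?:...) groups. This is an assumption made for readability:
# it leaves exactly four capturing groups, so the last one is easy to address.
pattern = re.compile(
    r"(\d+)\s+(\$?[+-]?\d{1,3}(?:,\d{3})*%?(?:\.\d+)?)\s+(\d+)\s+(\S.*)",
    re.DOTALL,  # let "." match newlines so the last group runs to the end
)

m = pattern.search(t)
if m is None:
    raise ValueError("text does not match the expected layout")

# Hypothetical names for the four captured fields.
first, amount, third, rest = m.groups()

# Collapse every run of spaces and newlines into a single space.
rest = " ".join(rest.split())

print(first, amount, third)
print(rest)
```

If the original line breaks matter, skip the split/join step and use the raw group instead.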