Python parse inconsistent free form street names using regex

For data in the following structure I want to obtain the parsed street name details:

# streetname 1() refers to house number 1 with an empty () additional qualifier 

keyword_token: street name 4()
keyword_token: street-name 14()

keyword_token: streetname 123()keyword_token: streetname 123()
# why is it logged one message per line, but we get the address logged twice - sometimes??

keyword_token: streetname 9(7)keyword_token: streetname 9(7)
keyword_token: streetname 27()\r\n a lot more text and log messages in the free form text log - one messageper line  \n
    
keyword_token: street-name 1-23(BLOCK D HAUS 6)keyword_token: street-name 1-23(BLOCK H HAUS 2)keyword_token: street-name 1-23(BLOCK G HAUS 3)',
        
        

The ideal result is 3 fields for each record:

  • street name
  • house number
  • additional qualifier (empty/NaN if it is empty or missing)

So far I have experimented with the regex keyword_token(.*), but this returns the whole rest of the line after the keyword token.
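The behavior described here can be reproduced with a short sketch. The sample line is taken from the data above; the variable names are only illustrative. Because `.*` is greedy and nothing follows it in the pattern, the capture runs to the end of the line, including the house number, the qualifier, and the duplicated record.

```python
import re

# One of the sample lines from the question, where the record is logged twice.
line = "keyword_token: streetname 9(7)keyword_token: streetname 9(7)"

# Greedy .* consumes everything up to the end of the line, so the single
# capture group holds the rest of the line rather than separate fields.
m = re.search(r"keyword_token(.*)", line)
captured = m.group(1)
print(captured)
```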

Complications:

  • I am only interested in the first match (not multiple matches), i.e. only the first occurrence of keyword_token:
  • the street name itself can be quite inconsistent (spaces, -); it starts after the keyword_token: and runs until the (

Edit: an example on regex101 can be found here: https://regex101.com/r/ueEfNU/1

Edit 2: non-numeric house numbers also need to be supported, for example:

keyword_token: street_name 32a()

>Solution :

You can use either of these patterns:

keyword_token:\s*(.*?)\s*(\d+[a-zA-Z]*)\(([^()]*)\)
keyword_token:\s*(.*?)\s*(\d[a-zA-Z\d]*)\(([^()]*)\)

See the regex demo. Details:

  • keyword_token: – a fixed string
  • \s* – zero or more whitespaces
  • (.*?) – Group 1: any zero or more chars other than line break chars, as few as possible (due to *? lazy quantifier)
  • \s* – zero or more whitespaces
  • (\d+[a-zA-Z]*) – Group 2: one or more digits and then zero or more letters (in the second pattern, (\d[a-zA-Z\d]*) matches one digit and then zero or more letters or digits)
  • \( – a ( char
  • ([^()]*) – Group 3: zero or more chars other than ( and )
  • \) – a ) char.
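
As a minimal sketch of how the pattern might be applied in Python, the function below uses `re.search`, which returns only the first match on a line, and maps an empty `()` qualifier to `None`. The names `parse_first_address`, `ANSWER_PATTERN`, and `RANGE_PATTERN` are illustrative and not part of the answer. Note that with the patterns as given, a hyphenated house number such as `1-23` is split, leaving `1-` at the end of the street name; `RANGE_PATTERN` is a hypothetical variation that assumes at most one hyphenated range should be kept together.

```python
import re

# Second pattern from the answer above, unchanged.
ANSWER_PATTERN = re.compile(
    r"keyword_token:\s*(.*?)\s*(\d[a-zA-Z\d]*)\(([^()]*)\)"
)

# Hypothetical variation (an assumption, not from the answer): also accept
# one hyphenated range such as "1-23" as the house number.
RANGE_PATTERN = re.compile(
    r"keyword_token:\s*(.*?)\s*(\d[a-zA-Z\d]*(?:-\d[a-zA-Z\d]*)?)\(([^()]*)\)"
)


def parse_first_address(line, pattern=ANSWER_PATTERN):
    """Return (street, house_number, qualifier) for the first record, or None."""
    match = pattern.search(line)  # search() stops at the first match
    if match is None:
        return None
    street, number, qualifier = match.groups()
    # An empty "()" qualifier is reported as None.
    return street, number, qualifier or None


samples = [
    "keyword_token: street name 4()",
    "keyword_token: streetname 9(7)keyword_token: streetname 9(7)",
    "keyword_token: street_name 32a()",
    "keyword_token: street-name 1-23(BLOCK D HAUS 6)keyword_token: street-name 1-23(BLOCK H HAUS 2)",
]

for sample in samples:
    print(parse_first_address(sample), parse_first_address(sample, RANGE_PATTERN))
```

If the records live in a pandas column, the same pattern could presumably be passed to `Series.str.extract`, which also returns only the first match per row, but that is not covered by the answer.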