Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Extract multiple values from a string via regex

I have a number of strings from a 3rd party data source that are in various lengths that contain both underscore and spaces. Each portion of the string is important and I am trying to break it apart into various fields via python. The string does not have special characters (\n, \t, etc.) – should just be spaces, underscores, and parentheses are used to break the data parts.

String Year State ID Sub ID Extra1 Extra2
2022_UT_T1000_100 (Classification1 Classification2) 2022 UT T1000 100 Classification1 Classification2
2021_TX_V999_005 (Classification1) 2021 TX V999 005 Classification1
1999_GA_123456_7890 1999 GA 123456 7890

I could split the string by the underscore, then split the last field by a space but that seems error-prone. Regex is likely the best approach.

I can match the year using this: ^[1-9]\d{3,}$. However, when trying to add an OR operator, it will only find the underscore.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Is there a way to extract this data when I know a pattern exists?

>Solution :

You could try using str.extract with the regex pattern:

^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \((\S+)(?: (\S+))?\))?$

Note that this pattern assumes that there would only be three variants in the string column, namely no extras, one extra, or at most two extras. For arbitrary number of words in parentheses, we would need a different approach.

Python script:

df[["Year", "State", "ID", "Sub ID", "Extra1", "Extra2"]] = df["String"].str.extract(r'^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \((\S+)(?: (\S+))?\))?$')

Here is a regex demo showing that the pattern is working for all variants of your string column.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading