How to extract a substring between parentheses only if it starts with a digit, when the string has multiple sets of parentheses

My goal is to extract the substring between a set of parentheses, but only if it starts with a digit. Several of the strings will have multiple sets of parentheses but only one will contain a string that starts with a digit.

Currently, my pattern extracts everything between the first opening parenthesis and the last closing one, rather than treating them as two separate sets.

As for only using the set of parentheses whose content starts with a digit, I am lost as to how to even approach this.

Any help is appreciated.

import pandas as pd

cols = ['a', 'b']
data = [
    ['xyz - (4 inch), (four inch)', 'abc'],
    ['def', 'ghi'],
    ['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
df['c'] = df['a'].str.extract(r"\((.*)\)")

Desired output:

                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN

Current output:

                                a    b                       c
0     xyz - (4 inch), (four inch)  abc     4 inch), (four inch
1                             def  ghi                     NaN
2  xyz - ( 5.5 inch), (five inch)  abc   5.5 inch), (five inch

Solution:

The following pattern should do the job: \((\d[^.)]+)\)

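What it does:

  • Matches the literal character '('.
  • Starts capturing: a digit must come immediately after the '(', followed by one or more characters that are neither ')' nor '.'.
  • Ends capturing.
  • Matches the literal character ')'.

Because the character class cannot cross a ')', the match stays inside a single set of parentheses, unlike the greedy '.*' in the original pattern.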
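To make the difference concrete, here is a minimal sketch that runs both patterns over the sample strings from the question using Python's built-in re module. The helper name first_group and the variable names are illustrative and not part of the original answer.

```python
import re

# Sample strings taken from column 'a' in the question.
samples = [
    "xyz - (4 inch), (four inch)",
    "def",
    "xyz - ( 5.5 inch), (five inch)",
]

# Original pattern: greedy .* runs on to the LAST ')' in the string.
greedy = re.compile(r"\((.*)\)")
# Answer's pattern: a digit right after '(', then one or more characters
# that are neither ')' nor '.'.
digit_first = re.compile(r"\((\d[^.)]+)\)")


def first_group(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


greedy_results = [first_group(greedy, s) for s in samples]
digit_results = [first_group(digit_first, s) for s in samples]

print(greedy_results)
print(digit_results)
```

The greedy pattern spans both sets of parentheses, while the second pattern only matches a set whose first character is a digit, so the third string (which has a space after the opening parenthesis) yields no match.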

You can see a detailed explanation on regex101

Final code:

import pandas as pd

cols = ['a', 'b']
data = [
    ['xyz - (4 inch), (four inch)', 'abc'],
    ['def', 'ghi'],
    ['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
# Raw string so the backslashes reach the regex engine unchanged
df['c'] = df['a'].str.extract(r"\((\d[^.)]+)\)")

print(df)

Output generated:

                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN
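
One caveat worth checking against your own data: because the character class excludes '.', the pattern also skips a decimal value written without a leading space, such as (5.5 inch), and the '+' means a lone (4) does not match either. If those cases should match, one possible variant is \((\d[^)]*)\). This is an assumption about the requirements, not part of the original answer, and the names below are illustrative.

```python
import re

# Pattern from the answer: excludes '.' and needs at least one character
# after the leading digit.
strict = re.compile(r"\((\d[^.)]+)\)")
# Hypothetical relaxed variant: allows '.' and allows a single digit alone.
relaxed = re.compile(r"\((\d[^)]*)\)")

cases = ["(5.5 inch)", "(4)", "( 5.5 inch)", "(four inch)"]


def capture(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


strict_results = [capture(strict, c) for c in cases]
relaxed_results = [capture(relaxed, c) for c in cases]

print(strict_results)
print(relaxed_results)
```

Both patterns still reject content that starts with a space or a letter, so the desired output for the question's sample data is unchanged with either one.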