Returning empty string for missing capture group Python regex

I’m working on parsing string text containing information on university, year, degree field, and whether or not a person graduated. Here are two examples:

ex1 = 'BYU: 1990 Bachelor of Arts Theater (Graduated):BYU: 1990 Bachelor of Science Mathematics (Graduated):UNIVERSITY OF VIRGINIA: 1995 Master of Science Mechanical Engineering (Graduated):MICHIGAN STATE UNIVERSITY: 2008 Master of Fine Arts INDUSTRIAL DESIGN (Graduated)'

ex2 = 'UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science Economics (Graduated):UCSD 2010 Master of Science Economics'

What I am struggling to accomplish is to have an entry for each school experience regardless of whether specific information is missing. In particular, imagine I wanted to pull whether each degree was finished from ex1 and ex2 above. When I try to use re.findall I end up with something like the following for ex1:

# Code:
re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex1)

# Output:
['Graduated', 'Graduated']

which is what I want: two entries for the two Bachelor’s degrees. In ex2, however, one of the Bachelor’s degrees was unfinished, so its text does not contain "(Graduated)" and the output is the following:

# Code:
re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex2)

# Output:
['Graduated']

# Desired Output:
['', 'Graduated']

I have tried making the capture group optional or including the colon after graduated and am not making much headway. The example I am using is the "Graduated" information, but in principle the more general question remains if there is an identifiable degree but it is missing one or two pieces of information (like graduation year or university). Ultimately I am just looking to have complete information on each degree, including whether certain pieces of information are missing. Thank you for any help you can provide!

>Solution :

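You can use the ? quantifier to make both the opening parenthesis and the "Graduated" capture group optional (zero or one occurrence each). Excluding ( and ) from the character class keeps it from consuming "(Graduated)" itself. When the group does not take part in a match, re.findall returns an empty string for it.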

re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)

Output:

>>> re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
['', 'Graduated']
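
For the more general case raised in the question (any field may be missing), one possible approach is to split the text into records first and then match each record against a pattern whose groups are individually optional. The sketch below is not part of the original answer: the RECORD pattern, the parse_education helper, and the rule that records are separated by a colon with no space after it are all assumptions based on the two example strings.

```python
import re

ex2 = ('UCSD: 2001 Bachelor of Arts English:'
       'UCLA: 2005 Bachelor of Science Economics (Graduated):'
       'UCSD 2010 Master of Science Economics')

# Hypothetical per-record pattern: school, year and status are optional,
# only the degree text (starting with Bachelor or Master) is required.
RECORD = re.compile(
    r'(?P<school>[A-Z ]+?)?:?\s*'
    r'(?P<year>\d{4})?\s*'
    r'(?P<degree>(?:Bachelor|Master)[^:()]*?)\s*'
    r'(?:\((?P<status>Graduated)\))?'
)

def parse_education(text):
    """Return one dict per degree; missing fields become ''."""
    rows = []
    # Assumption: a colon directly followed by a non-space character
    # separates records, while "SCHOOL: " keeps its trailing space.
    for record in re.split(r':(?=\S)', text):
        m = RECORD.fullmatch(record.strip())
        if m is None:
            continue  # record does not look like a degree entry
        # Groups that did not participate are None; normalise them to ''.
        rows.append({k: v or '' for k, v in m.groupdict().items()})
    return rows

rows = parse_education(ex2)
statuses = [row['status'] for row in rows]
```

With ex2 this yields three rows, and status is '' for the two degrees without "(Graduated)". A missing field comes back as an empty string instead of causing the whole entry to be dropped.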