Regular Expression to Pull Information from String in Python

I want to take my current string and remove everything from it that is not an actual software version. Here is the string I am currently working with:

print (CurrentVersion)

Delivers the output:

2018, \\\\some\\directory\\is\\here, \\\\some\\directory\\is\\here,  2019, \\\\here\\is\\another\\directory, \\\\here\\is\\another\\directory,  2021, \\\\here\\is\\another\\path_2021,   2020, http://some.will/even/look/like/this,   2022r2,   2023

What I really want as output is this:

2018, 2019, 2020, 2021, 2022r2, 2023

I tried to come up with a regular expression to remove the excess data. It looks like '[0-9, ]' will pull out the digits, commas, and spaces, getting me closer to my goal. So I came up with this code:

import re

RegexVersion = re.compile(r'[0-9, ]')
CurrentVersion = RegexVersion.search(CurrentVersion)
print(CurrentVersion.group())

But this only prints "2". Based on a regex tester, I expected the result to be closer to my desired output. From there I was planning to use .replace to get rid of the extra commas and spaces, but I can't seem to get that far.

So the question is: how do I strip the current contents of "CurrentVersion" down to only the versions, preferably in numerical order?

Solution:

You might use a capture group:

(?:^|,\s*)(\d{4}\w*)(?=,|$)

The pattern matches:

  • (?:^|,\s*) Match either the start of the string, or match a comma followed by optional whitespace chars
  • (\d{4}\w*) Capture 4 digits followed by optional word characters
  • (?=,|$) Assert either a comma or the end of the string to the right
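
As a sketch, the same breakdown can be written as a commented pattern with re.VERBOSE, which ignores whitespace in the pattern and allows # comments inside it. The short sample string is made up for illustration and is not the string from the question.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part carries a comment.
pattern = re.compile(r"""
    (?:^|,\s*)    # start of the string, or a comma plus optional whitespace
    (\d{4}\w*)    # capture group 1: four digits, then optional word characters
    (?=,|$)       # lookahead: a comma or the end of the string must follow
""", re.VERBOSE)

# Hypothetical short sample, not the data from the question
sample = "2018, C:\\some\\dir,  2019, 2022r2"

# findall returns the contents of group 1 for every match
matches = pattern.findall(sample)
print(matches)
```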

Example

import re
 
pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
 
s = ("2018, \\\\\\\\some\\\\directory\\\\is\\\\here, \\\\\\\\some\\\\directory\\\\is\\\\here,  2019, \\\\\\\\here\\\\is\\\\another\\\\directory, \\\\\\\\here\\\\is\\\\another\\\\directory,  2021, \\\\\\\\here\\\\is\\\\another\\\\path_2021,   2020, http://s...content-available-to-author-only...e.will/even/look/like/this,   2022r2,   2023\n")
 
print(re.findall(pattern, s))

Output

['2018', '2019', '2021', '2020', '2022r2', '2023']
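
The list above keeps the order in which the versions appear in the string, so 2021 comes before 2020. To get the comma-separated string from the question in numerical order, one possible sketch is to sort on the leading year and join the result. The sample string here is a shortened, made-up stand-in for the real data, and the sort key assumes every match starts with a 4-digit year (suffixes such as r2 are compared as plain text, so r10 would sort before r2).

```python
import re

pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"

# Shortened, hypothetical stand-in for the string in the question
s = r"2018, \\some\dir, 2019, \\other\path_2021,   2021,   2020, http://some.will/x,   2022r2,   2023"

versions = re.findall(pattern, s)

# Sort by the numeric year first, then by whatever suffix follows it
ordered = sorted(versions, key=lambda v: (int(v[:4]), v[4:]))

result = ", ".join(ordered)
print(result)
```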

Another option is to match only years that start with 20, optionally followed by r and 1 or more digits:

(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)
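
A minimal sketch of how this stricter variant behaves, using a made-up sample string. The values 1999 and 2021beta are hypothetical and only there to show what gets rejected.

```python
import re

pattern = r"(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)"

# Hypothetical sample: 1999 does not start with 20, and 2021beta has a
# suffix other than r followed by digits, so neither should match.
sample = "2018, 1999, 2022r2, 2021beta, 2023"

matches = re.findall(pattern, sample)
print(matches)
```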

Or match 4 digits followed by any characters except a comma:

(?:^|,\s*)(\d{4}[^,]*)
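
A quick sketch of this last variant on a made-up sample. Because [^,]* accepts anything up to the next comma, a match can include trailing spaces or a newline, so stripping each result is probably a good idea.

```python
import re

pattern = r"(?:^|,\s*)(\d{4}[^,]*)"

# Hypothetical sample with a trailing space before a comma and a final newline
sample = "2018, \\\\some\\dir, 2022r2 beta ,   2023\n"

raw = re.findall(pattern, sample)

# Remove the surrounding whitespace that [^,]* may have picked up
matches = [m.strip() for m in raw]
print(matches)
```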
