I am trying to take my current string and remove everything from it that isn't an actual software version. Here is the string I am currently working with:
print (CurrentVersion)
Delivers the output:
2018, \\\\some\\directory\\is\\here, \\\\some\\directory\\is\\here, 2019, \\\\here\\is\\another\\directory, \\\\here\\is\\another\\directory, 2021, \\\\here\\is\\another\\path_2021, 2020, http://some.will/even/look/like/this, 2022r2, 2023
What I really want as output is this:
2018, 2019, 2020, 2021, 2022r2, 2023
I tried to come up with a regular expression to remove the excess data. It looks like ‘[0-9, ]’ will pull out the digits, commas, and spaces, getting me closer to my goal. So I came up with this code:
RegexVersion = re.compile(r'[0-9, ]')
CurrentVersion = RegexVersion.search(CurrentVersion)
print (CurrentVersion.group())
But this only prints "2". Based on a regex tester, it looked like it would get a little closer to my expected output. From there I was planning to use .replace to get rid of the extra commas and spaces, but I can’t seem to get that far.
So the question is: how do I strip the current value of "CurrentVersion" down to only the versions, preferably in numerical order?
>Solution :
You might use a capture group:
(?:^|,\s*)(\d{4}\w*)(?=,|$)
The pattern matches:
(?:^|,\s*) Match either the start of the string, or a comma followed by optional whitespace characters
(\d{4}\w*) Capture 4 digits followed by optional word characters (group 1)
(?=,|$) Assert either a comma or the end of the string to the right
Example
import re
pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
s = ("2018, \\\\\\\\some\\\\directory\\\\is\\\\here, \\\\\\\\some\\\\directory\\\\is\\\\here, 2019, \\\\\\\\here\\\\is\\\\another\\\\directory, \\\\\\\\here\\\\is\\\\another\\\\directory, 2021, \\\\\\\\here\\\\is\\\\another\\\\path_2021, 2020, http://some.will/even/look/like/this, 2022r2, 2023\n")
print(re.findall(pattern, s))
Output
['2018', '2019', '2021', '2020', '2022r2', '2023']
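The question also asks for numerical order, and the list above is not sorted (2021 comes before 2020). Here is a minimal sketch of one way to sort it, assuming every version starts with a four-digit year optionally followed by a suffix such as r2. The helper name version_key and the shortened sample string are made up for illustration:

```python
import re

pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
# Shortened sample input; the paths and URL are placeholders.
s = r"2018, \\some\dir, 2019, 2021, \\another\path_2021, 2020, http://some.will/x, 2022r2, 2023"

def version_key(version):
    # Hypothetical helper: sort by the leading 4-digit year, then by any suffix.
    return int(version[:4]), version[4:]

# set() drops duplicate versions before sorting (an assumption about what is wanted).
versions = sorted(set(re.findall(pattern, s)), key=version_key)
result = ", ".join(versions)
print(result)
```

Joining with ", " gives a single string in the same shape as the desired output in the question. If duplicates should be kept, drop the set() call.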
Another option is to match only years that start with 20, optionally followed by r and 1 or more digits:
(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)
Or match 4 digits followed by any characters other than a comma:
(?:^|,\s*)(\d{4}[^,]*)
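To see how the three patterns differ, this sketch runs each of them on a made-up input. The entries "2024 beta" and "1999" are hypothetical and do not come from the question; they are only there to show the differences:

```python
import re

patterns = {
    "word_chars": r"(?:^|,\s*)(\d{4}\w*)(?=,|$)",
    "20xx_only": r"(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)",
    "until_comma": r"(?:^|,\s*)(\d{4}[^,]*)",
}

# Hypothetical sample: a suffix with a space, a non-20xx year, and a path.
sample = r"2018, 2022r2, 2024 beta, 1999, \\some\path_2021"

results = {name: re.findall(p, sample) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)
```

On this input, the first pattern skips "2024 beta" because a space follows the digits, the second also skips 1999 because it does not start with 20, and the third keeps everything up to the next comma, including "2024 beta".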