Regex: Match all characters in between an underscore and a period

I have a set of file names from which I need to extract the dates. The file names look like:

['1 120836_1_20210101.csv',
 '1 120836_1_20210108.csv',
 '1 120836_20210101.csv',
 '1 120836_20210108.csv',
 '10 120836_1_20210312.csv',
 '10 120836_20210312.csv',
 '11 120836_1_20210319.csv',
 '11 120836_20210319.csv',
 '12 120836_1_20210326.csv',
 ...
]

As an example, I would need to extract 20210101 from the first item in the list above.

Here is my code but it is not working – I’m not totally familiar with regex.

import re
dates = []
for file in files:
    dates.extend(re.findall("(?<=_)\d{}(?=\d*\.)", file))

Solution:

You weren’t that far off, but there were a few issues:

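  • you extend dates with the result of re.findall, but you expect exactly one date per file name and are building the whole list in one go, so a re.search inside a list comprehension is simpler
  • your regex has an unneeded complication and a bug: the lookahead (?=\d*\.) does not need the \d*, and \d{} has an empty quantifier, which Python treats as a digit followed by the literal characters {}, so nothing matches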
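To see why the original pattern returns nothing, here is a minimal sketch. It assumes the pattern is written as a raw string (the question used a plain string), and the variable names are only illustrative:

```python
import re

name = "1 120836_1_20210101.csv"

# The pattern from the question, written as a raw string for clarity.
broken = r"(?<=_)\d{}(?=\d*\.)"

# "{}" is not a valid quantifier, so Python treats it as the two
# literal characters "{" and "}". No file name contains them.
no_match = re.findall(broken, name)
print(no_match)

# The same sub-pattern only matches a digit followed by a literal "{}".
literal_match = re.findall(r"\d{}", "7{}")
print(literal_match)
```

The first call finds nothing in the file name, while the second shows that the braces are matched literally.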

This is what you were after:

import re

files = [
    '1 120836_1_20210101.csv',
    '1 120836_1_20210108.csv',
    '1 120836_20210101.csv',
    '1 120836_20210108.csv',
    '10 120836_1_20210312.csv',
    '10 120836_20210312.csv',
    '11 120836_1_20210319.csv',
    '11 120836_20210319.csv',
    '12 120836_1_20210326.csv'
]

dates = [re.search(r"(?<=_)\d+(?=\.)", fn).group(0) for fn in files]

print(dates)

Output:

['20210101', '20210108', '20210101', '20210108', '20210312', '20210312', '20210319', '20210319', '20210326']

It keeps the lookbehind for an underscore, and changes the lookahead to look for a period. It just matches all digits (at least one, with +) in between the two.
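If some file names might not contain a date at all, re.search returns None and calling .group(0) on it raises an AttributeError. The following is a hedged sketch of one way to guard against that; the helper extract_date, the constant DATE_BEFORE_EXT, and the file name notes.txt are hypothetical and not part of the original answer:

```python
import re

# Precompiled form of the pattern from the answer: digits that sit
# directly between an underscore and a period.
DATE_BEFORE_EXT = re.compile(r"(?<=_)\d+(?=\.)")

def extract_date(filename):
    """Return the digits between the last underscore and the period, or None."""
    match = DATE_BEFORE_EXT.search(filename)
    return match.group(0) if match else None

# "notes.txt" is a made-up name with no underscore, so it yields None.
sample = ["1 120836_1_20210101.csv", "10 120836_20210312.csv", "notes.txt"]
extracted = [extract_date(fn) for fn in sample]
print(extracted)
```

Returning None keeps the result list aligned with the input list, so missing dates can be filtered or reported afterwards.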

Note that the r in front of the string makes it a raw string, which avoids having to double up the backslashes in the regex. The backslashes in \d and \. are still required to indicate a digit and a literal period.
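As a small illustration of the raw-string point, the sketch below compares the raw pattern with its doubled-backslash equivalent. The variable names are only for illustration:

```python
import re

# Raw string: backslashes reach the regex engine unchanged.
raw = r"(?<=_)\d+(?=\.)"

# Ordinary string: each backslash has to be doubled to survive
# Python's own string escaping.
escaped = "(?<=_)\\d+(?=\\.)"

# Both spellings produce the same pattern text.
same = raw == escaped
print(same)

found = re.search(escaped, "1 120836_20210108.csv").group(0)
print(found)
```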
