re.search for first (of two) dates in filenames

March 29, 2023

Can someone please help me locate the first date using regex from file names formatted:

TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv

I am getting the following error on the re.search line: AttributeError: 'NoneType' object has no attribute 'group'

I assume I am not using re.search correctly. Can someone please correct my re.search call based on the above filename? I want to pull only the first date from the file names.
I am new to regex, so any help is appreciated. Thank you!

I tried the following, expecting to extract the first date in each file name, formatted as 2022-03-04:

date = re.search('\b(\d{4}-\d{2}-\d{2})\.', filename)

>Solution :

There are a few problems with your regex.

First, the regex itself is incorrect. Here is what it asks for:

\b       # Match a word boundary (non-word character followed by word character or vice versa)
(        # followed by a group which consists of
  \d{4}- # 4 digits and '-', then
  \d{2}- # 2 digits and '-', then
  \d{2}  # another 2 digits
)        # and finally followed by
\.       # a dot

Since your filename (TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv) contains no substring that matches this pattern, re.search() fails and returns None. Here is why:

2022-03-04 is not followed by a dot; the next character is '-'.
\b does not match before 2022, as both _ and 2 are considered word characters.
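
To check the two points above, here is a small sketch that tests each constraint on its own. The variable names are only illustrative, and the patterns are written as raw strings so that they reach the regex engine unchanged:

```python
import re

filename = 'TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'

# '_' and '2' are both word characters, so \b cannot match between them.
no_boundary = re.search(r'\b\d{4}-\d{2}-\d{2}', filename)

# The date is followed by '-', not by a literal dot.
no_dot = re.search(r'\d{4}-\d{2}-\d{2}\.', filename)

# With both constraints dropped, the date is found.
plain = re.search(r'\d{4}-\d{2}-\d{2}', filename)

print(no_boundary)     # None
print(no_dot)          # None
print(plain.group(0))  # 2022-03-04
```

Either constraint alone is enough to make re.search() return None, which is where the AttributeError on .group comes from.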

That being said, the regex should be modified, like this:

(?<=_)   # Match something preceded by '_', which will not be included in our match,
\d{4}-   # 4 digits and '-', then
\d{2}-   # 2 digits and '-', then
\d{2}    # another 2 digits, then
\b       # a word boundary

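Now, do you see those backslashes? In a normal Python string literal, some of them are interpreted as escape sequences before the regex engine ever sees them (\b, for example, becomes a backspace character), so you would have to double every backslash. Raw strings save you from doing that: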

r'(?<=_)\d{4}-\d{2}-\d{2}\b'
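
If you want to see the difference for yourself, the following sketch compares a normal literal with a raw one. The sample text 'a date' and the variable names are made up for illustration:

```python
import re

plain_literal = '\b'      # a single backspace character (U+0008)
raw_literal = r'\b'       # backslash + 'b', i.e. the regex word boundary
escaped_literal = '\\b'   # the same two characters, escaped by hand

text = 'a date'

# Searches for a backspace character followed by 'date': not in the text.
found_plain = re.search(plain_literal + 'date', text)

# Searches for 'date' at a word boundary: found.
found_raw = re.search(raw_literal + 'date', text)

print(plain_literal == '\x08')         # True
print(raw_literal == escaped_literal)  # True
print(found_plain)                     # None
print(found_raw.group(0))              # date
```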

Try it:

import re

filename = 'TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'
match = re.search(r'(?<=_)\d{4}-\d{2}-\d{2}\b', filename)

print(match.group(0)) # '2022-03-04'
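
Since re.search() returns None whenever a filename does not match, it may be worth checking the result before calling .group(). One possible way to do that is sketched below; first_date and DATE_PATTERN are hypothetical names, and 'no_date_here.csv' is a made-up filename used only to show the no-match case:

```python
import re

# Same pattern as above, compiled once so it can be reused for many files.
DATE_PATTERN = re.compile(r'(?<=_)\d{4}-\d{2}-\d{2}\b')

def first_date(filename):
    """Return the first YYYY-MM-DD that follows an underscore, or None."""
    match = DATE_PATTERN.search(filename)
    return match.group(0) if match else None

print(first_date('TEST_2022-03-04-05-30-20.csv_parsed.csv_encrypted.csv'))  # 2022-03-04
print(first_date('no_date_here.csv'))  # None
```

This way a filename without a date gives you None to handle, instead of raising AttributeError.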