Matching .srt file subtitle line and timestamps with regex

As the title states, I want to match the timestamps and text lines of subtitles in .srt files.

Some of these files are not formatted properly, so I need something that works with almost all of them.

A correctly formatted file looks like this:

1
00:00:02,160 --> 00:00:04,994
You really don't remember
what happened last year?

2
00:00:06,440 --> 00:00:07,920
- School. Now.
- I dropped out.

3
00:00:08,120 --> 00:00:10,510
- Get your diploma, I'll get mine.
- What you doing?

4
00:00:10,680 --> 00:00:13,514
- Studying.
- You taking your GED? All right, Fi.

The regex pattern that I came up with works very well for this kind of file.

As I said, some of the files are not formatted properly: some don’t have the line number, and some don’t have an empty line after each subtitle. The regex that I came up with does not work properly for those.

There are other questions like this that have already been answered, but I want to match each timestamp and the text lines in separate capture groups. So my groups for the first subtitle of the example above would be something like this:

group 1: 00:00:02,160

group 2: 00:00:04,994

group 3: You really don't remember\nwhat happened last year?

This is what I’ve got so far:

LINE_RE = (
    # group 1: optional leading whitespace, then a timestamp like 00:00:00,000
    r"^\s*(\d+:\d+:\d+,\d+)"
    # non-capturing group for ' --> ': two or three '-' followed by a '>'
    r"(?:\s*-{2,3}>\s*)"
    # group 2: the timestamp format again, then optional whitespace and a \n
    r"(\d+:\d+:\d+,\d+)\s*\n"
    # group 3: any characters, lazily, until it hits an empty line,
    # a timestamp, or a line with only a number in it
    r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))"
)

I think my exact problem is in the last non-capturing group: it does not work properly when the next line is not an empty line.

This is an example file. I mangled it a bit so I could show the problem better.
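To make the failure concrete, here is a minimal reproduction. It is only a sketch: the `mangled` input is a made-up stand-in for a badly formatted file, and `re.MULTILINE` is assumed because the pattern relies on `^` and `$` matching at line boundaries. With no empty line between subtitles, group 3 runs on into the next timestamp and the second subtitle is never matched.

```python
import re

# The pattern from the question, unchanged; re.MULTILINE is an assumption.
LINE_RE = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)"
    r"(?:\s*-{2,3}>\s*)"
    r"(\d+:\d+:\d+,\d+)\s*\n"
    r"([\s\S]*?(?:^\s*$|\d+:\d+:\d+,\d+|^\s*\d+\s*\n))",
    re.MULTILINE,
)

# Hypothetical mangled input: no line numbers, no empty line between subtitles.
mangled = (
    "00:00:02,160 --> 00:00:04,994\n"
    "Hello\n"
    "00:00:06,440 --> 00:00:07,920\n"
    "Bye\n"
)

matches = list(LINE_RE.finditer(mangled))
print(len(matches))                # 1 (the second subtitle is lost)
print(repr(matches[0].group(3)))   # 'Hello\n00:00:06,440'
```

The terminating alternatives sit inside capture group 3, so whatever ends the text (here the next start timestamp) is consumed as part of it.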

>Solution :

In that case, you can match the lines that start with a timestamp-like pattern, and then capture all following lines that neither start with another timestamp-like pattern nor are empty lines followed by a line of only digits (the next subtitle number).

^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)

The pattern in parts matches:

  • ^ Start of the string (or of a line, when the multiline flag is enabled)
  • \s* Match optional whitespace chars
  • (\d+:\d+:\d+,\d+) Capture group 1, match a timestamp-like pattern
  • [^\S\n]+-->[^\S\n]+ Match --> between 1 or more whitespace chars other than a newline
  • (\d+:\d+:\d+,\d+) Capture group 2, same pattern as for group 1
  • ( Capture group 3
    • (?: Non capture group
      • \n Match a newline
      • (?! Negative lookahead, assert what is to the right is not
        • \d+:\d+:\d+,\d+\b|\n+\d+$ Match either a timestamp or 1+ newlines followed by only digits
      • ) Close lookahead
      • .* Match the whole line
    • )* Close the non capture group and optionally repeat it
  • ) Close group 3

See a regex demo.
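As a rough sketch of how the pattern could be used from Python: the names `CUE_RE`, `parse_cues` and `sample` below are made up for illustration, and `re.MULTILINE` is assumed so that `^` and `$` match at line boundaries. Group 3 starts with the newline that follows the timestamp line, so it is stripped here.

```python
import re

# The pattern above, split over two string literals for readability.
# re.MULTILINE is an assumption: ^ and $ must match at line boundaries.
CUE_RE = re.compile(
    r"^\s*(\d+:\d+:\d+,\d+)[^\S\n]+-->[^\S\n]+(\d+:\d+:\d+,\d+)"
    r"((?:\n(?!\d+:\d+:\d+,\d+\b|\n+\d+$).*)*)",
    re.MULTILINE,
)


def parse_cues(srt_text):
    """Return a list of (start, end, text) tuples, one per subtitle."""
    return [
        (m.group(1), m.group(2), m.group(3).strip())
        for m in CUE_RE.finditer(srt_text)
    ]


# Hypothetical mangled input: the second subtitle has no line number
# and no empty line before it.
sample = (
    "1\n"
    "00:00:02,160 --> 00:00:04,994\n"
    "You really don't remember\n"
    "what happened last year?\n"
    "00:00:06,440 --> 00:00:07,920\n"
    "- School. Now.\n"
    "- I dropped out.\n"
    "\n"
    "3\n"
    "00:00:08,120 --> 00:00:10,510\n"
    "- Get your diploma, I'll get mine.\n"
)

for start, end, text in parse_cues(sample):
    print(start, end, repr(text))
```

One limitation to be aware of: a line number that directly follows the text with no empty line in between is not excluded by the lookahead, so it would end up in group 3.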
