regex to remove captions with condition not to overlap second match

I have the following string, which I extract from a pdf:

This is
Fig. 13: John holding his present and
the flowers
Source: official photographer
a beautiful
Table: a table of some kind
and fully
complete
Table: John holding his present and
Source: official photographer
sentence

The text includes figures and tables, most of which have a caption on top and a source line at the bottom, but some don't. The text I want to be left with is:

This is
a beautiful
and fully
complete
sentence

I have tried the following:

s = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring,flags=re.MULTILINE)

but unfortunately it returns:

This is
a beautiful
sentence

With my limited knowledge of regex, I cannot figure out how to express this condition:
the match should stop at the first \n after Source only if there is no new Fig|Table in between; otherwise it should stop at the first \n after the start of the match.

any idea?

thank you

>Solution :

What you need to match is a Fig or Table followed by either

  1. Characters up to and including a line starting with Source, with no Fig or Table in between the original one and Source; or
  2. Characters up to the end of line

You can achieve #1 above by using a tempered greedy token, which checks, at each character consumed before Source, that neither Fig nor Table starts at that position. This regex will do what you want:

(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)

This matches:

  • (?:Fig|Table) : the word Fig or Table; and then either
  • (?:(?!Fig|Table)[\s\S])+? : a minimal number of characters, none of which begins the word Fig or Table; followed by
  • Source[^\n]*\n : the word Source followed by any characters up to and including the newline; or
  • [^\n]*\n : any characters up to and including the newline

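The same pattern can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the names CAPTION_RE, sample and cleaned are illustrative, and sample is assumed to reproduce the string from the question with a trailing newline.

```python
import re

# Same regex as above, spread out with re.VERBOSE (whitespace and # comments are ignored).
CAPTION_RE = re.compile(r"""
    (?:Fig|Table)                  # caption keyword
    (?:
        (?:(?!Fig|Table)[\s\S])+?  # as few characters as possible, none starting a new caption
        Source[^\n]*\n             # the Source line, through its newline
      |
        [^\n]*\n                   # otherwise only the rest of the caption line
    )
""", re.VERBOSE)

# Assumed input: the text from the question, ending with a newline.
sample = (
    "This is\n"
    "Fig. 13: John holding his present and\n"
    "the flowers\n"
    "Source: official photographer\n"
    "a beautiful\n"
    "Table: a table of some kind\n"
    "and fully\n"
    "complete\n"
    "Table: John holding his present and\n"
    "Source: official photographer\n"
    "sentence\n"
)

cleaned = CAPTION_RE.sub("", sample)
print(cleaned)
```

The order of the two alternatives matters: the Source branch is tried first, and the single-line branch is used only when a new Fig or Table appears before any Source.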

In python:

import re
# mystring holds the text extracted from the pdf, as shown in the question
s = re.sub(r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)', '', mystring)
print(s)

Output:

This is
a beautiful
and fully
complete
sentence

Note that this leaves any newlines at the start and end of the original string in place; they can be removed with strip.
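As a small illustration of the leftover newlines, here is a sketch using a made-up input (padded is hypothetical, not taken from the question) that starts and ends with blank lines:

```python
import re

pattern = r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)'

# Hypothetical input with a leading newline and trailing blank line.
padded = "\nFig. 1: a caption\nSource: someone\nbody text\n\n"

result = re.sub(pattern, '', padded)
print(repr(result))          # the surrounding newlines are still there
print(repr(result.strip()))  # strip() removes leading and trailing whitespace
```

strip also removes spaces and tabs at both ends; if only newlines should go, strip('\n') is the narrower choice.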
