I have the following string, which I extracted from a PDF:
This is
Fig. 13: John holding his present and
the flowers
Source: official photographer
a beautiful
Table: a table of some kind
and fully
complete
Table: John holding his present and
Source: official photographer
sentence
The text includes figures and tables, most of which have a caption on top and a source at the bottom, but some don't. The text I want to be left with is:
This is
a beautiful
and fully
complete
sentence
I have tried the following:
s = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring, flags=re.MULTILINE)
but unfortunately it returns:
This is
a beautiful
sentence
With my limited knowledge of regex, I cannot figure out how to express this condition:
the match should stop at the first \n after Source, but only if there is no new Fig or Table in between; otherwise it should stop at the first \n after the start of the match.
Any ideas?
Thank you
>Solution :
What you need to match is Fig or Table, followed by either:
- characters up to and including a line starting with Source, with no Fig or Table in between the original one and Source; or
- characters up to the end of the line.
You can achieve the first option by using a tempered greedy token, which ensures that no character consumed before Source is found is the start of Fig or Table. This regex will do what you want:
(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)
This matches:
- (?:Fig|Table) : the word Fig or Table; and then either
- (?:(?!Fig|Table)[\s\S])+? : a minimal number of characters, none of which begins the word Fig or Table, followed by
- Source[^\n]*\n : the word Source followed by any number of characters up to and including the newline; or
- [^\n]*\n : any number of characters up to and including the newline
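To make the breakdown above easier to follow, here is a sketch of the same pattern written with re.VERBOSE so each part carries its own comment, followed by a small contrast with a plain lazy match. The names CAPTION_RE, sample, naive and tempered, and the shortened sample string, are mine for illustration and are not from the question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
CAPTION_RE = re.compile(
    r"""
    (?:Fig|Table)                   # a caption starts with the word Fig or Table
    (?:
        (?:(?!Fig|Table)[\s\S])+?   # tempered token: any character, provided
                                    # Fig or Table does not start at that position
        Source[^\n]*\n              # then the Source line, through its newline
      |
        [^\n]*\n                    # otherwise: only the rest of the caption line
    )
    """,
    re.VERBOSE,
)

# Hypothetical input: the first Table has no Source line, the second one does.
sample = "x\nTable: one\nkeep\nTable: two\nSource: s\nend\n"

# A plain lazy match runs past the second Table and swallows "keep".
naive = re.sub(r"(?:Fig|Table)[\s\S]+?Source:.*\n", "", sample)
# The tempered version refuses to cross a new Fig or Table.
tempered = CAPTION_RE.sub("", sample)

print(repr(naive))     # 'x\nend\n'
print(repr(tempered))  # 'x\nkeep\nend\n'
```

The only difference between the two patterns is the negative lookahead in front of each consumed character, which is what stops the first alternative from reaching a Source that belongs to a later caption.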
Regex demo on regex101
In Python:
import re

s = re.sub(r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)', '', mystring)
print(s)
Output:
This is
a beautiful
and fully
complete
sentence
Note that this leaves any newlines at the start and end of the original string in place; they can be removed with strip().
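For completeness, here is a self-contained sketch that runs the substitution on the string from the question and strips the surrounding newlines. I am assuming the extracted text uses plain \n line endings and ends with a newline; the variable names pattern and cleaned are mine.

```python
import re

# The text from the question, assuming "\n" line endings.
mystring = (
    "This is\n"
    "Fig. 13: John holding his present and\n"
    "the flowers\n"
    "Source: official photographer\n"
    "a beautiful\n"
    "Table: a table of some kind\n"
    "and fully\n"
    "complete\n"
    "Table: John holding his present and\n"
    "Source: official photographer\n"
    "sentence\n"
)

pattern = r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)'

# Remove every caption block, then drop leading and trailing whitespace.
cleaned = re.sub(pattern, '', mystring).strip()
print(cleaned)
# This is
# a beautiful
# and fully
# complete
# sentence
```

Keep in mind that the pattern is case-sensitive and not anchored, so a capitalised Fig or Table in the middle of a body line would also start a match. If captions always begin a line in your data (an assumption I cannot check from the sample), prefixing the pattern with ^ and passing flags=re.MULTILINE is one possible way to tighten it.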