I’d like to match every block of text between <PDF> and </PDF> inside a string:
import re
lines = """
hello
<PDF>
bla1
</PDF>
test
<PDF>
bla2
</PDF>
"""
matches = re.findall(r"<PDF>.*</PDF>", lines, re.DOTALL)
print(matches)
Output:
['<PDF>\nbla1\n</PDF>\ntest\n<PDF>\nbla2\n</PDF>']
Expected Output:
['<PDF>\nbla1\n</PDF>', '<PDF>\nbla2\n</PDF>']
What’s going wrong here? How can I ensure that no text between </PDF> and <PDF> gets matched?
Solution:
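* is greedy, so it matches as much as possible. With re.DOTALL, .* also crosses newlines, so the match runs from the first <PDF> all the way to the last </PDF>.
Use the non-greedy *? in this case. See Python’s documentation of the re module: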
"Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
matches = re.findall(r"<PDF>.*?</PDF>", lines, re.DOTALL)
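To see the difference side by side, here is a minimal sketch that reuses the sample string from the question. The variable names greedy and lazy are only illustrative and are not part of the original code.

```python
import re

lines = """
hello
<PDF>
bla1
</PDF>
test
<PDF>
bla2
</PDF>
"""

# Greedy: .* extends to the LAST </PDF>, producing one large match.
greedy = re.findall(r"<PDF>.*</PDF>", lines, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </PDF> that follows each <PDF>.
lazy = re.findall(r"<PDF>.*?</PDF>", lines, re.DOTALL)

print(len(greedy))  # 1
print(lazy)         # ['<PDF>\nbla1\n</PDF>', '<PDF>\nbla2\n</PDF>']
```

The re.DOTALL flag is still needed in both cases, because without it . does not match the newlines inside each block.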