I have a text file containing The Tragedie of Macbeth. I want to clean it, and the first step is to remove everything up to the line The Tragedie of Macbeth and store the remaining part in removed_intro_file.
I tried:
import re
filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    removed_intro = file.read()
with open('removed_intro_file', 'w') as output:
    removed = re.sub(title, '', removed_intro)
    print(removed)
    output.write(removed)
The print statement doesn’t print anything, so it doesn’t seem to match anything. How can I use a regex over several lines? Should one instead use pointers that point to the start and end of the lines to remove? I’d also be glad to know if there is a nicer way to solve this, maybe without using regex.
>Solution :
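Your regex only replaces the title itself with ''. You want to remove the title and all the text before it, so match every character (including newlines) from the beginning of the string up to and including the title. The (?s) flag makes . match newlines as well, and re.escape keeps any special characters in the title from being read as regex syntax. This should work (I only tested it on a sample file I wrote):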
removed = re.sub(r'(?s)^.*'+re.escape(title), '', removed_intro)
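As for a way that avoids regex, a minimal sketch using str.partition might look like the following. The helper name remove_intro and the sample string are made up for illustration, and returning the text unchanged when the title is missing is an assumed policy, not something the question specifies. One difference worth noting: the greedy .* in the pattern above cuts up to the last occurrence of the title, whereas partition splits at the first one.

```python
def remove_intro(text, title):
    """Return the part of text that follows the first occurrence of title."""
    # partition splits at the first occurrence: (before, separator, after).
    before, found, after = text.partition(title)
    if not found:
        # Title not present: return the text unchanged (an assumed policy).
        return text
    return after


# Hypothetical sample standing in for the contents of the real file.
sample = "Header line\nLicense\nThe Tragedie of Macbeth\nActus Primus.\n"
print(remove_intro(sample, "The Tragedie of Macbeth"))
```

With the real file, you would pass in the string returned by file.read() and write the result to removed_intro_file as before.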