regex: cleaning text: remove everything up to a certain line

I have a text file containing The Tragedie of Macbeth. I want to clean it, and the first step is to remove everything up to the line The Tragedie of Macbeth and store the remaining part in removed_intro_file.

I tried:

import re
filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    removed_intro = file.read()
    with open('removed_intro_file', 'w') as output:
        removed = re.sub(title, '', removed_intro)
        print(removed)
        output.write(removed)

The print statement doesn’t print anything, so it seems nothing is matched. How can I use a regex across several lines? Should I instead use pointers to the start and end of the lines to be removed? I’d also be glad to know if there is a nicer way to solve this, maybe without regex.

Solution:

Your regex only replaces the title with ''. You want to remove the title and all the text before it, so match every character (including newlines) from the beginning of the string up to and including the title. This should work (I only tested it on a sample file I wrote):

removed = re.sub(r'(?s)^.*'+re.escape(title), '', removed_intro)
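
Two caveats may be worth keeping in mind. The greedy `.*` runs to the last occurrence of the title, so if the title appears again later in the text, everything up to that later occurrence is removed as well. The question also asks for a way that avoids regex. The sketch below shows a non-greedy regex variant and a plain-string alternative using `str.partition`. The function names `remove_intro_regex` and `remove_intro_plain` are made up for illustration, and the sketch assumes the whole file fits in memory, as in the question.

```python
import re


def remove_intro_regex(text: str, title: str) -> str:
    # (?s) lets '.' match newlines. '.*?' is non-greedy, so the match
    # stops at the FIRST occurrence of the title instead of the last.
    return re.sub(r'(?s)^.*?' + re.escape(title), '', text, count=1)


def remove_intro_plain(text: str, title: str) -> str:
    # str.partition splits the string at the first occurrence of the title.
    _before, found, after = text.partition(title)
    # If the title is not present, 'found' is empty: return the text unchanged.
    return after if found else text
```

Either function could be dropped into the script from the question in place of the `re.sub` call, for example `removed = remove_intro_plain(removed_intro, title)`. Both leave the text untouched when the title is not found.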