Is there a better way to clean up text files with Python, using regex?

I’m trying to create a script to match regex patterns in a series of text files and then remove those matches from each file. Right now I have the following, which works for my purposes, but I don’t think it is an effective way to do it:

import os
import re

os.chdir("/home/user1/test_files")

patterns = ['(bannana)',
            '(peaches)',
            '(apples)'
           ]

subst = ""
cwd = os.getcwd()
for filename in os.listdir(cwd):
    with open(filename, 'r', encoding="utf8") as f:
        file = f.read()
    result = re.sub('|'.join(patterns), subst, file, re.MULTILINE)
    with open("/home/user1/output_files/" + "output_" + str(filename), 'w', encoding="utf-8") as newfile:
        newfile.write(result)

    for pattern in patterns:
        with open('/home/user1/output_files/output_'+str(filename), 'r', encoding="utf8") as f:
            file = f.read()
        result = re.sub(pattern, subst, file, re.MULTILINE)
        with open('/home/user1/output_files/output_'+str(filename), 'w', encoding="utf-8") as newfile:
            newfile.write(result)

So, let’s say I have a file, grocery.txt, and I want to remove the words apples, peaches, and bannana. The above script will first run through and create an output file, output_grocery.txt. It will then iterate through the patterns list, removing each pattern from output_grocery.txt and rewriting the file after each pass.

The way I’m doing this right now is not scalable. I’ll eventually need to run this on hundreds of files, each one being rewritten again and again depending on how many regex patterns I have. I originally tried doing this in one go, using:


result = re.sub('|'.join(patterns), subst, file, re.MULTILINE)

thinking that would remove all the patterns in one go from the file. However, this only removes the first pattern, in this case bannana.

Is there a better, more scalable way to do this?

>Solution :

Expanding on my comment: while there are likely further improvements to be made to this, you should at the very least open the input once, clean up the data, and then write the data out once, with none of the repeated open/write/open/write cycling per pattern.

import os
import re

os.chdir("/home/user1/test_files")

patterns = ['(bannana)',
            '(peaches)',
            '(apples)'
           ]

subst = ""
cwd = os.getcwd()
for filename in os.listdir(cwd):

    #open the file and place contents in `file` variable
    with open(filename, 'r', encoding="utf8") as f:
        file = f.read()

    #iterate over your patterns, replacing each match with
    #nothing and updating the `file` variable for the next
    #pattern iteration; note that re.sub's fourth positional
    #argument is count, so re.MULTILINE must be passed by keyword
    for pattern in patterns:
        file = re.sub(pattern, subst, file, flags=re.MULTILINE)

    #write the `file` variable back out
    with open("/home/user1/output_files/" + "output_" + str(filename), 'w', encoding="utf-8") as newfile:
        newfile.write(file)

Now that the logic is isolated, you can work on improving that for pattern in patterns step so it no longer needs a loop, to gain extra efficiency. A sketch of that idea follows.
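Here is a minimal sketch of that idea, assuming the patterns are simple alternatives as in the question: join them into a single compiled regex and run one substitution per file. One caveat worth knowing is that re.sub’s fourth positional argument is count, not flags, so re.MULTILINE has to be passed by keyword (or given to re.compile); passed positionally it silently limits the number of replacements, which may be why the joined pattern seemed to behave oddly in the original attempt.

import os
import re

os.chdir("/home/user1/test_files")

patterns = ['(bannana)',
            '(peaches)',
            '(apples)'
           ]

# one alternation covering every pattern; flags belong on the compiled object
combined = re.compile('|'.join(patterns), flags=re.MULTILINE)

subst = ""
cwd = os.getcwd()
for filename in os.listdir(cwd):
    with open(filename, 'r', encoding="utf8") as f:
        file = f.read()

    # a single substitution pass removes every pattern at once
    result = combined.sub(subst, file)

    with open("/home/user1/output_files/output_" + str(filename), 'w', encoding="utf-8") as newfile:
        newfile.write(result)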
