How can I make this remove_stopwords function faster in Python?

December 31, 2021

I have a stopword-removal function like this. How can I make it run faster?

temp.reverse()

def drop_stopwords(text):
    
    for x in temp:
        if len(x.split()) > 1:
            text_list = text.split()  
            for y in range(len(text_list)-len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
                    text = " ".join(text_list)
        
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)

    return text

Processing one text from my data takes 14 s. With a trick like the following, the time drops to 3 s:


temp.reverse()

def drop_stopwords(text):
    
    for x in temp:
        if len(x.split()) >2:
            if x in text:
                text = text.replace(x,'')

        elif len(x.split()) > 1:
            text_list = text.split()  
            for y in range(len(text_list)-len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
                    text = " ".join(text_list)
        
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)

    return text

But I think it may go wrong somewhere for my language. How can I rewrite this function in Python to make it faster? (In C and C++ I can solve it easily with the function above :(( )

>Solution :

Your function does a lot of the same thing over and over, particularly repeated split and join of the same text. Doing a single split, operating on the list, and then doing a single join at the end might be faster, and would definitely lead to simpler code. Unfortunately I don’t have any of your sample data to test the performance with, but hopefully this gives you something to experiment with:

temp = ["foo", "baz ola"]


def drop_stopwords(text):
    text_list = text.split()
    text_len = len(text_list)
    for word in temp:
        word_list = word.split()
        word_len = len(word_list)
        for i in range(text_len + 1 - word_len):
            if text_list[i:i+word_len] == word_list:
                text_list[i:i+word_len] = [None] * word_len
    return ' '.join(t for t in text_list if t)


print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
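
If the stopword list is long, another thing to experiment with is replacing the loop over every stopword with hash lookups. The sketch below is only an illustration under assumptions: `stop_phrases` and `max_len` are names I made up, `temp` is the same toy list as above, and it matches greedily from left to right (longest phrase first), which may not give exactly the same result as removing stopwords in list order when phrases overlap.

```python
temp = ["foo", "baz ola"]  # toy stopword list, same as above

# Hypothetical helpers: store every stopword as a tuple of tokens so that
# each check is a set lookup instead of a scan over the whole list.
stop_phrases = {tuple(w.split()) for w in temp}
max_len = max((len(p) for p in stop_phrases), default=0)


def drop_stopwords(text):
    tokens = text.split()
    kept = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase starting at i, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in stop_phrases:
                i += n  # skip the whole matched stopword phrase
                break
        else:
            # No stopword starts here, so keep the token.
            kept.append(tokens[i])
            i += 1
    return " ".join(kept)


print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
```

The work per text then depends on the number of tokens and the longest phrase length rather than on the size of the stopword list, so it is worth timing against your real data.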

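You could also just try calling text.replace for every stopword and see how that performs compared to your more complex split-based solution (keep in mind that str.replace matches substrings, so a stopword is also removed when it occurs inside a longer word):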

temp = ["foo", "baz ola"]


def drop_stopwords(text):
    for word in temp:
        text = text.replace(word, '')
    return ' '.join(text.split())


print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
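
If removing stopwords from inside longer words is a concern, which may be what goes wrong "somewhere" in your language, a single precompiled regular expression is one more option to time. This is a rough sketch, not a tested solution for your data: `_pattern` is a name I made up, and it assumes words are separated by whitespace, so a stopword only matches when it is surrounded by whitespace or the ends of the string.

```python
import re

temp = ["foo", "baz ola"]  # toy stopword list, same as above

# Longest stopwords first, so a multi-word phrase is tried before its parts.
# (?<!\S) and (?!\S) require whitespace (or the string edge) on both sides.
_pattern = re.compile(
    r"(?<!\S)(?:"
    + "|".join(re.escape(w) for w in sorted(temp, key=len, reverse=True))
    + r")(?!\S)"
)


def drop_stopwords(text):
    # One pass removes every stopword; split/join collapses leftover spaces.
    return " ".join(_pattern.sub("", text).split())


print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
print(drop_stopwords("food for the foo"))
# food for the
```

Compiling the pattern once outside the function matters here, since rebuilding it for every text would throw away most of the gain.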