Remove certain words from URL

February 12, 2022

I scraped a list of tweet status URLs and want to remove certain substrings from them. However, my code doesn’t work as intended: it only seems to remove the first string in "stopwords".

Code:

stopwords = ['/people', '/photo/1']
link_list = []
for link in links:
    for i in stopwords:
        remove = link.replace(i, "")
        link = remove
        link_list.append(link)

Output:

https://twitter.com/CultOfCurtis/status/1492292326051483648
https://twitter.com/ZBumblenuts/status/1492292306149560321
https://twitter.com/AndreWillemse4/status/1492292279129804806
https://twitter.com/JaimeeJakobczak/status/1492292268354584578
https://twitter.com/consequence/status/1492245783084773383/photo/1
https://twitter.com/consequence/status/1492245783084773383
https://twitter.com/EVStyle2/status/1492292266169298944
https://twitter.com/SammyMorgan/status/1492292246766436355
https://twitter.com/gayesian/status/1492292246456184841
https://twitter.com/khendriix_/status/1492292245734707202
https://twitter.com/Mauro_Sosa_S/status/1492292242320539650

I tried several different approaches after researching, but to no avail. :/

>Solution :

You just need to de-indent the last line there:

stopwords = ['/people', '/photo/1']
link_list = []
for link in links:
    for i in stopwords:
        remove = link.replace(i, "")
        link = remove
    # Appended once per link, after every stopword has been removed
    link_list.append(link)

In its original position, the append ran once per stopword: it appended the link with only /people removed, and then appended it again with /photo/1 removed as well. Every link therefore appeared twice, and the first copy of any /photo/1 link still contained /photo/1.
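To see the difference concretely, here is a small sketch that runs both indentations on a two-item list. The `sample` list and the function names `strip_buggy` and `strip_fixed` are made up for illustration; the real `links` come from the scraper.

```python
stopwords = ['/people', '/photo/1']

# Hypothetical sample input standing in for the scraped `links`.
sample = [
    'https://twitter.com/consequence/status/1492245783084773383/photo/1',
    'https://twitter.com/EVStyle2/status/1492292266169298944',
]

def strip_buggy(links):
    """Append inside the inner loop: one entry per link per stopword."""
    out = []
    for link in links:
        for word in stopwords:
            link = link.replace(word, "")
            out.append(link)
    return out

def strip_fixed(links):
    """Append after the inner loop: one fully cleaned entry per link."""
    out = []
    for link in links:
        for word in stopwords:
            link = link.replace(word, "")
        out.append(link)
    return out

buggy = strip_buggy(sample)
fixed = strip_fixed(sample)

print(len(buggy))  # 4 entries: each link appears twice
print(len(fixed))  # 2 entries: each link appears once
```

In the buggy result, the first entry still ends in /photo/1, because it was appended after only /people had been processed.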

Alternatively, you could use a compiled regular expression that matches any of the stopwords and removes them in a single pass:

import re

stopwords = ['/people', '/photo/1']
# Escape each stopword and join them with | so one pattern matches any of them
pattern = re.compile('|'.join(map(re.escape, stopwords)))
link_list = [pattern.sub('', link) for link in links]
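One caveat with both versions: the stopwords are removed wherever they occur in the URL, so a handle that happens to start with "people" would be clipped too. If the unwanted parts only ever appear at the end of the URL (an assumption about the scraped data, not something the question states), a sketch like the following anchors the match to the end of the string. The name `strip_suffix` and the second sample URL are hypothetical.

```python
import re

stopwords = ['/people', '/photo/1']

# One or more stopwords, but only when they sit at the end of the URL.
suffix = re.compile('(?:' + '|'.join(map(re.escape, stopwords)) + ')+$')

def strip_suffix(link):
    """Remove trailing stopwords and leave the rest of the URL untouched."""
    return suffix.sub('', link)

# Hypothetical sample; the second handle begins with "people".
sample = [
    'https://twitter.com/consequence/status/1492245783084773383/photo/1',
    'https://twitter.com/people_fan/status/1',
]

cleaned = [strip_suffix(link) for link in sample]
for link in cleaned:
    print(link)
```

With plain `str.replace` or the unanchored pattern, the second URL would lose the "/people" at the start of the handle; the anchored pattern leaves it alone.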