Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Remove certain words from URL

I scraped tweet statuses, from which I’m removing certain words; however, it doesn’t work effectively as it only removes the first string in "stopwords".

Code:

stopwords = ['/people', '/photo/1']
link_list = []
for link in links:
    for i in stopwords:
        remove = link.replace(i, "")
        link = remove
        link_list.append(link)

Output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

https://twitter.com/CultOfCurtis/status/1492292326051483648
https://twitter.com/ZBumblenuts/status/1492292306149560321
https://twitter.com/AndreWillemse4/status/1492292279129804806
https://twitter.com/JaimeeJakobczak/status/1492292268354584578
https://twitter.com/consequence/status/1492245783084773383/photo/1
https://twitter.com/consequence/status/1492245783084773383
https://twitter.com/EVStyle2/status/1492292266169298944
https://twitter.com/SammyMorgan/status/1492292246766436355
https://twitter.com/gayesian/status/1492292246456184841
https://twitter.com/khendriix_/status/1492292245734707202
https://twitter.com/Mauro_Sosa_S/status/1492292242320539650

I tried different codes after researching, but to no avail. :/

>Solution :

You just need to de-indent the last line there:

stopwords = ['/people', '/photo/1']
link_list = []
for link in links:
    for i in stopwords:
        remove = link.replace(i, "")
        link = remove
    link_list.append(link) 

In its original position, it would append the link with /people removed, and then append the link again with /photo/1 removed – so any /photo/1 links would still get included.

You could alternatively apply this suggestion here and use a compiled regular expression:

import re

stopwords = ['/people', '/photo/1']
pattern = re.compile('|'.join(map(re.escape, stopwords)))
link_list = [pattern.sub('', link) for link in links]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading