How to avoid appending to an array if it already contains an element that is a substring of the item to append in Python

I am scraping a website for links with the following code to create an array:

from bs4 import BeautifulSoup
import requests

URL = "https://www.sportsbet.com.au/horse-racing/australia-nz/flemington/race-2-7681724"
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}

r = requests.get(url=URL, headers=headers)
datalinks = []

soup = BeautifulSoup(r.content, "html.parser")

for a in soup.find_all('a', href=True):
    if "race" in a['href']:
        datalinks.append(a['href'])

print(datalinks)

And it returns the following array:
['/horse-racing/australia-nz/flemington/race-1-7681723', '/horse-racing/australia-nz/flemington/race-2-7681724', '/horse-racing/australia-nz/flemington/race-3-7681725', '/horse-racing/australia-nz/flemington/race-4-7681726', '/horse-racing/australia-nz/flemington/race-5-7681727', '/horse-racing/australia-nz/flemington/race-6-7681729', '/horse-racing/australia-nz/flemington/race-7-7681730', '/horse-racing/australia-nz/flemington/race-8-7681731', '/horse-racing/australia-nz/flemington/race-9-7681732', '/horse-racing/australia-nz/flemington/race-10-7681733', '/horse-racing/australia-nz/flemington/race-4-7681726', '/horse-racing/australia-nz/flemington/race-4-7681726/same-race-multi', '/horse-racing/australia-nz/flemington/race-4-7681726/exotics']

datalinks[3] is: /horse-racing/australia-nz/flemington/race-4-7681726
so I would like to skip the last 3 elements instead of appending them, because each of them contains an element that is already in the array (datalinks[3]) as a substring.

I can check whether the exact link is already in the array and skip it:

if ("race" in a['href'] and a['href'] not in datalinks):

but I don’t know how to skip a link when any existing element is a substring of it.

If anyone could help that would be greatly appreciated thanks.

>Solution :

You can modify your code to check whether the current element’s href contains any link already in datalinks: loop over the existing elements and test whether each one is a substring of the current href. An exact duplicate is also a substring match, so this covers repeated links too. Note that it relies on the shorter link appearing before its longer variants, as it does in your output. Here’s how you can do it:

from bs4 import BeautifulSoup
import requests

URL = "https://www.sportsbet.com.au/horse-racing/australia-nz/flemington/race-2-7681724"
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}

r = requests.get(url=URL, headers=headers)
datalinks = []

soup = BeautifulSoup(r.content, "html.parser")

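for a in soup.find_all('a', href=True):
    if "race" in a['href']:
        # Assume the link is new until an existing link is found inside it
        append_link = True
        for existing_link in datalinks:
            if existing_link in a['href']:
                # An already-stored link is a substring of this href, so skip it
                append_link = False
                break
        if append_link:
            datalinks.append(a['href'])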

print(datalinks)
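
The inner loop and flag can also be written more compactly with the built-in any(). The following is only a sketch of that idea: the helper name keep_unextended, its keyword parameter, and the sample list are made up for illustration (the sample reuses a few links from the question plus a hypothetical "/about-us" link), so it runs without fetching the page.

```python
def keep_unextended(hrefs, keyword="race"):
    """Keep hrefs containing keyword, skipping any href that
    contains an already-kept link as a substring."""
    kept = []
    for href in hrefs:
        if keyword not in href:
            continue
        # any() stops at the first kept link found inside href;
        # an exact duplicate also counts as a substring match
        if not any(existing in href for existing in kept):
            kept.append(href)
    return kept


# Hypothetical sample data standing in for the scraped hrefs
sample = [
    "/horse-racing/australia-nz/flemington/race-3-7681725",
    "/horse-racing/australia-nz/flemington/race-4-7681726",
    "/horse-racing/australia-nz/flemington/race-4-7681726",
    "/horse-racing/australia-nz/flemington/race-4-7681726/same-race-multi",
    "/horse-racing/australia-nz/flemington/race-4-7681726/exotics",
    "/about-us",
]

print(keep_unextended(sample))
```

With the scraper above, the equivalent call would be keep_unextended(a['href'] for a in soup.find_all('a', href=True)). Like the loop version, this depends on order: if a longer link such as ".../exotics" appeared before its shorter base link, both would be kept.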
