I am scraping a website for links with the following code to create an array:
from bs4 import BeautifulSoup
import requests
URL = "https://www.sportsbet.com.au/horse-racing/australia-nz/flemington/race-2-7681724"
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}
r = requests.get(url=URL, headers=headers)
datalinks = []
soup = BeautifulSoup(r.content, "html.parser")
for a in soup.find_all('a', href=True):
    if ("race" in a['href']):
        datalinks.append(a['href'])
print(datalinks)
And it returns the following array:
['/horse-racing/australia-nz/flemington/race-1-7681723', '/horse-racing/australia-nz/flemington/race-2-7681724', '/horse-racing/australia-nz/flemington/race-3-7681725', '/horse-racing/australia-nz/flemington/race-4-7681726', '/horse-racing/australia-nz/flemington/race-5-7681727', '/horse-racing/australia-nz/flemington/race-6-7681729', '/horse-racing/australia-nz/flemington/race-7-7681730', '/horse-racing/australia-nz/flemington/race-8-7681731', '/horse-racing/australia-nz/flemington/race-9-7681732', '/horse-racing/australia-nz/flemington/race-10-7681733', '/horse-racing/australia-nz/flemington/race-4-7681726', '/horse-racing/australia-nz/flemington/race-4-7681726/same-race-multi', '/horse-racing/australia-nz/flemington/race-4-7681726/exotics']
datalinks[3] is: /horse-racing/australia-nz/flemington/race-4-7681726
I would like to skip appending the last 3 elements, because an element already in the array (datalinks[3]) is a substring of each of them.
I can skip a link that is already in the array as an exact match:
if ("race" in a['href'] and a['href'] not in datalinks):
but I don’t know how to skip a link when any existing element is a substring of it.
If anyone could help, that would be greatly appreciated. Thanks.
>Solution :
You can modify your code to check whether the current element’s href contains any of the links already in datalinks. To do this, loop over the existing links and check whether each one is a substring of the current element’s href. Here’s how you can do it:
from bs4 import BeautifulSoup
import requests
URL = "https://www.sportsbet.com.au/horse-racing/australia-nz/flemington/race-2-7681724"
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}
r = requests.get(url=URL, headers=headers)
datalinks = []
soup = BeautifulSoup(r.content, "html.parser")
for a in soup.find_all('a', href=True):
    if ("race" in a['href']):
        append_link = True
        for existing_link in datalinks:
            if existing_link in a['href']:
                append_link = False
                break
        if append_link:
            datalinks.append(a['href'])
print(datalinks)
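
If you prefer something more compact, the inner loop and flag can probably be replaced with the built-in any(). The sketch below is only an illustration: the helper name filter_links and the hard-coded sample list are assumptions made so it runs without hitting the site, and in the scraper you would pass it the href values from soup.find_all('a', href=True) instead.

```python
def filter_links(hrefs, keyword="race"):
    """Keep hrefs that contain keyword, skipping any href that
    contains a link which has already been kept."""
    kept = []
    for href in hrefs:
        if keyword not in href:
            continue
        # any() stops at the first kept link that is a substring of href;
        # an exact duplicate also counts, since a string contains itself.
        if any(existing in href for existing in kept):
            continue
        kept.append(href)
    return kept


# Hypothetical sample taken from the output shown in the question.
sample = [
    "/horse-racing/australia-nz/flemington/race-3-7681725",
    "/horse-racing/australia-nz/flemington/race-4-7681726",
    "/horse-racing/australia-nz/flemington/race-4-7681726",
    "/horse-racing/australia-nz/flemington/race-4-7681726/same-race-multi",
    "/horse-racing/australia-nz/flemington/race-4-7681726/exotics",
]
result = filter_links(sample)
print(result)
```

Note that both versions depend on the order of the links: a longer link such as .../race-4-7681726/exotics is only skipped if the shorter .../race-4-7681726 was seen first. That holds for the output in the question, but if the page ever lists them the other way round, both would be kept.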