I'm scraping the links to all the posts on this subreddit (specifically, the top posts from the last 24 hours).
But when I run my program, it sometimes outputs all the data and other times outputs nothing, with exactly the same code. It works about one run in five.
import re

import requests
from bs4 import BeautifulSoup

# URL of subreddit
test = requests.get('https://www.reddit.com/r/TikTokCringe/top/')
# the HTML of the response
html = test.text
# making a soup of the HTML
soup = BeautifulSoup(html, 'html.parser')
# find_all returns every <a> element whose href contains '/r/TikTokCringe/comments/'; the slice keeps the first 30
for href in soup.find_all('a', {"href": re.compile('/r/TikTokCringe/comments/')})[:30]:
    # I'm looping through every element because I eventually want to get just the links
    # for now I'm just trying to print every element
    print(href)
Solution:
You're getting HTTP error 429 (Too Many Requests). Try slowing down your requests or setting the User-Agent HTTP header:
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}
# URL of subreddit
test = requests.get("https://reddit.com/r/TikTokCringe/top/", headers=headers)
...
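If you want to "slow down" automatically, here is a minimal sketch of retrying on 429 with a growing delay. The helper names fetch_with_backoff and get_top_page, the attempt count, and the delays are my own assumptions, not anything Reddit documents. It honors a Retry-After header only when the server sends one as a whole number of seconds.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until the response is not a 429 or the attempts run out.

    fetch is any zero-argument callable returning an object with
    .status_code and .headers (for example a requests.Response).
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            break
        if attempt < max_attempts - 1:
            # Prefer the server's hint when it is a plain number of seconds,
            # otherwise double the delay on every attempt (1s, 2s, 4s, ...).
            retry_after = response.headers.get("Retry-After", "")
            if retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
    return response


def get_top_page(headers):
    """Hypothetical wrapper around the request from the answer above."""
    import requests  # imported here so fetch_with_backoff has no hard dependency

    url = "https://reddit.com/r/TikTokCringe/top/"
    return fetch_with_backoff(lambda: requests.get(url, headers=headers, timeout=10))
```

With this, test = get_top_page(headers) would replace the plain requests.get call. You should still check test.status_code afterwards, because the helper returns the last 429 response if every attempt is rejected.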
Also, consider using Reddit's JSON format (add .json to the end of the URL):
data = requests.get(
    "https://reddit.com/r/TikTokCringe/top/.json", headers=headers
).json()
print(data)
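Since the goal in the question is to end up with just the links, here is a rough sketch of pulling them out of that JSON. It assumes the usual Reddit listing layout, where the posts sit under data -> children and each post has a permalink field. The function name extract_post_links and the limit of 30 are only illustrative.

```python
def extract_post_links(listing, limit=30):
    """Return full URLs for up to `limit` posts in a Reddit listing dict."""
    # Assumed layout: {"data": {"children": [{"data": {"permalink": "/r/..."}}]}}
    children = listing.get("data", {}).get("children", [])
    links = []
    for child in children[:limit]:
        permalink = child.get("data", {}).get("permalink")
        if permalink:
            # Permalinks are site-relative, so prepend the host.
            links.append("https://www.reddit.com" + permalink)
    return links
```

Calling extract_post_links(data) on the parsed response above should give a plain list of post URLs. Posts that lack a permalink field are skipped rather than raising an error.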