Stuck in loop webscraping with selenium

I’m trying to scrape leboncoin using Python and Selenium.

I had just gotten started when I noticed they use DataDome for bot detection, so I have to pass a captcha. Before trying to automate any of that (this question is not about that), I solved the captcha by hand in the Chromium browser that Selenium opens, and it didn’t work: whenever I solve it, the page goes straight back to the captcha. I can’t access the site; it’s stuck in a loop.

Here’s my code:

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument("--log-level=3")  # only log fatal errors
# Recent Selenium 4 releases no longer accept executable_path;
# the driver binary is passed through a Service object instead.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = "https://www.leboncoin.fr/voitures/2182521551.htm"
driver.get("https://www.leboncoin.fr")
driver.get(url)

# Keep the browser open long enough to solve the captcha by hand
time.sleep(100)

Solution:

Your code is fine.

The problem is that this kind of firewall is usually well protected against automated browsers such as Playwright and Selenium. (That is its job, after all: keeping bots off the site.)

One option is to tweak your Selenium browser’s configuration so that it mimics a regular Chrome installation and DataDome treats you as a real user.
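As a rough illustration of what such a configuration tweak can look like, here is a minimal sketch. The helper name `reduce_automation_signals` is made up for this example. It applies two commonly used Chrome settings to a Selenium `ChromeOptions` object; these only remove the most obvious automation markers, and nothing here is confirmed to be enough to get past DataDome.

```python
def reduce_automation_signals(options):
    """Apply two commonly used settings to a Selenium ChromeOptions object.

    Hypothetical helper for illustration only; it is not part of Selenium.
    """
    # Disable the Blink feature that sets navigator.webdriver to true
    # when the browser is driven by automation.
    options.add_argument("--disable-blink-features=AutomationControlled")
    # Ask ChromeDriver not to pass its default --enable-automation switch.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    return options

# Usage (needs Selenium and Chrome installed):
# from selenium import webdriver
# options = reduce_automation_signals(webdriver.ChromeOptions())
# driver = webdriver.Chrome(options=options)
# print(driver.execute_script("return navigator.webdriver"))
```

Checking `navigator.webdriver` afterwards is a quick way to see whether at least that one signal has changed.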

Another option is to look at what the payload sent to the firewall (in this case to ~/datadome.js) consists of and try to replicate it, by reverse engineering the JavaScript that constructs and sends the payload.
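To see what the browser actually sends, one approach is to turn on Chrome's performance log in Selenium and filter the recorded network events. The sketch below assumes that log was enabled through the `goog:loggingPrefs` capability. The function name `requests_matching` and the `"datadome"` URL fragment are assumptions for this example, so check the real endpoint in your own traffic.

```python
import json

def requests_matching(log_entries, url_fragment):
    """Return (url, post_data) pairs for logged requests whose URL contains url_fragment.

    log_entries is the list returned by driver.get_log("performance").
    """
    matches = []
    for entry in log_entries:
        # Each entry wraps a DevTools event as a JSON string.
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.requestWillBeSent":
            continue
        request = message["params"]["request"]
        if url_fragment in request["url"]:
            # postData is absent for requests without an inline body.
            matches.append((request["url"], request.get("postData")))
    return matches

# Usage (needs Selenium and Chrome installed):
# options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
# driver = webdriver.Chrome(options=options)
# driver.get(url)
# for request_url, body in requests_matching(driver.get_log("performance"), "datadome"):
#     print(request_url, body)
```

This only shows the payload as sent; working out how the script builds it still means reading the JavaScript.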

Keep in mind that they can also fingerprint you from other signals, such as your TLS configuration (e.g. cipher suites) or simply your IP address. Generally, if a company uses such a firewall, it does not want its site scraped, so avoid doing so if that is the case.
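To make the TLS point concrete, the sketch below uses Python's standard `ssl` module to list the cipher suites a client context would offer, then shows that the list changes when the configuration changes. The function name `offered_cipher_suites` and the `"ECDHE+AESGCM"` cipher string are chosen for this example. A real fingerprint would likely combine the cipher list with other handshake details, so treat this as one visible signal, not the whole picture.

```python
import ssl

def offered_cipher_suites(cipher_string=None):
    """Return the names of the cipher suites a default client context enables."""
    context = ssl.create_default_context()
    if cipher_string is not None:
        # Restricts TLS 1.2 and older suites; TLS 1.3 suites are not
        # affected by set_ciphers().
        context.set_ciphers(cipher_string)
    return [cipher["name"] for cipher in context.get_ciphers()]

default_suites = offered_cipher_suites()
restricted_suites = offered_cipher_suites("ECDHE+AESGCM")

print(len(default_suites), "suites with the default configuration")
print(len(restricted_suites), "suites after restricting the cipher string")
```

A browser and a Python HTTP client usually advertise different lists, which is one reason a server can tell them apart even when the headers look identical.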
