I am working on a Python app that will help me get reviews for a particular restaurant.
I am using Selenium 4.1 with Python for the scraping.
After I set up the Selenium driver in my project folder, I put this code together based on the Selenium documentation:
#YELP REVIEW SCRAPER #
#Importing Dependencies
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
# Setting up driver options
options = webdriver.ChromeOptions()
# Setting up Path to chromedriver executable file
CHROMEDRIVER_PATH = '../Selenium/chromedriver.exe'
# Adding options
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Setting up chrome service
service = ChromeService(executable_path=CHROMEDRIVER_PATH)
# Establishing Chrome web driver using the service and options set above
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.yelp.com/biz/taste-of-texas-houston')
This successfully opens the Yelp page of the restaurant I want to get reviews for, but when I tried to scrape the reviews using:
driver.find_element(By.CLASS_NAME, ' raw__09f24__T4Ezm')
where ' raw__09f24__T4Ezm' is the class name of the span containing the first review, I get the error:
InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified
(Session info: chrome=96.0.4664.45)
Stacktrace:
Backtrace:
Ordinal0 [0x00BD6903+2517251]
Ordinal0 [0x00B6F8E1+2095329]
Ordinal0 [0x00A72848+1058888]
Ordinal0 [0x00A74F44+1068868]
Ordinal0 [0x00A74E0E+1068558]
Ordinal0 [0x00A75070+1069168]
Ordinal0 [0x00A9D1C2+1233346]
Ordinal0 [0x00A9D63B+1234491]
Ordinal0 [0x00AC7812+1406994]
Ordinal0 [0x00AB650A+1336586]
Ordinal0 [0x00AC5BBF+1399743]
Ordinal0 [0x00AB639B+1336219]
Ordinal0 [0x00A927A7+1189799]
Ordinal0 [0x00A93609+1193481]
GetHandleVerifier [0x00D65904+1577972]
GetHandleVerifier [0x00E10B97+2279047]
GetHandleVerifier [0x00C66D09+534521]
GetHandleVerifier [0x00C65DB9+530601]
Ordinal0 [0x00B74FF9+2117625]
Ordinal0 [0x00B798A8+2136232]
Ordinal0 [0x00B799E2+2136546]
Ordinal0 [0x00B83541+2176321]
BaseThreadInitThunk [0x757C6739+25]
RtlGetFullPathName_UEx [0x773B8AFF+1215]
RtlGetFullPathName_UEx [0x773B8ACD+1165]
I tried researching this error but had no luck.
Any idea how to modify my code so I can get all available reviews for this restaurant, including the date, reviewer, score, and text of each review?
>Solution :
I don't know how to parse the data with Selenium itself, since I use BeautifulSoup for that. Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window and silence its console logging
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)

# Load the page, grab the rendered HTML, then close the browser
driver.get('https://www.yelp.com/biz/taste-of-texas-houston')
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, features="lxml")

# Select the <li> elements whose class attribute is exactly this string
reviews = soup.find_all("li", attrs={'class': 'margin-b5__09f24__pTvws border-color--default__09f24__NPAKY'})
for review in reviews:
    print(review.text)
From there you can parse each item further to pull out the data you need, such as the date, reviewer, score, and review text.
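As for the InvalidSelectorException itself: as far as I can tell, Selenium 4's Python bindings turn a By.CLASS_NAME lookup into a CSS selector by putting a dot in front of the value, so the leading space in ' raw__09f24__T4Ezm' produces '. raw__09f24__T4Ezm', which is not a valid selector. The same thing happens when you pass several space-separated class names. Below is a minimal sketch of one way around it. The helper names class_attr_to_css and find_by_class_attr are made up for this example, and it assumes the class names need no CSS escaping (true for the ones in this question). Keep in mind that Yelp's class names look auto-generated, so they may change over time.

```python
def class_attr_to_css(class_attr: str) -> str:
    """Turn an HTML class attribute value into a CSS selector.

    str.split() drops leading/trailing whitespace and separates multiple
    class names, so " a  b " becomes ".a.b".
    Assumption: the class names contain no characters that need CSS escaping.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute contains no class names")
    return "".join("." + name for name in names)


def find_by_class_attr(driver, class_attr: str):
    """Return every element that carries all classes listed in class_attr.

    `driver` is expected to be a Selenium WebDriver instance.
    """
    # Imported here so class_attr_to_css stays usable without Selenium installed.
    from selenium.webdriver.common.by import By

    return driver.find_elements(By.CSS_SELECTOR, class_attr_to_css(class_attr))
```

With the driver from the question, find_by_class_attr(driver, ' raw__09f24__T4Ezm') should then return the matching spans as a list, and you can read each one's .text. Using find_elements rather than find_element returns an empty list instead of raising when nothing matches.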