R: Webscraping Pizza Shops – "read_html" not working?

I am working with the R programming language.

I am trying to scrape the name and address of the pizza stores listed on this website, across all of its result pages (e.g. https://www.yellowpages.ca/search/si/2/pizza/Canada, https://www.yellowpages.ca/search/si/3/pizza/Canada, https://www.yellowpages.ca/search/si/4/pizza/Canada, etc.).

I am trying to follow the answer provided here: Scraping Yellowpages in R

library(rvest)
library(stringr)

url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"



testscrape <- function(url){
  webpage <- read_html(url)
  
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  # Pad the shorter vectors with NA so every column has the same length
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble::tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)

However, this code takes a very long time to run. I investigated by running the parts of the function individually, and I think I found the problem: the read_html() call itself does not seem to work. I tried replacing it with another statement:

 library(httr)
 webpage <- GET(url)

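This runs, but GET() returns an httr response object rather than the parsed HTML document that read_html() produces, so the format is not the same and the rest of the function no longer works as written.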

Can someone please show me how to do this?

In the end, I would like the output to look something like this:

  id                 name                                       address
1  1   OJ's Steak & Pizza 9906B Franklin Ave, Fort McMurray, AB T9H 2K5
2  2    MJs Pizza & Grill 10012 Franklin Ave, Fort McMurray, AB T9H 2K6
3  3 Hu's Pizza & Donairs 10020 Franklin Ave, Fort McMurray, AB T9H 2K6

# sample results

sample_results = structure(list(id = c(1, 2, 3), name = c("OJ's Steak & Pizza", 
"MJs Pizza & Grill", "Hu's Pizza & Donairs"), address = c("9906B Franklin Ave, Fort McMurray, AB T9H 2K5", 
"10012 Franklin Ave, Fort McMurray, AB T9H 2K6", "10020 Franklin Ave, Fort McMurray, AB T9H 2K6"
)), class = "data.frame", row.names = c(NA, -3L))

Thanks!

Solution:

Fast, but not robust. (If any listing is missing either its name or its address, the two columns will have different lengths and the code will break, I think.)

library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}

scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")

# A tibble: 35 x 2
   name                                  address                                 
   <chr>                                 <chr>                                   
 1 OJ's Steak & Pizza                    9906B Franklin Ave, Fort McMurray, AB T~
 2 MJs Pizza & Grill                     10012 Franklin Ave, Fort McMurray, AB T~
 3 Hu's Pizza & Donairs                  10020 Franklin Ave, Fort McMurray, AB T~
 4 Eagle Ridge Convenience Store & Pizza 117-375 Loutit Rd, Fort McMurray, AB T9~
 5 Cosmos Pizza                          9713 Hardin St, Fort McMurray, AB T9H 1~
 6 Boston Pizza                          10202 MacDonald Ave, Fort McMurray, AB ~
 7 Jomaa's Pizza & Chicken               Beacon Hill Shpg Plaza, Fort McMurray, ~
 8 Abasand PK's Pizza                    101-307 Athabasca Ave, Fort McMurray, A~
 9 Pizza 73                              1-289 Powder Dr, Ft McMurray, AB T9K 0M5
10 Boston Pizza                          110 Millennium Dr, Fort McMurray, AB T9~
# ... with 25 more rows
# i Use `print(n = ...)` to see more rows
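
A possible way to address the robustness caveat is to select each listing's container first and then look up the name and address inside it, since html_element() returns one result per input and yields NA when nothing matches. The sketch below is only that, a sketch under assumptions: the ".listing" container selector is hypothetical and would need to be checked against the live page, the helper names (page_url, fetch_page, parse_listings) are made up for illustration, and the 10-second timeout and user-agent string are arbitrary choices that may or may not help with the slow read_html() call. The parsing step is demonstrated on a small inline HTML snippet rather than on the real site.

```r
library(rvest)
library(httr)

# Build the URL for results page i, following the pattern quoted in the question.
page_url <- function(i) {
  sprintf("https://www.yellowpages.ca/search/si/%d/pizza/Canada", i)
}

# Fetch a page with an explicit timeout so that a stalled request fails fast.
# The timeout value and user-agent string are arbitrary, illustrative choices.
fetch_page <- function(url, seconds = 10) {
  resp <- GET(url, timeout(seconds), user_agent("Mozilla/5.0"))
  stop_for_status(resp)
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Extract one row per listing. ASSUMPTION: every result sits inside an element
# matching `container`; ".listing" is a hypothetical selector.
parse_listings <- function(page, container = ".listing") {
  listings <- html_elements(page, container)
  data.frame(
    name = html_text2(html_element(listings, ".jsListingName")),
    address = html_text2(html_element(listings, ".listing__address--full")),
    stringsAsFactors = FALSE
  )
}

# Offline demonstration: the second listing has no address.
demo_page <- minimal_html('
  <div class="listing">
    <a class="jsListingName">Shop A</a>
    <span class="listing__address--full">1 Main St</span>
  </div>
  <div class="listing">
    <a class="jsListingName">Shop B</a>
  </div>')
demo_result <- parse_listings(demo_page)
print(demo_result)

# Against the live site (not run here), several pages could be combined
# and numbered like this:
# pages <- lapply(2:4, function(i) parse_listings(fetch_page(page_url(i))))
# result <- do.call(rbind, pages)
# result$id <- seq_len(nrow(result))
```

With this approach a listing that lacks an address keeps its row and simply gets NA in that column, instead of shifting every later address up by one or breaking the table.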