Efficiently extracting multiple data points when web scraping in R

This is no doubt very simple, so apologies, but I am new to web scraping and am trying to extract multiple data points in one call using rvest. Take the following code as an example (NB: I have replaced the actual website with xxxxxx.com in this snippet):

univsalaries <- lapply(paste0('https://xxxxxx.com/job/p', 1:20,'/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
                   function(url_base){
                     url_base %>% read_html() %>% 
                       html_nodes('.salary') %>% 
                       html_text()
                   })

Let’s say there is another HTML node I want to scrape (.company). Obviously I can make a separate call and fetch that data, but I want to understand the syntax for extracting both pieces of information in the same call.

I tried the following structure, but the code sent me to the debugger:

 ....     function(url_base){
                                  url_base %>% read_html() %>% 
                                    Salary <- univsalaries %>% 
                                    html_nodes('.salary') %>% html_text()
                                    Company <- univsalaries %>% 
                                      html_nodes('.company') %>% html_text()
                                    dt<-tibble(Salary,Company) 
                                })

Solution:

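Read the webpage once, store the parsed page in a variable, and then extract as many values as you need from it. Your attempt fails because it pipes the result of read_html() into an assignment and then reads from univsalaries, which does not exist yet inside the function.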

library(purrr)
library(rvest)

univsalaries <- map(paste0('https://xxxxxx.com/job/p', 1:20,'/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
                       function(url_base){
                         # Download and parse the page once
                         webpage <- url_base %>% read_html() 
                         # Extract both fields from the same parsed page
                         data.frame(Salary = webpage %>% html_nodes('.salary') %>% html_text(), 
                                    Company = webpage %>% html_nodes('.company') %>% html_text())
                       })

This gives you a list of data frames (one per link). If you need a single combined data frame, use map_df instead of map (map_df needs the dplyr package to be installed).
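To try the parse-once pattern without hitting a real site, here is a minimal sketch that runs on inline HTML. The two page strings, the scrape_page helper, and the values in them are made up for illustration; only the .salary and .company selectors come from the question. It also combines the per-page results with base R's rbind, as one possible alternative to map_df.

```r
library(rvest)  # also makes read_html() and %>% available

# Hypothetical stand-in pages; in practice each would come from read_html(url).
pages <- list(
  '<div><span class="salary">50k</span><span class="company">Acme</span></div>',
  '<div><span class="salary">65k</span><span class="company">Globex</span></div>'
)

# Parse each page once, then pull both fields from the same parsed document.
scrape_page <- function(html) {
  webpage <- read_html(html)
  data.frame(
    Salary  = webpage %>% html_nodes(".salary")  %>% html_text(),
    Company = webpage %>% html_nodes(".company") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

per_page <- lapply(pages, scrape_page)  # list of data frames, one per page
combined <- do.call(rbind, per_page)    # one data frame with all rows
```

Note that this assumes every page has the same number of .salary and .company nodes. If a listing is missing one of them, data.frame() will typically stop with an error about differing numbers of rows, so on a real site it may be safer to select each listing's container first and extract both fields within it.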
