How to web scrape table element using rvest?

I am looking to scrape data from this carrier link. I am using the rvest package in R, and I've scraped some of the information at the top of the page with the code below:

library(rvest)

url <- "https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true"
page <- read_html(url)

# Extract the second table on the page
table <- page %>% html_nodes("table") %>% .[[2]] %>% html_table()

# Open the table in the data viewer
View(table)

Which yields this information:
pic1

However, I am looking to retrieve the information from the Tracing Information table in a tabular format instead:
pic2

Solution:

Here’s a mundane method:

library(rvest)
sess <- session("https://www.aaacooper.com/pwb/Transit/ProTrackResults.aspx?ProNum=241939875&AllAccounts=true")
html_table(sess)[[9]]
# # A tibble: 10 × 3
#    Date       Time  Description                                               
#    <chr>      <chr> <chr>                                                     
#  1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL                   
#  2 2022-06-24 04:22 Shipment arrived at destination Service Center   TAMPA, FL
#  3 2022-06-24 03:02 Shipment departed ORLANDO Service Center                  
#  4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center                
#  5 2022-06-22 22:54 Shipment departed DOTHAN Service Center                   
#  6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center                 
#  7 2022-06-21 10:36 Shipment departed HOUSTON Service Center                  
#  8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center                
#  9 2022-06-20 19:59 Shipment departed WESLACO Service Center                  
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX            

The use of [[9]] was based on looking through all the tables returned by html_table(); nothing guarantees that the index will stay the same if the page changes.

A more robust way to find a table is to look for specific attributes, headers, names, or ids, which are easiest to find with SelectorGadget.
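
As one illustration of matching on headers rather than position, here is a minimal sketch that keeps only the table whose column names are Date, Time and Description. The inline HTML is a made-up stand-in for the carrier page (its first table is invented), so the snippet runs without network access; on the real page the same filter would be applied to html_table(sess), and it may need adjusting if nested tables produce unexpected column names.

```r
library(rvest)

# Hypothetical stand-in for the real page: two tables, only one of which
# carries the Date/Time/Description header we are after.
page <- minimal_html('
  <table>
    <tr><th>Pro Number</th><th>Status</th></tr>
    <tr><td>241939875</td><td>Delivered</td></tr>
  </table>
  <table>
    <tr><th>Date</th><th>Time</th><th>Description</th></tr>
    <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
  </table>
')

# Parse every table, then keep the one whose column names match exactly.
tables <- html_table(page)
wanted <- c("Date", "Time", "Description")
is_trace <- vapply(tables, function(tb) identical(names(tb), wanted), logical(1))

# Fail loudly if the page layout changed and the match is no longer unique.
stopifnot(sum(is_trace) == 1)
trace <- tables[is_trace][[1]]
```

Unlike a hard-coded [[9]], this keeps working if tables are added or reordered, as long as the header text itself does not change.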

A closer look at the page source reveals that the parent node of that table has class="tracingInformation", so we can do this:

html_element(sess, ".tracingInformation") %>%
  html_children() %>%
  html_table()
# [[1]]
# # A tibble: 10 × 3
#    Date       Time  Description                                               
#    <chr>      <chr> <chr>                                                     
#  1 2022-06-24 13:02 Delivered To Consignee In BRADENTON, FL                   
#  2 2022-06-24 04:22 Shipment arrived at destination Service Center   TAMPA, FL
#  3 2022-06-24 03:02 Shipment departed ORLANDO Service Center                  
#  4 2022-06-23 06:34 Shipment arrived at ORLANDO Service Center                
#  5 2022-06-22 22:54 Shipment departed DOTHAN Service Center                   
#  6 2022-06-21 22:52 Shipment arrived at DOTHAN Service Center                 
#  7 2022-06-21 10:36 Shipment departed HOUSTON Service Center                  
#  8 2022-06-21 03:15 Shipment arrived at HOUSTON Service Center                
#  9 2022-06-20 19:59 Shipment departed WESLACO Service Center                  
# 10 2022-06-20 12:21 Shipment Picked Up From Shipper In WESLACO, TX            
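
Note that the call above returns a list of length one (hence the [[1]] in the output). If a single tibble is preferred, a descendant selector is one option. The sketch below assumes, purely for illustration, that the table sits inside a wrapper element with class="tracingInformation"; the inline HTML is a simplified stand-in, not the real page markup.

```r
library(rvest)

# Simplified, hypothetical markup: a wrapper with the class from the answer
# and the tracing table nested inside it.
page <- minimal_html('
  <div class="tracingInformation">
    <table>
      <tr><th>Date</th><th>Time</th><th>Description</th></tr>
      <tr><td>2022-06-24</td><td>13:02</td><td>Delivered To Consignee In BRADENTON, FL</td></tr>
      <tr><td>2022-06-24</td><td>03:02</td><td>Shipment departed ORLANDO Service Center</td></tr>
    </table>
  </div>
')

# html_element() returns only the first match, so html_table() yields
# one tibble instead of a list of tibbles.
trace <- page %>%
  html_element(".tracingInformation table") %>%
  html_table()
```

Against the live page, the same selector would be applied to sess instead of page, provided the real nesting matches this assumption.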

Here is the walkthrough of how I found that. I’m using Firefox; I’m confident other browsers have the same or very similar keys/tabs/names.

  1. Open that url in a browser.
  2. Once loaded, hit F12 (or whatever key enters the browser’s dev console).
  3. Select "Pick an element" and select a cell in the table you want. (In FF, this is a small button to the left of "Inspector".)
  4. Find the first reference to <table> above the cell. If this doesn’t have an unambiguous id= or class= (as in this example, I thought id="AAACooperMasterPage_bodyContent_grdViewTraceInfo" was a bit obscure/automated), go up a little higher until you find a clear id= or class=. In this case, I found that the table we want is encased in another table with class="tracingInformation".
  5. Use that in html_element(..).
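
The id found in step 4 can also be used directly with a "#" selector, if you are willing to rely on an auto-generated name. This is a minimal sketch on made-up inline HTML that reuses only the id quoted above; whether that id stays stable on the real site is an assumption.

```r
library(rvest)

# Hypothetical stand-in page that carries the id seen in the dev console.
page <- minimal_html('
  <table id="AAACooperMasterPage_bodyContent_grdViewTraceInfo">
    <tr><th>Date</th><th>Time</th><th>Description</th></tr>
    <tr><td>2022-06-20</td><td>12:21</td><td>Shipment Picked Up From Shipper In WESLACO, TX</td></tr>
  </table>
')

node <- html_element(page, "#AAACooperMasterPage_bodyContent_grdViewTraceInfo")

# html_element() returns a missing node when nothing matches; check for it
# so a changed page layout produces a clear error.
if (inherits(node, "xml_missing")) {
  stop("Tracing table not found; the page layout may have changed.")
}

trace <- html_table(node)
```

On the live page, sess would take the place of page in the html_element() call.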