How to scrape hierarchical web data into tabular format using rvest?

September 11, 2024

I am generally familiar with rvest. I know the difference between html_elements() and html_element(). But I can’t get my head around this problem:

Suppose we have data like that on this webpage (the URL is in the code below). The data is hierarchical, and each header has a different number of subheadings.

When I scrape the page, I get 177 headers but 270 subheadings. I want to extract the data into a tidy format, but because the two vectors have different lengths, I can’t easily combine them into a tibble.

Here is my code with some comments about the results:

library(rvest)

page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")

person_departments <- page %>%
  html_elements(".item-list") %>%
  html_element("h3") %>%
  html_text2()
# The above code returns 177 department headers

person_names <- page %>%
  html_elements(".item-list li") %>%
  html_element("h4") %>%
  html_text2()
# This one returns 270 names (some departments have more than one admin)

# With the code above, I can't get a nice table with two columns:
# one for the person's name and one for the person's department.

Solution:

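The main trick is to get the children of the ".item-list" elements with html_children(); after that, it is a matter of processing them into a table. The children alternate: each odd-numbered element is a department heading, and each even-numbered element holds that department's persons. Vacant positions need special handling, because they come through as a single piece of text with no phone or email.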
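To see the alternation on something small, here is a minimal sketch on a made-up HTML fragment built with rvest's minimal_html(). The markup (an h3 followed by a ul inside each .item-list) is an assumption about how the page is laid out, not a copy of it, and the names `toy`, `kids`, `dept_nodes` and `person_nodes` are only illustrative.

```r
library(rvest)

# Hypothetical fragment; the real page's markup is assumed to look like this
toy <- minimal_html('
  <div class="item-list">
    <h3>Dept A</h3>
    <ul><li><h4>Ann Lee</h4></li><li><h4>Bob Ray</h4></li></ul>
  </div>
  <div class="item-list">
    <h3>Dept B</h3>
    <ul><li><h4>Cy Moss</h4></li></ul>
  </div>')

# html_children() on a node set returns all children in document order
kids <- toy |>
  html_elements(".item-list") |>
  html_children()

# Tag names of the flattened children: heading, list, heading, list
kid_tags <- html_name(kids)
kid_tags

# Odd positions are the headings, even positions the person lists
dept_nodes   <- kids[c(TRUE, FALSE)]
person_nodes <- kids[c(FALSE, TRUE)]
```

Because the children come back as one flat sequence, a recycled logical index such as c(TRUE, FALSE) is enough to separate headings from person lists.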

library(rvest)

link <- "https://postdocs.stanford.edu/about/department-postdoc-admins"
page <- read_html(link)

person_table <- page |>
  html_elements(".item-list") |>
  html_children() |>
  html_text2() |>
  strsplit("\nPhone:|\n|Email:") |>
  lapply(trimws)

# recycled logical index: TRUE selects the odd (department) elements
i_odd <- c(TRUE, FALSE)
# number of persons in each department (name, phone, email = 3 pieces each)
r <- lengths(person_table[!i_odd]) %/% 3L
# a vacant position has a single piece ("Vacant"); count it as 1 person
r[r == 0L] <- 1L
Department <- person_table[i_odd] |> unlist() |> rep(r)
Person <- lapply(person_table[!i_odd], \(x) {
  if (length(x) == 1L) {
    # vacant position: pad phone and email with empty strings
    matrix(c(x, "", ""), nrow = 1L)
  } else {
    matrix(x, ncol = 3L, byrow = TRUE)
  }
}) |> do.call(rbind, args = _)

result <- cbind(Department, Person) |>
  as.data.frame() |>
  setNames(c("Department", "Name", "Phone", "Email"))

str(result)
#> 'data.frame':    270 obs. of  4 variables:
#>  $ Department: chr  "Advanced Residency Training at Stanford" "Aeronautics and Astronautics" "African & African-Amer Studies" "Anesthes, Periop & Pain Med" ...
#>  $ Name      : chr  "Sofia Gonzales" "Jenny Scholes" "Ashante Johnson" "Natalie Darling-Cabrera" ...
#>  $ Phone     : chr  "(650) 724-9139" "(510) 468-5967" "(650) 721-3969" "(650) 497-0648" ...
#>  $ Email     : chr  "sofias@stanford.edu" "jscholes@stanford.edu" "ashantej@stanford.edu" "ndarling@stanford.edu" ...
head(result)
#>                                Department                    Name
#> 1 Advanced Residency Training at Stanford          Sofia Gonzales
#> 2            Aeronautics and Astronautics           Jenny Scholes
#> 3          African & African-Amer Studies         Ashante Johnson
#> 4             Anesthes, Periop & Pain Med Natalie Darling-Cabrera
#> 5             Anesthes, Periop & Pain Med          Ashley Johnson
#> 6             Anesthes, Periop & Pain Med        Jessica Martinez
#>            Phone                 Email
#> 1 (650) 724-9139   sofias@stanford.edu
#> 2 (510) 468-5967 jscholes@stanford.edu
#> 3 (650) 721-3969 ashantej@stanford.edu
#> 4 (650) 497-0648 ndarling@stanford.edu
#> 5 (650) 721-7212 ashley85@stanford.edu
#> 6 (650) 497-8189 jimenez5@stanford.edu
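
The alignment step can also be tried without touching the network. The sketch below uses base R only and starts from hypothetical, already-split pieces (the departments, names, phone numbers and emails are invented), so it shows only how rep() and matrix() line departments up with persons, including a vacant slot.

```r
# Hypothetical, already-split pieces: odd elements are departments,
# even elements hold name/phone/email triples (or just "Vacant")
person_table <- list(
  "Dept A",
  c("Ann Lee", "(650) 555-0100", "ann@example.edu",
    "Bob Ray", "(650) 555-0101", "bob@example.edu"),
  "Dept B",
  "Vacant"
)

i_odd <- c(TRUE, FALSE)                     # recycled: TRUE, FALSE, TRUE, ...
r <- lengths(person_table[!i_odd]) %/% 3L   # persons per department
r[r == 0L] <- 1L                            # a vacant slot still gets one row

# Repeat each department once per person so both sides have equal length
Department <- rep(unlist(person_table[i_odd]), r)

# One row per person; vacant slots are padded with empty phone and email
Person <- do.call(rbind, lapply(person_table[!i_odd], function(x) {
  if (length(x) == 1L) {
    matrix(c(x, "", ""), nrow = 1L)
  } else {
    matrix(x, ncol = 3L, byrow = TRUE)
  }
}))

toy_result <- data.frame(Department, Person)
names(toy_result) <- c("Department", "Name", "Phone", "Email")
toy_result
```

The integer division by 3 works only while every person contributes exactly three pieces; if the real text splits differently, the count would need to be adapted.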

Created on 2024-09-11 with reprex v2.1.0
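
If only the two columns from the question (name and department) are needed, a different route is to loop over the ".item-list" groups and build one small data frame per department. This is a sketch of that alternative, not the method used above; it runs on a made-up fragment, and both the markup and the choice to skip departments without any h4 are assumptions.

```r
library(rvest)

# Hypothetical fragment imitating the assumed page structure
toy <- minimal_html('
  <div class="item-list">
    <h3>Dept A</h3>
    <ul>
      <li><h4>Ann Lee</h4></li>
      <li><h4>Bob Ray</h4></li>
    </ul>
  </div>
  <div class="item-list">
    <h3>Dept B</h3>
    <ul><li><h4>Cy Moss</h4></li></ul>
  </div>')

groups <- html_elements(toy, ".item-list")

# One data frame per department: the heading is repeated for each name.
# A department with no <h4> yields zero rows here (an assumption; the
# real page may mark vacancies differently).
rows <- lapply(groups, function(g) {
  dept <- html_text2(html_element(g, "h3"))
  nm   <- html_text2(html_elements(g, "li h4"))
  data.frame(Department = rep(dept, length(nm)), Name = nm)
})

by_group <- do.call(rbind, rows)
by_group
```

Working group by group keeps each name attached to its own heading, so the differing vector lengths from the question never arise.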