How to scrape hierarchical web data into tabular format using rvest?

I am generally familiar with rvest. I know the difference between html_elements() and html_element(). But I can’t get my head around this problem:

Suppose we have data like that on this webpage. The data is hierarchical, and each header has a different number of subheadings.

When I scrape the page, I get 177 headers but 270 subheadings. I want to extract the data into a tidy format, but because the two vectors differ in length, I can’t simply combine them into a tibble.

Here is my code with some comments about the results:

library(rvest)

page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")

person_departments <- page %>%
  html_elements(".item-list") %>%
  html_element("h3") %>%
  html_text2()
# The above code returns 177 department headers

person_names <- page %>% 
  html_elements(".item-list li") %>% 
  html_element("h4") %>% 
  html_text2()
# This one returns 270 names (some departments have more than 1 admin)

# With the code above, I can't build a clean two-column table: one column for the name and one for the person's department.

Solution:

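The main trick is to get the children of the ".item-list" elements; after that, it is a matter of reshaping them into a table. In the flattened list of children, the odd-numbered elements are the departments and the even-numbered elements hold the persons of the preceding department. Vacant department positions need special handling, because they consist of a single entry with no phone or email.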
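As a toy illustration of the odd/even bookkeeping, here is a minimal base R sketch. The list `toy` and every value in it are invented stand-ins for the scraped text, not data from the real page; it only assumes three fields per person and a single string for a vacant post.

```r
# Invented stand-in for the scraped list: odd elements are departments,
# even elements hold the fields of that department's persons
# (name, phone, email), or a single "Vacant" string.
toy <- list(
  "Dept A", c("Ann", "111", "ann@example.org", "Bob", "222", "bob@example.org"),
  "Dept B", "Vacant",
  "Dept C", c("Cy", "333", "cy@example.org")
)

i_odd <- c(TRUE, FALSE)                    # recycled: TRUE, FALSE, TRUE, FALSE, ...
n_people <- lengths(toy[!i_odd]) %/% 3L    # 3 fields per person
n_people[n_people == 0L] <- 1L             # a vacant post still gets one row
dept <- rep(unlist(toy[i_odd]), n_people)  # one department label per person
print(dept)
```

Because `i_odd` is shorter than the list, R recycles it, so `toy[i_odd]` picks elements 1, 3, 5 and `toy[!i_odd]` picks elements 2, 4, 6.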

library(rvest)

link <- "https://postdocs.stanford.edu/about/department-postdoc-admins"
page <- read_html(link)

person_table <- page |>
  html_elements(".item-list") |>
  html_children() |>
  html_text2() |>
  strsplit("\nPhone:|\n|Email:") |>
  lapply(trimws)

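# logical index recycled along the list: TRUE picks the odd (department) elements
i_odd <- c(TRUE, FALSE)
# number of persons in each dept (each person has 3 fields: name, phone, email)
r <- lengths(person_table[!i_odd]) %/% 3L
# a "Vacant" entry has a single field, so count it as 1 person (named "Vacant", see below)
r[r == 0L] <- 1L
# repeat each department name once per person
Department <- person_table[i_odd] |> unlist() |> rep(r)
Person <- lapply(person_table[!i_odd], \(x) {
  if (length(x) == 1L) {
    # vacant position: pad the missing phone and email with empty strings
    matrix(c(x, "", ""), nrow = 1L)
  } else {
    # one row per person: name, phone, email
    matrix(x, ncol = 3L, byrow = TRUE)
  }
}) |> do.call(rbind, args = _)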

result <- cbind(Department, Person) |>
  as.data.frame() |>
  setNames(c("Department", "Name", "Phone", "Email"))

str(result)
#> 'data.frame':    270 obs. of  4 variables:
#>  $ Department: chr  "Advanced Residency Training at Stanford" "Aeronautics and Astronautics" "African & African-Amer Studies" "Anesthes, Periop & Pain Med" ...
#>  $ Name      : chr  "Sofia Gonzales" "Jenny Scholes" "Ashante Johnson" "Natalie Darling-Cabrera" ...
#>  $ Phone     : chr  "(650) 724-9139" "(510) 468-5967" "(650) 721-3969" "(650) 497-0648" ...
#>  $ Email     : chr  "sofias@stanford.edu" "jscholes@stanford.edu" "ashantej@stanford.edu" "ndarling@stanford.edu" ...
head(result)
#>                                Department                    Name
#> 1 Advanced Residency Training at Stanford          Sofia Gonzales
#> 2            Aeronautics and Astronautics           Jenny Scholes
#> 3          African & African-Amer Studies         Ashante Johnson
#> 4             Anesthes, Periop & Pain Med Natalie Darling-Cabrera
#> 5             Anesthes, Periop & Pain Med          Ashley Johnson
#> 6             Anesthes, Periop & Pain Med        Jessica Martinez
#>            Phone                 Email
#> 1 (650) 724-9139   sofias@stanford.edu
#> 2 (510) 468-5967 jscholes@stanford.edu
#> 3 (650) 721-3969 ashantej@stanford.edu
#> 4 (650) 497-0648 ndarling@stanford.edu
#> 5 (650) 721-7212 ashley85@stanford.edu
#> 6 (650) 497-8189 jimenez5@stanford.edu

Created on 2024-09-11 with reprex v2.1.0
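
An alternative that avoids index arithmetic is to loop over each ".item-list" block and read its department and names together. The sketch below has not been run against the live page: it uses an invented HTML fragment built with rvest's minimal_html(), and it assumes each block holds one h3 plus li items that each contain an h4, as the selectors in the question suggest.

```r
library(rvest)

# Invented fragment mimicking the assumed structure of the real page
toy_page <- minimal_html('
  <div class="item-list"><h3>Dept A</h3>
    <ul><li><h4>Ann</h4></li><li><h4>Bob</h4></li></ul></div>
  <div class="item-list"><h3>Dept B</h3>
    <ul><li><h4>Cy</h4></li></ul></div>
')

blocks <- html_elements(toy_page, ".item-list")

# One small data frame per block; the single department name is
# recycled to match the number of names found in that block.
by_block <- lapply(blocks, function(b) {
  data.frame(
    Department = html_text2(html_element(b, "h3")),
    Name       = html_text2(html_elements(b, "li h4"))
  )
})

toy_result <- do.call(rbind, by_block)
print(toy_result)
```

Note that data.frame() raises an error if a block contains no h4 at all, so a real page with vacant positions marked differently would need an extra check inside the function.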
