I am generally familiar with rvest, and I know the difference between html_elements() and html_element(). But I can’t get my head around this problem:
Suppose we have data like that on the webpage used in the code below. The data is hierarchical, and each header has a different number of subheadings.
When I scrape the page, I get 177 headers but 270 subheadings. I want to extract the data into a tidy format, but because the vectors have different lengths, I can’t simply combine them into a tibble.
Here is my code with some comments about the results:
library(rvest)
page <- read_html("https://postdocs.stanford.edu/about/department-postdoc-admins")
person_departments <- page %>%
  html_elements(".item-list") %>%
  html_element("h3") %>%
  html_text2()
# The above code returns 177 departments
person_names <- page %>%
  html_elements(".item-list li") %>%
  html_element("h4") %>%
  html_text2()
# This one returns 270 names (some departments have more than 1 admin)
# Using the above code, I can't get a nice table with two columns, one for the name and one for the person's department.
Solution:
The main trick is to get the children of the "item-list" elements; after that, it is a matter of processing them into a table. The children alternate: each odd-numbered element is a department, and each even-numbered element holds that department's persons. Vacant department positions need special care, because they contain a single text field instead of a name, phone and email.
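To make the odd/even idea concrete, here is a minimal base-R sketch. The `children` list is a made-up stand-in for the scraped children (an assumption for illustration, not the real page data), and `demo` is a hypothetical name.

```r
# Hypothetical stand-in for the scraped children: header, members, header, members
children <- list("Dept A", c("Ann", "Bob"), "Dept B", "Cy")

# c(TRUE, FALSE) is recycled along the list, so it selects positions 1, 3, ...
is_header <- c(TRUE, FALSE)
headers <- unlist(children[is_header])
members <- children[!is_header]

# Repeat each header once per member so both columns have the same length
demo <- data.frame(
  Department = rep(headers, lengths(members)),
  Name = unlist(members)
)
```

Repeating each header by the size of its group is what lets the two vectors of different lengths line up row by row.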
library(rvest)
link <- "https://postdocs.stanford.edu/about/department-postdoc-admins"
page <- read_html(link)
person_table <- page |>
  html_elements(".item-list") |>
  html_children() |>
  html_text2() |>
  # split each child's text into its fields (name, phone, email)
  strsplit("\nPhone:|\n|Email:") |>
  lapply(trimws)
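# logical index, recycled along the list: TRUE selects the odd positions (departments)
i_odd <- c(TRUE, FALSE)
# number of persons in each dept: every person contributes 3 fields (name, phone, email)
r <- lengths(person_table[!i_odd]) %/% 3L
# a "Vacant" entry has a single field, so count it as 1 person (named "Vacant", see below)
r[r == 0L] <- 1L
# repeat each department name once per person
Department <- person_table[i_odd] |> unlist() |> rep(r)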
Person <- lapply(person_table[!i_odd], \(x) {
  # vacant positions have only one field; pad phone and email with ""
  if (length(x) == 1L) {
    matrix(c(x, "", ""), nrow = 1L)
  } else {
    # one row per person: name, phone, email
    matrix(x, ncol = 3L, byrow = TRUE)
  }
}) |> do.call(rbind, args = _)  # the `_` placeholder needs R >= 4.2
result <- cbind(Department, Person) |>
  as.data.frame() |>
  setNames(c("Department", "Name", "Phone", "Email"))
str(result)
#> 'data.frame':    270 obs. of  4 variables:
#>  $ Department: chr  "Advanced Residency Training at Stanford" "Aeronautics and Astronautics" "African & African-Amer Studies" "Anesthes, Periop & Pain Med" ...
#>  $ Name      : chr  "Sofia Gonzales" "Jenny Scholes" "Ashante Johnson" "Natalie Darling-Cabrera" ...
#>  $ Phone     : chr  "(650) 724-9139" "(510) 468-5967" "(650) 721-3969" "(650) 497-0648" ...
#>  $ Email     : chr  "sofias@stanford.edu" "jscholes@stanford.edu" "ashantej@stanford.edu" "ndarling@stanford.edu" ...
head(result)
#>                                Department                    Name
#> 1 Advanced Residency Training at Stanford          Sofia Gonzales
#> 2            Aeronautics and Astronautics           Jenny Scholes
#> 3          African & African-Amer Studies         Ashante Johnson
#> 4             Anesthes, Periop & Pain Med Natalie Darling-Cabrera
#> 5             Anesthes, Periop & Pain Med          Ashley Johnson
#> 6             Anesthes, Periop & Pain Med        Jessica Martinez
#>            Phone                 Email
#> 1 (650) 724-9139   sofias@stanford.edu
#> 2 (510) 468-5967 jscholes@stanford.edu
#> 3 (650) 721-3969 ashantej@stanford.edu
#> 4 (650) 497-0648 ndarling@stanford.edu
#> 5 (650) 721-7212 ashley85@stanford.edu
#> 6 (650) 497-8189 jimenez5@stanford.edu
Created on 2024-09-11 with reprex v2.1.0
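
If only the department and name columns are needed, a per-group variant may be easier to read. The following is a sketch on a made-up miniature page, not the real markup: the HTML passed to `minimal_html()` is an assumption about the structure (one `h3` and one list per ".item-list"), and `toy_page`, `groups` and `toy_result` are hypothetical names.

```r
library(rvest)

# Hypothetical miniature of the page structure (an assumption, not the real markup)
toy_page <- minimal_html('
  <div class="item-list"><h3>Dept A</h3><ul>
    <li><h4>Ann</h4></li>
    <li><h4>Bob</h4></li>
  </ul></div>
  <div class="item-list"><h3>Dept B</h3><ul>
    <li><h4>Cy</h4></li>
  </ul></div>
')

groups <- html_elements(toy_page, ".item-list")

# html_element() returns exactly one match (or NA) per group
departments <- groups |> html_element("h3") |> html_text2()

# html_elements() inside each group keeps the names attached to their department
names_per_group <- lapply(groups, \(g) html_text2(html_elements(g, "li h4")))

toy_result <- data.frame(
  Department = rep(departments, lengths(names_per_group)),
  Name = unlist(names_per_group)
)
```

A group without any `h4` yields a zero-length vector and is dropped by `rep()`, so vacant positions would still need separate handling, as in the full solution above.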