How to create data frame from rvest scraped website, preserving nested structure of data

Say that I use read_html_live() from the rvest package to pull some HTML that looks like this (reproduced here with minimal_html()):

books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
      <div class="author">Author 3</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 3</div>
      <div class="year">1845</div>
      <div class="author">Author 5</div>
      <div class="author">Author 6</div>
      <div class="author">Author 7</div>
      <div class="author">Author 8</div>
    </div>    
  </div>')

I would like to use the rvest package to create a data frame (a tibble would also be fine) from the information above. It should be organized at the author level, so each row contains an author, the book title, and the year.

If I only cared about the first author, it would be easy. Something like:

data0 <- books %>% html_elements(".book")
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()
author1 <- data0 %>% html_element(".author") %>% html_text2()
data <- data.frame(title, year, author1)

However, I would actually like to extract all authors, the authors being "children" of each book. The data frame would then have eight rows, one for each author. For instance, row 8 would contain Book 3, 1845, and Author 8. How can I do this?

Here is a rough idea, but I am looking for easier solutions:

data0 <- books %>% html_elements(".book") 
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()

authors <- data0 %>% html_element(".author")

I would then loop over the three elements of authors, save each of them to a data frame, associate each of these author data frames with the relevant title and year, and somehow reshape the result into a long data frame.

Solution:

Here is one approach which uses lapply to loop over the book nodes:

library(rvest)
library(dplyr, warn.conflicts = FALSE)
books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
      <div class="author">Author 3</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 3</div>
      <div class="year">1845</div>
      <div class="author">Author 5</div>
      <div class="author">Author 6</div>
      <div class="author">Author 7</div>
      <div class="author">Author 8</div>
    </div>
  </div>')

data0 <- books |>
  html_elements(".book") |>
  lapply(\(x) {
    tibble(
      title = x |> html_element(".booktitle") |> html_text2(),
      year = x |> html_element(".year") |> html_text2(),
      authors = x |> html_elements(".author") |> html_text2(),
    )
  }) |>
  bind_rows()

data0
#> # A tibble: 8 × 3
#>   title  year  authors 
#>   <chr>  <chr> <chr>   
#> 1 Book 1 1999  Author 1
#> 2 Book 1 1999  Author 2
#> 3 Book 1 1999  Author 3
#> 4 Book 2 2022  Author 4
#> 5 Book 3 1845  Author 5
#> 6 Book 3 1845  Author 6
#> 7 Book 3 1845  Author 7
#> 8 Book 3 1845  Author 8
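
As a possible variation on the approach above (a sketch, not part of the original answer), the authors can be collected into a list column first and then expanded with tidyr::unnest(). This assumes the tidyr package is installed; the names book_nodes and by_author are made up for illustration. It also converts year to an integer, and keep_empty = TRUE keeps a book that has no .author nodes as a row with NA instead of silently dropping it.

```r
library(rvest)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)  # assumption: tidyr is not used in the original answer

books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
      <div class="author">Author 3</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 3</div>
      <div class="year">1845</div>
      <div class="author">Author 5</div>
      <div class="author">Author 6</div>
      <div class="author">Author 7</div>
      <div class="author">Author 8</div>
    </div>
  </div>')

# One node per book
book_nodes <- html_elements(books, ".book")

by_author <- tibble(
  # html_element() returns exactly one match per book, so these are length 3
  title = book_nodes |> html_element(".booktitle") |> html_text2(),
  year = book_nodes |> html_element(".year") |> html_text2() |> as.integer(),
  # html_elements() returns every match, so each entry is a character vector
  author = lapply(book_nodes, \(x) x |> html_elements(".author") |> html_text2())
) |>
  # Expand the list column to one row per author
  unnest(author, keep_empty = TRUE)

by_author
```

The result has the same eight rows as the lapply() version, with the column named author and year stored as an integer.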