Extracting a table that spans multiple pages

I am attempting to extract a table that spans multiple pages in an old website.

https://botrank.pastimes.eu/

The site lists a series of bots ranked by score, good and bad votes, and comment and link karma. Ideally, I would like to extract the table in rank order for all 318 pages; https://botrank.pastimes.eu/?sort=rank&page=1 is an example of the first page.


The code I tried was:

pages <- seq(1:318)

bots <- lapply(pages, function(i){
  url <- paste0("https://botrank.pastimes.eu/?sort=rank&page=", i)
  webpage <- url %>%
  httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
  read_html()
  data <- webpage %>%
    html_node("table") %>%
    html_table() %>%
    as_tibble()
  colnames(data) = data[1,]
})

bots_table <- do.call(rbind, bots)
head(bots_table, n=10)

This gives me a clean tibble, but it contains only the first row of each page. Here is the output:

# A tibble: 318 × 7
   Rank  `Bot Name`          Score Good Bo…¹ Bad B…² Comme…³ Link …⁴
   <chr> <chr>               <dbl> <chr>     <chr>   <chr>   <chr>  
 1 1     KickOpenTheDoorBot  0.993 20,877    119     38,594  98,297 
 2 251   NinNinBot           0.921 45        0       47      1      
 3 501   RegularEality       0.859 99        8       0       0      
 4 751   BillyCloneasaurus   0.806 16        0       267,779 9,350  
 5 1,001 MamataBot           0.758 12        0       357     5      
 6 1,251 slashy_potato_mashy 0.703 33        6       0       0      
 7 1,501 jimmy-b-bot         0.667 45        12      14,531  151    
 8 1,751 related_threads     0.616 23        6       1,727   1      
 9 2,001 RemoveMeNot         0.567 15        4       13,595  2      
10 2,251 python_boti         0.552 10        2       0       0      
# … with 308 more rows, and abbreviated variable names

The website source code seems standard, so I'm not sure why this is happening. I am also fairly new to web scraping. Any suggestions would be great!

<table class="table">
  <tr>
    <th>
        <div style="margin: 1px" class="glyphicon glyphicon-chevron-down"></div>
      
      <a href="/?sort=rank&order=reverse">Rank</a></th>
    <th>
      <a href="/?sort=name">Bot Name</a></th>
    <th>
      <a href="/?sort=score">Score</a></th>
    <th><a href="/?sort=good-votes">Good Bot Votes</a></th>
    <th>
      <a href="/?sort=bad-votes">Bad Bot Votes</a></th>
    <th>
      <a href="/?sort=comment-karma">Comment Karma</a></th>
    <th>
      <a href="/?sort=link-karma">Link Karma</a></th>
  </tr>
  
  <tr>
    <td>1</td>
    <td><a href= https://www.reddit.com/user/KickOpenTheDoorBot>KickOpenTheDoorBot</a></td>
    <td>0.9932</td>
    <td>20,877</td>
    <td>119</td>
    <td>38,594</td>
    <td>98,297</td>
  </tr>
  
  <tr>
    <td>2</td>
    <td><a href= https://www.reddit.com/user/Canna_Tips>Canna_Tips</a></td>
    <td>0.992</td>
    <td>18,045</td>
    <td>121</td>
    <td>49,670</td>
    <td>1</td>
  </tr>
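That markup is an ordinary static table, so rvest can parse it without any JavaScript rendering. A minimal self-contained sketch using an inline copy of the snippet (no network access needed):

```r
library(rvest)

# Inline HTML mimicking the site's table structure
html <- minimal_html('
  <table class="table">
    <tr><th>Rank</th><th>Bot Name</th><th>Score</th></tr>
    <tr><td>1</td><td>KickOpenTheDoorBot</td><td>0.9932</td></tr>
  </table>')

# html_table() reads the th cells as the header row automatically
tab <- html %>%
  html_element("table") %>%
  html_table()
tab
```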

Solution:

The following works. Two things changed. First, the function now ends with the line data, so each list element is the full page table; in the original, the last expression was the assignment colnames(data) = data[1,], and in R a function returns its last evaluated expression while an assignment evaluates to its right-hand side, so each iteration returned just the first row. Second, it uses html_elements instead of html_node.
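A minimal offline sketch of that return-value behaviour, using a toy data frame in place of the scraped table:

```r
# In R, a function returns its last evaluated expression, and an
# assignment expression evaluates to its right-hand side.  So ending a
# function with colnames(data) = data[1, ] returns data[1, ].
scrape_one <- function() {
  # Stand-in for a scraped page whose first row holds the headers
  data <- data.frame(a = c("Rank", "1", "2"),
                     b = c("Score", "0.99", "0.98"))
  colnames(data) = data[1, ]   # value of this line: data[1, ], one row
}
res <- scrape_one()
nrow(res)   # only the first row escapes the function
```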

suppressPackageStartupMessages({
  library(rvest)
  library(httr)
  library(tidyverse)
})

pages <- 1:318

bots <- lapply(pages, function(i){
  url <- paste0("https://botrank.pastimes.eu/?sort=rank&page=", i)
  # Fetch and parse the page; ssl_verifypeer = FALSE works around the
  # site's certificate problems
  webpage <- url %>%
    httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
    read_html()
  data <- webpage %>%
    html_elements("table") %>%
    html_table() %>%               # list of tibbles, one per table
    unlist(recursive = FALSE) %>%  # flatten the one-table list to columns
    as_tibble()
  data                             # return the full table, not data[1, ]
})

length(bots)
sapply(bots, dim)

Then rbind them together:

bots_table <- do.call(rbind, bots)
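One follow-up worth noting: html_table() keeps the comma-grouped counts (e.g. "20,877") as character columns, as the <chr> types in the output above show. A sketch of converting them with readr::parse_number(), shown on toy data rather than the full scrape:

```r
suppressPackageStartupMessages(library(tidyverse))

# Toy stand-in for bots_table; the real one has the same comma-grouped
# character columns
bots_table <- tibble(
  Rank              = c("1", "1,001"),
  `Good Bot Votes`  = c("20,877", "12")
)

# parse_number() drops the grouping commas and returns doubles
clean <- bots_table %>%
  mutate(across(everything(), parse_number))
clean
```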