R + rvest: retrieve files from GitHub

Apologies for not providing a reprex, but if I could write one, I would not need to post this in the first place.
I need to retrieve the Excel files whose filenames contain the word "età" from the directory listed at this link:

https://github.com/apalladi/covid_vaccini_monitoraggio/tree/main/dati

and also store their file names in a vector.

Any ideas on how to achieve that? I am thinking about using rvest, but I am open to other reasonable suggestions.
Note that the list of files needs to be obtained from the GitHub page, since it is not known a priori.
Thanks!

Solution:

You should use the GitHub API rather than scraping the website. Its contents endpoint returns one entry per file in the directory, so you can get the file names and the download links into a two-column data frame by doing:

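library(httr)
library(dplyr)

# Ask the GitHub contents API for the listing of the "dati" directory
req <- GET(paste0("https://api.github.com/repos/",
                  "apalladi/covid_vaccini_monitoraggio/contents/dati"))
stop_for_status(req)  # fail early on HTTP errors such as rate limiting

# One list element per directory entry
file_list <- content(req)
filenames <- sapply(file_list, function(x) x$name)

# Keep only the Excel files
file_list <- file_list[grepl("xlsx$", filenames)]

# Two-column data frame: file name and raw download link
tibble(file = sapply(file_list, function(x) x$name),
       link = sapply(file_list, function(x) x$download_url))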
#> # A tibble: 30 x 2
#>    file                         link                                            
#>    <chr>                        <chr>                                           
#>  1 data_iss_età_2021-07-14.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  2 data_iss_età_2021-07-21.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  3 data_iss_età_2021-07-28.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  4 data_iss_età_2021-08-04.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  5 data_iss_età_2021-08-11.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  6 data_iss_età_2021-08-18.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  7 data_iss_età_2021-08-25.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  8 data_iss_età_2021-09-01.xlsx https://raw.githubusercontent.com/apalladi/covi~
#>  9 data_iss_età_2021-09-08.xlsx https://raw.githubusercontent.com/apalladi/covi~
#> 10 data_iss_età_2021-09-15.xlsx https://raw.githubusercontent.com/apalladi/covi~
#> # ... with 20 more rows

Created on 2022-02-01 by the reprex package (v2.0.1)
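
The filter above keeps every .xlsx file in the directory, and it stops at listing the links. As a rough sketch of the remaining steps (restricting to names that contain "età", storing the names in a vector, and downloading the files), something like the following might work. The helper names `select_files` and `download_files`, the default pattern, and the `dest_dir` folder are assumptions made for illustration and are not part of the original answer; the sketch also assumes the `download_url` values returned by the API can be passed to `download.file()` unchanged.

```r
# Hypothetical helper: keep the entries whose name matches `pattern`
# and return their names and download links as a data frame.
select_files <- function(file_list, pattern = "età.*\\.xlsx$") {
  filenames <- vapply(file_list, function(x) x$name, character(1))
  keep <- grepl(pattern, filenames)
  data.frame(
    file = filenames[keep],
    link = vapply(file_list[keep], function(x) x$download_url, character(1)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: download every file listed in `files` into `dest_dir`.
# mode = "wb" writes in binary mode so the .xlsx files are not corrupted on Windows.
download_files <- function(files, dest_dir = "dati") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(dest_dir, files$file)
  for (i in seq_len(nrow(files))) {
    download.file(files$link[i], destfile = paths[i], mode = "wb")
  }
  invisible(paths)
}

# Possible usage, with `req` being the response obtained above:
# files <- select_files(content(req))
# età_names <- files$file          # character vector of file names
# paths <- download_files(files)   # local paths of the downloaded files
```

The downloaded files could then be read with a package such as readxl, which is again a suggestion rather than something the original answer covers.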
