webscraping: capture links of references with R

I want to capture the links to references from an article on this page:
https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es

I have tried this:

library(rvest)
library(dplyr)

link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)

links <- page %>%
  html_nodes("a") %>%
  html_text()

But these are not the links that I want.


There are 68 references, so I want the 68 links attached to those references.

>Solution:

I have looked at the site and found that the [ links ] labels run some JavaScript on the onclick event that sends you to an intermediate page, so they are not easy to scrape directly.
I found this solution, which matches 65 of the 68 links written as plain text in the "#article-back" section. Three links are badly formatted (e.g. "h ttp://") and therefore not matched. I hope it has been helpful.

Edit:
Regexp taken from this answer

library(rvest)
library(dplyr)

link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)

# Plain text of the references section
text <- page %>%
  html_node("#article-back") %>%
  html_text()

# Match URLs written as plain text in the references
matches <- gregexpr(
  "\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]",
  text)

links <- regmatches(text, matches)
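Since gregexpr() is applied to a single string here, regmatches() returns a one-element list rather than a flat character vector. A minimal, self-contained sketch of the same pattern on a made-up sample string (the example text and URLs are invented for illustration):

```r
# Made-up references text containing two plain-text URLs
text <- "1. See https://example.org/a. 2. See http://example.org/b?x=1."

matches <- gregexpr(
  "\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]",
  text)

# regmatches() returns a list (one vector per input string); unlist() flattens it.
# Note the trailing periods are excluded, because the final character class
# does not allow "." as the last character of a match.
urls <- unlist(regmatches(text, matches))
urls
```

On the real page, the same unlist() call turns the list of matches into the plain character vector of reference links.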

Edit 2:
To scrape the links from the JavaScript in the onclick attribute:

library(rvest)
library(dplyr)

link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)

# onclick attribute of each anchor in the references section
text <- page %>%
  html_node("#article-back") %>%
  html_nodes("a") %>%
  html_attr("onclick")

# Keep the anchors that have an onclick handler, and rebuild each target URL
# from the quoted path inside the JavaScript
links <- gsub(".*(/[^']+).*", "https://www.scielo.org.mx\\1", text[!is.na(text)])

# Extract the pid (the SciELO document identifier) from each link
links_pid <- gsub(".*pid=([^&]+)&.*", "\\1", links)
links_pid
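If those pids identify SciELO documents, full article URLs can be rebuilt with the same query pattern as the link in the question. A hypothetical sketch (the sample pid below is the one from the question; on the real page, links_pid would hold the extracted vector):

```r
# Sample pid for illustration; in the answer above, links_pid holds the real ones
links_pid <- c("S2448-76782022000100004")

# Reassemble full URLs using the same query pattern as the original article link
article_urls <- paste0(
  "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=",
  links_pid,
  "&lang=es")
article_urls
```

paste0() is vectorized, so this builds one URL per pid in a single call.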
