I have been working on a step-by-step solution to find the corresponding author of each article in collection_html_sub_pages.
I inspected the website and saw that the author is stored in an element like <a id="corresp-c1" href="mailto:FName@email.com">FName LName</a>
I built the following code. It works as follows: it reads the initial listing page and collects the href of each individual article. It is then supposed to use html_node to find that tag in each of the article pages. Finally, with lapply and html_text I should be able to extract all the corresponding authors (usually just one per article). However, I am stuck at the first step of getting the tag, and I do not know where the mistake in my code is.
Both correspondence_authors and t1 come back empty. Any advice on how I could improve my code to get the desired result would be appreciated.
library(httr) # will be used to make HTTP GET and POST requests
library(rvest) # will be used to parse HTML
library(xml2)
library(tidyr) #will be used to remove NA
library(tidyverse)
article_year <- function(year){
}
str_1 <- "https://molecularbrain.biomedcentral.com/articles"
prefix_str_1 <- "https://molecularbrain.biomedcentral.com/"
doc <- httr::GET(str_1)
html <- read_html(content(doc, "text"))
#################### Title ####################
c_listing_title <- html_elements(html,"h3.c-listing__title")
a_element <- html_node(c_listing_title,"a")
a_href <- as.list(html_attr(a_element,"href"))
a_text <- lapply(a_element,html_text)
##################### 2 Page Depth #######################
merge_strings <- function(x){
  paste0(prefix_str_1,x)
}
sub_pages <- lapply(a_href,merge_strings)
########################Function_Read_Sub_Pages#####################
read_page_1 <- function(x){
  webpages <- httr::GET(x)
  html <- rvest::read_html(httr::content(webpages, "text"))
  return(html)
}
collection_html_sub_pages <- lapply(sub_pages,read_page_1)
##########################Correspondence_Author###################
correspondence_search <- function(x){
  rvest::html_node(x,"a#corresp-c1")
}
collection_html_sub_pages[[1]]
t1 <- rvest::html_element(collection_html_sub_pages[[1]],paste0('#corresp-c1'))
t2 <- rvest::html_elements(t1,"p")
correspondence_authors <- lapply(collection_html_sub_pages, correspondence_search)
I use helper functions to build my code and would like to keep using them so that the code stays well organized and easy to troubleshoot. Everything in the code above works except the part that gets the corresponding author.
>Solution :
The article URLs you create are not valid paths on that web server. When you paste0() prefix_str_1 and a_href, the first ends with a / and the latter starts with a /, so the resulting URLs look like this: https://molecularbrain.biomedcentral.com//articles/10.1186/s13041-023-01014-0; the correct URL would be https://molecularbrain.biomedcentral.com/articles/10.1186/s13041-023-01014-0 (no double / after the host name).
The easiest fix is to define prefix_str_1 without a trailing /.
prefix_str_1 <- "https://molecularbrain.biomedcentral.com"
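If you would rather not depend on how the prefix happens to be written, a small helper can normalise the slashes before joining. This is only a sketch in base R: join_url is a hypothetical name (it is not part of rvest or httr), and the href value is assumed to look like the ones scraped from the listing page.

```r
# Hypothetical helper (not from any package): join a base URL and a path
# with exactly one slash between them, however the inputs are written.
join_url <- function(base, path) {
  base <- sub("/+$", "", base)  # drop trailing slashes from the base
  path <- sub("^/+", "", path)  # drop leading slashes from the path
  paste(base, path, sep = "/")
}

prefix_str_1 <- "https://molecularbrain.biomedcentral.com/"
a_href <- "/articles/10.1186/s13041-023-01014-0"  # assumed href format

broken_url <- paste0(prefix_str_1, a_href)
fixed_url <- join_url(prefix_str_1, a_href)

print(broken_url)
#> [1] "https://molecularbrain.biomedcentral.com//articles/10.1186/s13041-023-01014-0"
print(fixed_url)
#> [1] "https://molecularbrain.biomedcentral.com/articles/10.1186/s13041-023-01014-0"
```

Because join_url is vectorised over path, it could replace merge_strings directly, e.g. sub_pages <- lapply(a_href, function(x) join_url(prefix_str_1, x)).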
You can also significantly simplify your code.
library(rvest)
base_url <- "https://molecularbrain.biomedcentral.com"
index_html <- read_html(file.path(base_url, "articles"))
# Title and Links ---------------------------------------------------------
a_elements <- html_elements(index_html, "h3.c-listing__title a")
a_href <- html_attr(a_elements, "href")
a_text <- html_text(a_elements)
# subpages ----------------------------------------------------------------
html_sub_pages <-
  lapply(paste0(base_url, a_href),
         read_html)
# Correspondence Author ---------------------------------------------------
lapply(html_sub_pages,
       html_elements,
       "#corresp-c1") |>
  lapply(html_text)
#> [[1]]
#> [1] "Chao Qin"
#>
#> [[2]]
#> [1] "Won Do Heo"
#>
#> [[3]]
#> [1] "Seung-Jae Lee"
#> ...
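Since the link also carries a mailto: href, you could collect the name and e-mail address together. The sketch below is offline and uses two made-up pages built with rvest::minimal_html() in place of html_sub_pages; the names pages, corresp_nodes and authors are only illustrative. It uses html_element() (singular), which gives exactly one result per page and a missing node when nothing matches, so an article without a #corresp-c1 link ends up as NA rather than an empty vector.

```r
library(rvest)

# Made-up stand-ins for html_sub_pages; the second page has no #corresp-c1 link.
pages <- list(
  minimal_html('<a id="corresp-c1" href="mailto:FName@email.com">FName LName</a>'),
  minimal_html('<p>No corresponding author listed</p>')
)

# One node per page; pages without a match yield a missing node.
corresp_nodes <- lapply(pages, html_element, "#corresp-c1")

authors <- data.frame(
  name  = vapply(corresp_nodes, html_text, character(1)),
  # strip the "mailto:" scheme from the href to keep only the address
  email = sub("^mailto:", "", vapply(corresp_nodes, html_attr, character(1), "href")),
  stringsAsFactors = FALSE
)

print(authors)
```

With the real data you would pass html_sub_pages instead of pages; the rest of the code should not need to change.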