I have two example HTML files: url1.html and url2.html.
In url2.html the information (71) is missing, and in url1.html it is present.
I’m using this code in R:
library(rvest)
library(tidyverse)

x <- data.frame(
  URL = c(1:2),
  page = c(paste(readLines("url1.html"), collapse = "\n"),
           paste(readLines("url2.html"), collapse = "\n"))
)

for (i in 1:nrow(x)) {
  html <- x$page[i] %>% unclass() %>% unlist()
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    rvest::html_elements(xpath = '//div[@id="tituloContext"]') %>%
    html_text() %>%
    str_replace_all(., "[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> x$title[i]
}
Result: title
[1] "Â CARRINHO DE LIXO PARA LIMPEZA URBANA"
character(0)
Problem: the code retrieves the correct content from URL1, but when the title does not exist (e.g. URL2) I would like to save "ND" (not available) instead.
Expected output:
[1] "Â CARRINHO DE LIXO PARA LIMPEZA URBANA"
[1] "ND"
Any idea how to solve this problem?
Is it possible to optimize this code as well?
>Solution :
When the node is not found, the pipeline returns character(0). We can check the length of the result and, if it is 0 (length(character(0)) is 0), change the value to "ND":
for (i in seq_len(nrow(x))) {
  html <- x$page[i] %>%
    unclass() %>%
    unlist()
  # Extract and clean the title; tmp is character(0) if the node is missing
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    rvest::html_elements(xpath = '//div[@id="tituloContext"]') %>%
    html_text() %>%
    str_replace_all(., "[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> tmp
  # Fall back to "ND" when nothing was found
  if (length(tmp) == 0) tmp <- "ND"
  x$title[i] <- tmp
}
Checking the result:
> x$title
[1] "CARRINHO DE LIXO PARA LIMPEZA URBANA" "ND"
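
As for optimizing the code, one possible direction is to move the extraction into a small function and use html_element() (singular), which yields a missing node rather than an empty result, so html_text() returns NA and the fallback becomes a simple is.na() check. The sketch below is only an illustration under some assumptions: extract_title is a made-up helper name, the two HTML strings are invented stand-ins for url1.html and url2.html (not the real pages), and the two XPath steps are merged into one expression, which assumes the title div sits inside the second table.

```r
library(rvest)
library(stringr)

# Hypothetical stand-ins for url1.html and url2.html (not the real pages)
pages <- c(
  '<html><body><div id="principal"><table></table><table><tr><td><div id="tituloContext">\n\t CARRINHO DE LIXO PARA LIMPEZA URBANA </div></td></tr></table></div></body></html>',
  '<html><body><div id="principal"><table></table><table><tr><td>no title here</td></tr></table></div></body></html>'
)

# Hypothetical helper: returns the cleaned title, or `missing` if the node is absent
extract_title <- function(page, missing = "ND") {
  title <- read_html(page) %>%
    # html_element() gives a missing node (text NA) when nothing matches
    html_element(xpath = '//*[@id="principal"]/table[2]//div[@id="tituloContext"]') %>%
    html_text() %>%
    str_replace_all("[\\n\\r\\t]+", "") %>%
    str_trim()
  if (is.na(title)) missing else title
}

# One call per page, always returning exactly one string each
titles <- vapply(pages, extract_title, character(1), USE.NAMES = FALSE)
print(titles)
```

With the data frame from the question this could be used as x$title <- vapply(x$page, extract_title, character(1), USE.NAMES = FALSE), which avoids the explicit loop and the temporary variable.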