I ran into this problem when I tried to use "html_nodes" on a list (profile_data_list).
library(tidyverse)
library(rvest)
list.mst <- c("0100111338", "0100105077", "0100110528", "0107464283", "0105342089")
url <- 'https://infodoanhnghiep.com/tim-kiem/ma-so-thue/'
link <- paste0(url, list.mst,'/')
profile_data_list <- lapply(link, function(x){search.result <- read_html(x)})
list <- profile_data_list %>% html_nodes(".company-name a") %>% html_attr('href') %>% unique()
com.page = paste0("https:",profile_data_list)
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"
I tried a for loop, but it only gave me the result for the last value in the sequence; for example, I only got the result for "0105342089". So I used lapply to call read_html on every link built from list.mst, but then I got stuck at html_nodes. I also tried the following, which still failed:
list <- purrr::map(profile_data_list, ~ .x %>% html_nodes(".company-name a") %>% html_attr('href') %>% unique())
and
list <- lapply(profile_data_list, function(x) x %>% html_nodes(".company-name a") %>% html_attr('href') %>% unique())
I would really appreciate any suggestions. Thanks, all!
>Solution :
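The selector functions in rvest work on a single parsed document (or node set), not on a list of documents or on a character vector of URLs, so the selector has to be applied to each page separately. Here is a minimal sketch of that idea. To keep it runnable offline, two tiny in-memory pages built with minimal_html stand in for the real search results; their markup, the example.com hrefs, and the names pages, hrefs and company_urls are assumptions for illustration only.

```r
library(rvest)

# Two small in-memory pages stand in for the downloaded search results
# (hypothetical markup, not the real site's HTML).
pages <- list(
  minimal_html('<div class="company-name"><a href="//example.com/a">A</a></div>'),
  minimal_html('<div class="company-name"><a href="//example.com/b">B</a></div>
                <div class="company-name"><a href="//example.com/b">B again</a></div>')
)

# Apply the selector to each parsed document on its own.
hrefs <- lapply(pages, function(page) {
  page %>%
    html_elements(".company-name a") %>%
    html_attr("href") %>%
    unique()
})

# Flatten the per-page results and prepend the scheme
# (the hrefs here are protocol-relative, starting with "//").
company_urls <- paste0("https:", unlist(hrefs))
```

With real pages, the list would come from lapply(link, read_html) instead of minimal_html. The fuller version below does the same per-page work but returns a tibble with several fields.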
library(tidyverse)
library(rvest)
link <- c("0100111338", "0100105077", "0100110528", "0107464283", "0105342089") %>%
  str_c("https://infodoanhnghiep.com/tim-kiem/ma-so-thue/", ., "/")
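# Scrape one search-result page and return one row per company listed on it.
scraper <- function(url) {
  cat("Scraping", url, "\n")
  url %>%
    read_html() %>%
    html_elements(".company-item") %>%
    map_dfr(~ tibble(
      link = .x %>%
        html_element(".company-name a") %>%
        html_attr("href") %>%
        str_c("https:", .),
      title = .x %>%
        html_element(".company-name") %>%
        html_text2(),
      city = .x %>%
        html_element(".description.hidden-xs") %>%
        html_text2()
    )) %>%
    # The argument is named `url` so mutate() cannot confuse it with the `link` column.
    mutate(source = url)
}
# Run the scraper on every search URL and row-bind the results.
map_dfr(link, scraper)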
# A tibble: 26 × 4
link title city source
<chr> <chr> <chr> <chr>
1 https://infodoanhnghiep.com/thong-tin/Cong-Ty-Co-Phan-My-T… "C\u… "H\u… https…
2 https://infodoanhnghiep.com/thong-tin/Cong-ty-TNHH-hoi-cho… "C\u… "H\u… https…
3 https://infodoanhnghiep.com/thong-tin/Chi-Nhanh-Cty-My-Thu… "Chi… "TP … https…
4 https://infodoanhnghiep.com/thong-tin/Chi-nhanh-cong-ty-my… "Chi… "H\u… https…
5 https://infodoanhnghiep.com/thong-tin/Chi-nhanh-cong-ty-my… "Chi… "Th\… https…
6 https://infodoanhnghiep.com/thong-tin/Cong-Ty-Co-Phan-Xay-… "C\u… "H\u… https…
7 https://infodoanhnghiep.com/thong-tin/Chi-Nhanh-Cong-Ty-Co… "Chi… "H\u… https…
8 https://infodoanhnghiep.com/thong-tin/Chi-nhanh-cong-ty-co… "Chi… "H\u… https…
9 https://infodoanhnghiep.com/thong-tin/CHI-NHANH-CONG-TY-CO… "CHI… "H\u… https…
10 https://infodoanhnghiep.com/thong-tin/CHI-NHANH-CONG-TY-CO… "CHI… "H\u… https…
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows
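On the for loop that only kept the last value: the loop itself is not shown in the question, but a common cause is assigning to the same variable on every iteration, so each pass overwrites the previous one. The sketch below assumes that was the issue and stores each result in its own slot of a pre-allocated list. The names ids, base_url and results are hypothetical, and the loop only builds the URLs so that it runs without network access.

```r
ids <- c("0100111338", "0100105077", "0100110528")
base_url <- "https://infodoanhnghiep.com/tim-kiem/ma-so-thue/"

# Pre-allocate one slot per ID so each iteration writes to its own element
# instead of overwriting a single variable.
results <- vector("list", length(ids))
names(results) <- ids

for (i in seq_along(ids)) {
  # With network access, the page could be read here, e.g. with read_html().
  results[[i]] <- paste0(base_url, ids[i], "/")
}
```

After the loop, results holds one entry per ID rather than only the last one, which is what lapply and map_dfr do automatically.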