Error in UseMethod("xml_find_all") when using html_nodes() on a list

February 21, 2023

I ran into this problem when I tried to use html_nodes() on a list (profile_data_list).

library(tidyverse)
library(rvest)
list.mst <- c("0100111338", "0100105077", "0100110528", "0107464283", "0105342089")
url <- 'https://infodoanhnghiep.com/tim-kiem/ma-so-thue/'
link <- paste0(url, list.mst,'/')
profile_data_list <- lapply(link, function(x){search.result <- read_html(x)})
list <- profile_data_list %>% html_nodes(".company-name a") %>% html_attr('href') %>% unique()
com.page = paste0("https:",profile_data_list)

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"

I first tried a for loop, but it only kept the result for the last value in the sequence. For example, with a for loop I only get the result for "0105342089". So I switched to lapply to call read_html for each entry in list.mst, but now I am stuck on html_nodes. I also tried the following, and both still failed:

list <- purrr::map(profile_data_list, ~ .x %>% html_nodes(".company-name a") %>% html_attr('href') %>% unique())
list <- lapply(profile_data_list, function(x) x %>% html_nodes(".company-name a") %>% html_attr('href') %>% unique())

I would really appreciate any suggestions. Thanks all!

>Solution :
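
html_nodes() (called html_elements() in current rvest) works on one parsed document or node set at a time, so piping the whole profile_data_list into it leaves xml_find_all with no method to dispatch to. Note also that the com.page line in the question pastes "https:" onto profile_data_list instead of onto the extracted hrefs. Here is a minimal sketch of the per-page approach. It assumes made-up HTML built with minimal_html() so that it runs offline; extract_links, html_pages and the example.com hrefs are hypothetical names, not part of the real site.

```r
library(rvest)

# Hypothetical HTML standing in for two search-result pages
html_pages <- list(
  '<div class="company-name"><a href="//example.com/x">X</a></div>
   <div class="company-name"><a href="//example.com/x">X again</a></div>',
  '<div class="company-name"><a href="//example.com/y">Y</a></div>'
)

# Parse every page; docs is a plain list of xml documents
docs <- lapply(html_pages, minimal_html)

# Works on ONE document: select the anchors, pull href, drop duplicates
extract_links <- function(doc) {
  doc %>%
    html_elements(".company-name a") %>%
    html_attr("href") %>%
    unique()
}

# Apply the extractor to each document, then flatten to a character vector
links_per_page <- lapply(docs, extract_links)
com_page <- paste0("https:", unlist(links_per_page))
```

With real pages, docs would come from lapply(link, read_html) instead of minimal_html(); the rest should stay the same as long as the selector matches.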

library(tidyverse)
library(rvest)

link <- c("0100111338", "0100105077", "0100110528", "0107464283", "0105342089") %>% 
  str_c("https://infodoanhnghiep.com/tim-kiem/ma-so-thue/", ., "/")

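# Scrape one search-result page and return one row per company listed on it
scraper <- function(url) {
  cat("Scraping", url, "\n")
  url %>%
    read_html() %>%
    # each .company-item block describes one company
    html_elements(".company-item") %>%
    map_dfr(~ tibble(
      # prepend "https:" to the href found in the company name
      link = .x %>%
        html_element(".company-name a") %>%
        html_attr("href") %>%
        str_c("https:", .),
      title = .x %>%
        html_element(".company-name") %>%
        html_text2(),
      city = .x %>%
        html_element(".description.hidden-xs") %>%
        html_text2()
    )) %>%
    # the argument is named url so that mutate() does not pick up the
    # link column created above instead of the page that was scraped
    mutate(source = url)
}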

map_dfr(link, scraper)

# A tibble: 26 × 4
   link                                                        title city  source
   <chr>                                                       <chr> <chr> <chr> 
 1 https://infodoanhnghiep.com/thong-tin/Cong-Ty-Co-Phan-My-T… "C\u… "H\u… https…
 2 https://infodoanhnghiep.com/thong-tin/Cong-ty-TNHH-hoi-cho… "C\u… "H\u… https…
 3 https://infodoanhnghiep.com/thong-tin/Chi-Nhanh-Cty-My-Thu… "Chi… "TP … https…
 4 https://infodoanhnghiep.com/thong-tin/Chi-nhanh-cong-ty-my… "Chi… "H\u… https…
 5 https://infodoanhnghiep.com/thong-tin/Chi-nhanh-cong-ty-my… "Chi… "Th\… https…
 6 https://infodoanhnghiep.com/thong-tin/Cong-Ty-Co-Phan-Xay-… "C\u… "H\u… https…
 7 https://infodoanhnghiep.com/thong-tin/Chi-Nhanh-Cong-Ty-Co… "Chi… "H\u… https…
 8 https://infodoanhnghiep.com/thong-tin/Chi-nhanh-cong-ty-co… "Chi… "H\u… https…
 9 https://infodoanhnghiep.com/thong-tin/CHI-NHANH-CONG-TY-CO… "CHI… "H\u… https…
10 https://infodoanhnghiep.com/thong-tin/CHI-NHANH-CONG-TY-CO… "CHI… "H\u… https…
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows
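
As for the for loop mentioned in the question: getting only the last value usually means each iteration assigned to the same variable and overwrote the previous result. A minimal sketch of the usual fix is to pre-allocate a list and fill one slot per iteration. The pages vector and its example.com hrefs are hypothetical stand-ins so the code runs without network access.

```r
library(rvest)

# Hypothetical stand-in pages so the sketch runs offline
pages <- c(
  '<div class="company-name"><a href="//example.com/a">A</a></div>',
  '<div class="company-name"><a href="//example.com/b">B</a></div>'
)

# One slot per page; slot i is filled on iteration i, so nothing is overwritten
results <- vector("list", length(pages))
for (i in seq_along(pages)) {
  doc <- minimal_html(pages[[i]])
  results[[i]] <- doc %>%
    html_elements(".company-name a") %>%
    html_attr("href")
}

# Flatten the per-page results into a single character vector
hrefs <- unlist(results)
```

lapply() and purrr::map() do this bookkeeping for you, which is why they are generally preferred over an explicit loop here.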