Extract elements from list of httr headers

I have a simple question that has nonetheless stumped me.

I am trying to extract specific elements from a list of website headers for a set of given URLs. I obtained the headers using the httr package. Using magrittr::extract, I can extract one element from the header for each URL and include it in a tibble. However, I cannot figure out how to extract more than one element from the header for each URL.

For example, the code below extracts the "status_code" for each URL and includes it in a tibble. There is only one "status_code" for each URL.

pacman::p_load(httr, rvest, dplyr, purrr, tidyr)

some_urls <- c("https://www.psychologytoday.com/us/therapists/new-york/a?page=10",
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=4",
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=140",
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=3"
)

df <- map_dfr(some_urls, ~{
    httr::GET(.x) %>% 
    magrittr::extract(c("url", "status_code"))
})

However, I am not interested in "status_code" but in "status". There may be more than one "status" for each URL. I want to extract every "status" for each URL and add it to a tibble.

The code below does not work, which I assume is because there is more than one "status" for each URL.

df <- map_dfr(some_urls, ~{
  httr::GET(.x) %>% 
  magrittr::extract(c("url", "status"))
})

This code gives me the following result:

Error:
! Column names `url`, `url`, and `url` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `repaired_names()`:
! Names must be unique.
✖ These names are duplicated:
  * "url" at locations 1, 2, 3, and 4.
Backtrace:
 1. purrr::map_dfr(...)
 2. dplyr::bind_rows(res, .id = .id)
 4. tibble:::as_tibble.list(dots)
 5. tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
 6. tibble:::set_repaired_names(x, repair_hint = TRUE, .name_repair)
 8. tibble:::repaired_names(= NULL)
 Error: 
Caused by error in `repaired_names()`:
! Names must be unique.
✖ These names are duplicated:
* "url" at locations 1, 2, 3, and 4.

I greatly appreciate any advice you may have! If inserting ".name_repair" somewhere in my code is the answer, I have been unable to figure out how to use it. I have also tried setting column names in advance, without success. Please let me know if you have any advice on how I can extract this information!

>Solution :

The response has no top-level "status" element; the statuses are stored in all_headers, a list whose length can vary from 1 to n (one element per response, including redirects). We can loop over all_headers with map, pluck the "status" from each element, and paste the values into a single string with toString:

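library(purrr)
library(dplyr)

# Assumption: the original passed an undefined `user_agent` object here.
# `ua` is a placeholder; substitute whatever user agent string you need.
ua <- httr::user_agent("Mozilla/5.0")

map_dfr(some_urls, ~{
    resp <- httr::GET(.x, ua)
    # One row per URL: all statuses collapsed into a comma-separated string
    tibble(url = resp$url,
           status = toString(unlist(map(resp$all_headers, pluck, "status"))))
})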

-output

# A tibble: 4 × 2
  url                                                              status  
  <chr>                                                            <chr>   
1 https://www.psychologytoday.com/us/therapists/new-york/a?page=10 200     
2 https://www.psychologytoday.com/us/therapists/new-york/a?page=4  200     
3 https://www.psychologytoday.com/us/therapists/new-york           302, 200
4 https://www.psychologytoday.com/us/therapists/new-york/a?page=3  200     
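
To see what this does without making any network requests, here is a minimal sketch that uses a hand-built list in place of a real httr response. The object mock_resp, its URL, and its header values are hypothetical; only the field names url, status_code and all_headers (with a status entry in each element) mirror what httr returns. It also shows why selecting "status" at the top level yields nothing useful.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for an httr response that followed one redirect.
mock_resp <- list(
  url = "https://example.com/final",
  status_code = 200L,
  all_headers = list(
    list(status = 302L, version = "HTTP/1.1", headers = list()),
    list(status = 200L, version = "HTTP/1.1", headers = list())
  )
)

# There is no top-level "status" element, so subsetting by that name
# gives a NULL entry whose name is NA.
picked <- mock_resp[c("url", "status")]
names(picked)

# Pluck "status" from every entry of all_headers, then collapse to one string.
statuses <- map_int(mock_resp$all_headers, pluck, "status")
row <- tibble(url = mock_resp$url, status = toString(statuses))
row$status  # "302, 200"
```

Using map_int instead of unlist(map(...)) is a small design choice in this sketch: it fails early if any element lacks an integer status.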

Alternatively, return the statuses as a list column and unnest it afterwards, which gives one row per status:

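library(purrr)
library(dplyr)
library(tidyr)

# Assumption: the original passed an undefined `user_agent` object here.
# `ua` is a placeholder; substitute whatever user agent string you need.
ua <- httr::user_agent("Mozilla/5.0")

map_dfr(some_urls, ~{
    resp <- httr::GET(.x, ua)
    # status is a list column; url is recycled to match its length
    tibble(url = resp$url,
           status = map(resp$all_headers, pluck, "status"))
}) %>%
    unnest(status)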

-output

# A tibble: 5 × 2
  url                                                              status
  <chr>                                                             <int>
1 https://www.psychologytoday.com/us/therapists/new-york/a?page=10    200
2 https://www.psychologytoday.com/us/therapists/new-york/a?page=4     200
3 https://www.psychologytoday.com/us/therapists/new-york              302
4 https://www.psychologytoday.com/us/therapists/new-york              200
5 https://www.psychologytoday.com/us/therapists/new-york/a?page=3     200
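
The list-column variant can be checked offline in the same way. The sketch below assumes two hand-built responses (mock_responses, with made-up example.com URLs) rather than real httr output, and shows how a URL with two statuses becomes two rows after unnesting.

```r
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical stand-ins for httr responses: the second followed a redirect.
mock_responses <- list(
  list(url = "https://example.com/a",
       all_headers = list(list(status = 200L))),
  list(url = "https://example.com/b",
       all_headers = list(list(status = 302L), list(status = 200L)))
)

# Build one tibble per response with status as a list column;
# the length-1 url is recycled to match the number of statuses.
nested <- map_dfr(mock_responses, function(resp) {
  tibble(url = resp$url,
         status = map(resp$all_headers, pluck, "status"))
})

# Expand the list column so each status gets its own row.
long <- unnest(nested, status)
long
```

Here long has three rows: one for the first URL and two for the second, with status stored as an integer column.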