Extract part of a list column using tidyverse functions

July 8, 2023

Given the data frame ‘dat’, where ‘author’ is a list column of author names, how can I create a new column that contains only the first author’s last name, using tidyverse functions?

dat <- structure(list(author = list(c("Pagsberg, Anne Katrine", "Uhre, Camilla", 
"Uhre, Valdemar and"), c("Franklin, Martin E", "Sapyta, Jeffrey", 
"Freeman, Jennifer B"), c("Selles, Robert R", "Belschner, Laura", 
"Negreiros, Juliana and")), pmid = c("35305587", "21934055", 
"29179016")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", 
"data.frame"))

In base R, the following code works:
dat$first_author <- sapply(strsplit(sapply(dat$author, "[[", 1), ","), "[", 1)

Solution:

One pure tidyverse approach is to make the tibble rowwise, pluck the first element of the list column in each row, and then use str_remove to drop the first comma and everything after it. Calling ungroup at the end removes the rowwise grouping, so later operations on the result behave as they would on an ordinary tibble.

library(tidyverse)

dat %>% 
  rowwise() %>% 
  mutate(first_author = pluck(author, 1) %>% str_remove(',.*$')) %>%
  ungroup()
#> # A tibble: 3 x 3
#>   author    pmid     first_author
#>   <list>    <chr>    <chr>       
#> 1 <chr [3]> 35305587 Pagsberg    
#> 2 <chr [3]> 21934055 Franklin    
#> 3 <chr [3]> 29179016 Selles
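
The rowwise step is not strictly required. As a minimal sketch, assuming every row holds at least one author, purrr's map_chr can take the position 1 as a shorthand extractor and return the first element of each list entry as a plain character vector, which str_remove then trims in one vectorised call. The data is rebuilt here with tibble() only so the snippet runs on its own; the name res is just illustrative.

```r
library(dplyr)
library(purrr)
library(stringr)

# Same data as in the question, rebuilt so the example is self-contained
dat <- tibble(
  author = list(
    c("Pagsberg, Anne Katrine", "Uhre, Camilla", "Uhre, Valdemar and"),
    c("Franklin, Martin E", "Sapyta, Jeffrey", "Freeman, Jennifer B"),
    c("Selles, Robert R", "Belschner, Laura", "Negreiros, Juliana and")
  ),
  pmid = c("35305587", "21934055", "29179016")
)

# map_chr(author, 1) pulls the first element of each list entry;
# str_remove() then drops the first comma and everything after it
res <- dat %>%
  mutate(first_author = map_chr(author, 1) %>% str_remove(",.*$"))

res$first_author
```

If some rows could have an empty author vector, map_chr(author, 1) would fail on them; passing .default = NA_character_ to map_chr is one way to get NA for those rows instead.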

That said, I feel no compulsion to use tidyverse functions when a good base R one-liner exists (note that the \(x) lambda shorthand requires R 4.1 or later):

within(dat, first_author <- sapply(author, \(x) gsub(',.*$', '', x[[1]])))
#> # A tibble: 3 x 3
#>   author    pmid     first_author
#>   <list>    <chr>    <chr>       
#> 1 <chr [3]> 35305587 Pagsberg    
#> 2 <chr [3]> 21934055 Franklin    
#> 3 <chr [3]> 29179016 Selles
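
A slightly stricter variant of the base R one-liner, offered as a sketch rather than a required change: vapply declares that each result must be a single string, so it raises an error on unexpected input instead of silently changing the return type the way sapply can (for example, sapply on an empty list returns list() rather than character(0)). Since only the first comma matters, sub is enough in place of gsub. The data frame is rebuilt in base R here so the snippet needs no packages.

```r
# Same data as in the question, built with base R only
dat <- data.frame(
  pmid = c("35305587", "21934055", "29179016"),
  stringsAsFactors = FALSE
)
dat$author <- list(
  c("Pagsberg, Anne Katrine", "Uhre, Camilla", "Uhre, Valdemar and"),
  c("Franklin, Martin E", "Sapyta, Jeffrey", "Freeman, Jennifer B"),
  c("Selles, Robert R", "Belschner, Laura", "Negreiros, Juliana and")
)

# vapply() checks that every result is exactly one character string
dat$first_author <- vapply(
  dat$author,
  function(x) sub(",.*$", "", x[[1]]),  # keep the text before the first comma
  character(1)
)

dat$first_author
```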