Calculate values based on words in strings

July 26, 2022

I have two data frames:

df <- data.frame(ID = 1:10, year = c(2005, 2005, 2006, 2007, 2007, 2005, 2006, 2007, 2006, 2005), resources = c("gold", "silver", "bronze", "gold, bronze", "silver, bronze", "gold", "gold, silver, bronze", "bronze", "gold, silver", "silver"))

and

prices <- data.frame(year = c(2005:2007, 2005:2007, 2005:2007), resource = c("gold", "gold", "gold", "silver", "silver", "silver", "bronze", "bronze", "bronze"), price = c(10, 11, 12, 8, 6, 7, 5, NA, NA))

which look like

   ID year            resources
1   1 2005                 gold
2   2 2005               silver
3   3 2006               bronze
4   4 2007         gold, bronze
5   5 2007       silver, bronze
6   6 2005                 gold
7   7 2006 gold, silver, bronze
8   8 2007               bronze
9   9 2006         gold, silver
10 10 2005               silver

and

  year resource price
1 2005     gold    10
2 2006     gold    11
3 2007     gold    12
4 2005   silver     8
5 2006   silver     6
6 2007   silver     7
7 2005   bronze     5
8 2006   bronze    NA
9 2007   bronze    NA

respectively.

I want to calculate the mean price for each observation in df by taking prices from prices. I.e. if one observation in 2005 only has gold, the mean price should just be the price of gold in 2005, if it has gold and silver it should be the mean of the two prices in the correct year, if one price is NA then the NA value should just be ignored.
The result should look like

   ID year            resources new_price
1   1 2005                 gold 10
2   2 2005               silver 8
3   3 2006               bronze NA
4   4 2007         gold, bronze 12
5   5 2007       silver, bronze 7
6   6 2005                 gold 10
7   7 2006 gold, silver, bronze 8.5
8   8 2007               bronze NA
9   9 2006         gold, silver 8.5
10 10 2005               silver 8

>Solution :

Try:

This method uses stringr to form the resource column into a list,
we then unnest the column to left join the other table,
then it’s plug and play.

 df |> 
  mutate(resource = stringr::str_extract_all(resource, "[^, ]{1,}")) |> 
  unnest(cols = resource) |> 
    left_join(prices, by = c("year", "resource")) |> 
    group_by(ID) |> 
    summarise(resource = paste(unique(resource), collapse = ", "), price = mean(price, na.rm = TRUE ), .groups = "drop")

      ID resource             price
   <int> <chr>                <dbl>
 1     1 gold                  10  
 2     2 silver                 8  
 3     3 bronze               NaN  
 4     4 gold, bronze          12  
 5     5 silver, bronze         7  
 6     6 gold                  10  
 7     7 gold, silver, bronze   8.5
 8     8 bronze               NaN  
 9     9 gold, silver           8.5
10    10 silver                 8