Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Calculate values based on words in strings

I have two data frames:

df <- data.frame(ID = 1:10, year = c(2005, 2005, 2006, 2007, 2007, 2005, 2006, 2007, 2006, 2005), resources = c("gold", "silver", "bronze", "gold, bronze", "silver, bronze", "gold", "gold, silver, bronze", "bronze", "gold, silver", "silver"))

and

prices <- data.frame(year = c(2005:2007, 2005:2007, 2005:2007), resource = c("gold", "gold", "gold", "silver", "silver", "silver", "bronze", "bronze", "bronze"), price = c(10, 11, 12, 8, 6, 7, 5, NA, NA))

which look like

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

   ID year            resources
1   1 2005                 gold
2   2 2005               silver
3   3 2006               bronze
4   4 2007         gold, bronze
5   5 2007       silver, bronze
6   6 2005                 gold
7   7 2006 gold, silver, bronze
8   8 2007               bronze
9   9 2006         gold, silver
10 10 2005               silver

and

  year resource price
1 2005     gold    10
2 2006     gold    11
3 2007     gold    12
4 2005   silver     8
5 2006   silver     6
6 2007   silver     7
7 2005   bronze     5
8 2006   bronze    NA
9 2007   bronze    NA

respectively.

I want to calculate the mean price for each observation in df by taking prices from prices. I.e. if one observation in 2005 only has gold, the mean price should just be the price of gold in 2005, if it has gold and silver it should be the mean of the two prices in the correct year, if one price is NA then the NA value should just be ignored.
The result should look like

   ID year            resources new_price
1   1 2005                 gold 10
2   2 2005               silver 8
3   3 2006               bronze NA
4   4 2007         gold, bronze 12
5   5 2007       silver, bronze 7
6   6 2005                 gold 10
7   7 2006 gold, silver, bronze 8.5
8   8 2007               bronze NA
9   9 2006         gold, silver 8.5
10 10 2005               silver 8

>Solution :

Try:

This method uses stringr to form the resource column into a list,
we then unnest the column to left join the other table,
then it’s plug and play.

 df |> 
  mutate(resource = stringr::str_extract_all(resource, "[^, ]{1,}")) |> 
  unnest(cols = resource) |> 
    left_join(prices, by = c("year", "resource")) |> 
    group_by(ID) |> 
    summarise(resource = paste(unique(resource), collapse = ", "), price = mean(price, na.rm = TRUE ), .groups = "drop")
      ID resource             price
   <int> <chr>                <dbl>
 1     1 gold                  10  
 2     2 silver                 8  
 3     3 bronze               NaN  
 4     4 gold, bronze          12  
 5     5 silver, bronze         7  
 6     6 gold                  10  
 7     7 gold, silver, bronze   8.5
 8     8 bronze               NaN  
 9     9 gold, silver           8.5
10    10 silver                 8 
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading