I have a dataset of projects that I am trying to analyse based on themes and funding received.
I have a column "theme" with several keywords listed. I have all unique keywords possible in a list, and I am trying to sum, for each keyword, the total funding received, and store it in a dataframe.
As an example, my initial dataset would be:
df
| Projects | Themes | Funding |
|---|---|---|
| Project 1 | Agriculture, Water, Food | 50 000 |
| Project 2 | Agriculture, Digital | 100 000 |
| Project 3 | Agriculture, Food | 50 000 |
| Project 4 | Urban planning, Food, Water, Digital | 200 000 |
| Project 5 | Agriculture | 30 000 |
My list of keywords "keywords_list" is a list of characters containing "Agriculture", "Water", "Food", "Digital", "Urban planning".
The result I would like is:
themes_analysis
| Keywords | Count | Funding |
|---|---|---|
| Agriculture | 4 | 230 000 |
| Water | 2 | 250 000 |
| Food | 3 | 300 000 |
| Digital | 2 | 300 000 |
| Urban planning | 1 | 200 000 |
I already managed to have the first two columns thanks to the functions
themes_string_sep <- strsplit(df$Themes, split = ",")
keywords_list <- unlist(themes_string_sep)
themes_analysis <- as.data.frame(table(keywords_list))
keywords_list <- unique(keywords_list)
However I am having trouble finding how to sum the funding by searching for the keywords.
I tried different variations of ifelse function:
1.
themes_analysis$Funding <- 0
themes_analysis$Funding <- ifelse((str_detect(df$Themes, themes_analysis$Keywords))== TRUE, sum(themes_analysis$sum_funding, df$Funding), themes_analysis$Funding)
which returns "Error in str_detect(): ! Can’t recycle string (size 312) to match pattern (size 93). Backtrace: 1. base::ifelse(…) 2. stringr::str_detect(df$Themes, themes_analysis$Funding)"
2)
themes_analysis$Funding <- 0
themes_analysis$Funding <- ifelse((df$Themes %in% themes_analysis$Keywords))== TRUE, sum(themes_analysis$Funding, df$Funding), themes_analysis$Funding)
which returns the error "Error in $<-.data.frame(*tmp*, Funding, value = c(318793741, 318793741, : the replacement table has 312 lines, the replaced table has 93 lines"
- I also tried:
themes_analysis$Funding <- sum(df[which(df$Themes %in% themes_analysis$Keywords), 3])
which gives me the same incorrect number for every row, which is not what I want.
I cannot find any answer online so far.
Does anyone have any idea how to solve this?
>Solution :
Using dplyr and tidyr you can use
dd |>
tidyr::separate_longer_delim(Themes, ", ") |>
dplyr::summarize(count= dplyr::n(), funding=sum(Funding), .by=Themes)
To get
Themes count funding
1 Agriculture 4 230000
2 Water 2 250000
3 Food 3 300000
4 Digital 2 300000
5 Urban planning 1 200000
Don’t worry about searching the columns yourself, just split the multivalued cells into a tidy format by creating separate rows for each value.
Tested with
dd <- data.frame(Projects = c("Project 1", "Project 2", "Project 3", "Project 4", "Project 5"),
Themes = c("Agriculture, Water, Food", "Agriculture, Digital", "Agriculture, Food", "Urban planning, Food, Water, Digital", "Agriculture"),
Funding = c(50000L, 100000L, 50000L, 200000L, 30000L))