Home How to sum values in a column based on string detection in another column and store it into a separate dataframe in R

Questions

How to sum values in a column based on string detection in another column and store it into a separate dataframe in R

byMR

November 14, 2024

I have a dataset of projects that I am trying to analyse based on themes and funding received.
I have a column "theme" with several keywords listed. I have all unique keywords possible in a list, and I am trying to sum, for each keyword, the total funding received, and store it in a dataframe.

As an example, my initial dataset would be:

Projects	Themes	Funding
Project 1	Agriculture, Water, Food	50 000
Project 2	Agriculture, Digital	100 000
Project 3	Agriculture, Food	50 000
Project 4	Urban planning, Food, Water, Digital	200 000
Project 5	Agriculture	30 000

My list of keywords "keywords_list" is a list of characters containing "Agriculture", "Water", "Food", "Digital", "Urban planning".

The result I would like is:

themes_analysis

Keywords	Count	Funding
Agriculture	4	230 000
Water	2	250 000
Food	3	300 000
Digital	2	300 000
Urban planning	1	200 000

I already managed to have the first two columns thanks to the functions

themes_string_sep <- strsplit(df$Themes, split = ",")
keywords_list <- unlist(themes_string_sep)
themes_analysis <- as.data.frame(table(keywords_list))
keywords_list <- unique(keywords_list)

However I am having trouble finding how to sum the funding by searching for the keywords.
I tried different variations of ifelse function:
1.

themes_analysis$Funding <- 0
themes_analysis$Funding <- ifelse((str_detect(df$Themes, themes_analysis$Keywords))== TRUE, sum(themes_analysis$sum_funding, df$Funding), themes_analysis$Funding)

which returns "Error in str_detect(): ! Can’t recycle string (size 312) to match pattern (size 93). Backtrace: 1. base::ifelse(…) 2. stringr::str_detect(df$Themes, themes_analysis$Funding)"

themes_analysis$Funding <- 0
themes_analysis$Funding <- ifelse((df$Themes %in% themes_analysis$Keywords))== TRUE, sum(themes_analysis$Funding, df$Funding), themes_analysis$Funding)

which returns the error "Error in $<-.data.frame(*tmp*, Funding, value = c(318793741, 318793741, : the replacement table has 312 lines, the replaced table has 93 lines"

I also tried:
themes_analysis$Funding <- sum(df[which(df$Themes %in% themes_analysis$Keywords), 3])
which gives me the same incorrect number for every row, which is not what I want.

I cannot find any answer online so far.

Does anyone have any idea how to solve this?

>Solution :

Using dplyr and tidyr you can use

dd |> 
  tidyr::separate_longer_delim(Themes, ", ") |>
  dplyr::summarize(count= dplyr::n(), funding=sum(Funding), .by=Themes)

To get

          Themes count funding
1    Agriculture     4  230000
2          Water     2  250000
3           Food     3  300000
4        Digital     2  300000
5 Urban planning     1  200000

Don’t worry about searching the columns yourself, just split the multivalued cells into a tidy format by creating separate rows for each value.

Tested with

dd <- data.frame(Projects = c("Project 1", "Project 2", "Project 3", "Project 4", "Project 5"), 
                 Themes = c("Agriculture, Water, Food", "Agriculture, Digital", "Agriculture, Food", "Urban planning, Food, Water, Digital", "Agriculture"), 
                 Funding = c(50000L, 100000L, 50000L, 200000L, 30000L))