Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to sum values in a column based on string detection in another column and store it into a separate dataframe in R

I have a dataset of projects that I am trying to analyse based on themes and funding received.
I have a column "theme" with several keywords listed. I have all unique keywords possible in a list, and I am trying to sum, for each keyword, the total funding received, and store it in a dataframe.

As an example, my initial dataset would be:

df

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Projects Themes Funding
Project 1 Agriculture, Water, Food 50 000
Project 2 Agriculture, Digital 100 000
Project 3 Agriculture, Food 50 000
Project 4 Urban planning, Food, Water, Digital 200 000
Project 5 Agriculture 30 000

My list of keywords "keywords_list" is a list of characters containing "Agriculture", "Water", "Food", "Digital", "Urban planning".

The result I would like is:

themes_analysis

Keywords Count Funding
Agriculture 4 230 000
Water 2 250 000
Food 3 300 000
Digital 2 300 000
Urban planning 1 200 000

I already managed to have the first two columns thanks to the functions

themes_string_sep <- strsplit(df$Themes, split = ",")
keywords_list <- unlist(themes_string_sep)
themes_analysis <- as.data.frame(table(keywords_list))
keywords_list <- unique(keywords_list)

However I am having trouble finding how to sum the funding by searching for the keywords.
I tried different variations of ifelse function:
1.

themes_analysis$Funding <- 0
themes_analysis$Funding <- ifelse((str_detect(df$Themes, themes_analysis$Keywords))== TRUE, sum(themes_analysis$sum_funding, df$Funding), themes_analysis$Funding)

which returns "Error in str_detect(): ! Can’t recycle string (size 312) to match pattern (size 93). Backtrace: 1. base::ifelse(…) 2. stringr::str_detect(df$Themes, themes_analysis$Funding)"

2)

themes_analysis$Funding <- 0
themes_analysis$Funding <- ifelse((df$Themes %in% themes_analysis$Keywords))== TRUE, sum(themes_analysis$Funding, df$Funding), themes_analysis$Funding)

which returns the error "Error in $<-.data.frame(*tmp*, Funding, value = c(318793741, 318793741, : the replacement table has 312 lines, the replaced table has 93 lines"

  1. I also tried:
    themes_analysis$Funding <- sum(df[which(df$Themes %in% themes_analysis$Keywords), 3])
    which gives me the same incorrect number for every row, which is not what I want.

I cannot find any answer online so far.

Does anyone have any idea how to solve this?

>Solution :

Using dplyr and tidyr you can use

dd |> 
  tidyr::separate_longer_delim(Themes, ", ") |>
  dplyr::summarize(count= dplyr::n(), funding=sum(Funding), .by=Themes)

To get

          Themes count funding
1    Agriculture     4  230000
2          Water     2  250000
3           Food     3  300000
4        Digital     2  300000
5 Urban planning     1  200000

Don’t worry about searching the columns yourself, just split the multivalued cells into a tidy format by creating separate rows for each value.

Tested with

dd <- data.frame(Projects = c("Project 1", "Project 2", "Project 3", "Project 4", "Project 5"), 
                 Themes = c("Agriculture, Water, Food", "Agriculture, Digital", "Agriculture, Food", "Urban planning, Food, Water, Digital", "Agriculture"), 
                 Funding = c(50000L, 100000L, 50000L, 200000L, 30000L))

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading