Custom function with dplyr::summarise with conditions

July 19, 2024

I want to create a function named ratio_function that does the same as the following code:

data = data %>%
  group_by(ID) %>%
  summarise(sum_ratio = sum(surface[category == "A"], na.rm = TRUE) /
              sum(total_area[category == "A"], na.rm = TRUE) *
              mean(`MEAN`[category == "A"], na.rm = TRUE))

but I want to call it inside summarise, like this:

data = data %>% 
  group_by(ID) %>% 
  summarise(sum_ratio = ratio_function("A"))

The problem is that surface, total_area, MEAN and category aren't recognized as column names once they are referenced inside the function instead of directly in summarise.

>Solution :

Columns such as surface, total_area and category are only visible by name in the expressions you write directly inside summarise(), where dplyr makes the columns of the current group available. The body of a function defined elsewhere does not see them: R looks up names among the function's own arguments and local variables, then in the environment where the function was defined, so the lookup fails. The simplest fix is to pass the columns to the function as arguments, like this:

ratio_function <- function(surface, total_area, MEAN, category, selected_category = "A") {
  # Rows of the current group that belong to the selected category
  keep <- category == selected_category
  sum(surface[keep], na.rm = TRUE) / sum(total_area[keep], na.rm = TRUE) * mean(MEAN[keep], na.rm = TRUE)
}

data %>% 
  group_by(ID) %>% 
  summarise(sum_ratio = ratio_function(surface, total_area, MEAN, category, "A"))
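
As a quick sanity check, the sketch below runs the function on a small data frame and compares the result with the inline expression from the question. The data frame `toy` and its values are made up for illustration, and the example assumes dplyr is installed; it is not the asker's real data.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(
  ID         = c(1, 1, 1, 2, 2),
  category   = c("A", "A", "B", "A", "B"),
  surface    = c(10, 20, 5, 8, 2),
  total_area = c(100, 100, 50, 40, 10),
  MEAN       = c(2, 4, 9, 3, 7)
)

# Every filter uses selected_category, so the function is not tied to "A"
ratio_function <- function(surface, total_area, MEAN, category, selected_category = "A") {
  keep <- category == selected_category
  sum(surface[keep], na.rm = TRUE) /
    sum(total_area[keep], na.rm = TRUE) *
    mean(MEAN[keep], na.rm = TRUE)
}

# Version using the function
with_function <- toy %>%
  group_by(ID) %>%
  summarise(sum_ratio = ratio_function(surface, total_area, MEAN, category, "A"))

# Inline version from the question
inline <- toy %>%
  group_by(ID) %>%
  summarise(sum_ratio = sum(surface[category == "A"], na.rm = TRUE) /
              sum(total_area[category == "A"], na.rm = TRUE) *
              mean(MEAN[category == "A"], na.rm = TRUE))

# Both approaches should agree group by group
all.equal(with_function$sum_ratio, inline$sum_ratio)
```

With this toy data, group 1 gives 30 / 200 * 3 = 0.45 and group 2 gives 8 / 40 * 3 = 0.6 in both versions.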

Here the arguments are named after the columns, but they are ordinary function arguments: when you call the function you can pass any column for each part of the calculation, for example another column in place of surface. Because that can become confusing later, you may want to rename the arguments so that they describe their role in the calculation rather than repeating the column names of this particular data set.
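
One possible way to make the arguments more descriptive is sketched below. The function name `weighted_ratio` and the argument names `numerator`, `denominator`, `weight`, `group` and `level` are hypothetical choices, not something from the question, and the data frame `toy` is made up for illustration.

```r
library(dplyr)

# Hypothetical, role-based argument names instead of column names
weighted_ratio <- function(numerator, denominator, weight, group, level) {
  keep <- group == level  # rows of the current group matching the requested level
  sum(numerator[keep], na.rm = TRUE) /
    sum(denominator[keep], na.rm = TRUE) *
    mean(weight[keep], na.rm = TRUE)
}

# Made-up data, for illustration only
toy <- tibble(
  ID         = c(1, 1, 1, 2, 2),
  category   = c("A", "A", "B", "A", "B"),
  surface    = c(10, 20, 5, 8, 2),
  total_area = c(100, 100, 50, 40, 10),
  MEAN       = c(2, 4, 9, 3, 7)
)

# Naming the arguments at the call site documents which column plays which role
result <- toy %>%
  group_by(ID) %>%
  summarise(sum_ratio = weighted_ratio(
    numerator   = surface,
    denominator = total_area,
    weight      = MEAN,
    group       = category,
    level       = "A"
  ))
```

Leaving `level` without a default forces every call to state the category explicitly; whether that is preferable to a default of "A" is a design choice that depends on how the function will be reused.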