How to filter on colnames within function

Let's say I have the following data frame:

  sample genea geneb genec gened genee genef
1      1     1     1     1     0     0     0
2      2     1     1     1     0     0     0
3      3     1     0     0     1     1     1
4      4     0     0     0     0     0     0
5      5     1     0     1     1     1     1
6      6     0     0     0     0     0     0

and what I am trying to do is write a function that lets me specify which column to filter on:

test <- function(gene){
t <- df %>% filter(gene == 1) %>% dplyr::select(sample)
}

but when I do:

test(genea): this does not work because genea is not an object.
test("genea"): this does not work because I am not trying to select a character vector but a column name.

test(unquote("genea")): I thought that would work, but it does not.

So the question is: how do you pass a column name to filter on within a function?

>Solution :

Due to the non-standard evaluation used in the dplyr verbs, filter takes the expression gene == 1 literally: it looks for a column called gene in your data frame and, not finding one, falls back to the function argument gene itself, rather than treating the argument as the name of a column. To get around this, you need to capture the symbol being passed to the function (known as "quoting"), then tell filter that you want to "unquote" that variable.

Since version 0.4.0 of rlang (the package that contains much of the machinery behind the tidyverse's use of non-standard evaluation), we have been able to achieve these two operations in a single step, using the curly-curly syntax: filter({{gene}} == 1).
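To make the "quote, then unquote" idea concrete, here is a minimal sketch that writes the two steps out separately with enquo() and !!, next to the curly-curly shorthand. The function names test_two_step and test_curly are made up for this illustration, and the data frame is a cut-down copy of the one in the question.

```r
library(dplyr)

# Cut-down copy of the data from the question
df <- data.frame(
  sample = 1:6,
  genea  = c(1L, 1L, 1L, 0L, 1L, 0L),
  geneb  = c(1L, 1L, 0L, 0L, 0L, 0L)
)

# Two explicit steps: enquo() captures the argument as typed by the caller,
# and !! unquotes it inside filter()
test_two_step <- function(gene) {
  gene_quo <- enquo(gene)
  df %>% filter((!!gene_quo) == 1) %>% select(sample)
}

# The same thing in one step with curly-curly
test_curly <- function(gene) {
  df %>% filter({{ gene }} == 1) %>% select(sample)
}

test_two_step(genea)
test_curly(genea)
```

Both calls should return the same rows; the curly-curly form is simply shorter to write.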

Note also that your function assigns the result to the variable t as its last step, so the value is only returned invisibly and nothing is printed when you call it. A working version of your function would be:

library(dplyr)

test <- function(gene) { 
  df %>% filter({{gene}} == 1) %>% select(sample) 
}

We can see that this does the trick:

test(genea)
#>   sample
#> 1      1
#> 2      2
#> 3      3
#> 5      5

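If the column name arrives as a string instead, as in the test("genea") attempt in the question, one option is the .data pronoun that dplyr provides for use inside its verbs. The sketch below assumes a hypothetical variant of the function called test_chr, and uses a cut-down copy of the data.

```r
library(dplyr)

# Cut-down copy of the data from the question
df <- data.frame(
  sample = 1:6,
  genea  = c(1L, 1L, 1L, 0L, 1L, 0L),
  geneb  = c(1L, 1L, 0L, 0L, 0L, 0L)
)

# gene is a character string here; .data[[gene]] looks up the column
# with that name in the data frame being filtered
test_chr <- function(gene) {
  df %>% filter(.data[[gene]] == 1) %>% select(sample)
}

test_chr("geneb")
```

This form is handy when the column names come from a character vector, for example when looping over several genes.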
A further point to note is that it is not great practice for a function to refer to variables that are presumed to exist in the calling scope. Rather than having your function refer to df, you should probably have it explicitly take a data argument, and rather than assuming your data has a column called sample, you should pass the column you wish to select.

Your function might then look like this:

test <- function(data, gene, col) { 
  data %>% filter({{gene}} == 1) %>% select({{col}}) 
}

And be called like this:

test(df, genea, sample)

This gives the same result, but is a more useful function which can be used whatever the names of your data frame and sample column are.
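As a quick illustration of that flexibility, the sketch below calls the same function on a made-up data frame. The names patients, id and brca1 are hypothetical and exist only for this example.

```r
library(dplyr)

test <- function(data, gene, col) {
  data %>% filter({{ gene }} == 1) %>% select({{ col }})
}

# Hypothetical data frame with different column names
patients <- data.frame(
  id    = c("p1", "p2", "p3"),
  brca1 = c(0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Keep the rows where brca1 is 1 and return only the id column
test(patients, brca1, id)
```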

Created on 2022-11-10 with reprex v2.0.2


Data in reproducible format

df <- structure(list(sample = 1:6, genea = c(1L, 1L, 1L, 0L, 1L, 0L
), geneb = c(1L, 1L, 0L, 0L, 0L, 0L), genec = c(1L, 1L, 0L, 0L, 
1L, 0L), gened = c(0L, 0L, 1L, 0L, 1L, 0L), genee = c(0L, 0L, 
1L, 0L, 1L, 0L), genef = c(0L, 0L, 1L, 0L, 1L, 0L)), 
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
