Home How can I use a value extracted from a data table to specify columns to subset in R?

Questions

How can I use a value extracted from a data table to specify columns to subset in R?

May 17, 2023

I have a dataframe that I want to subset inside a function so that only rows where both columns are either 1 or NA remain. For df:

df <- data.frame(a = c(1,1,0,NA,0,1), 
                 b = c(0,1,0,1,0, NA),
                 c = c(0,0,0,0,0,0))

I want:

The problem I’m having is I have many columns with names that change. So this works well:

subset(df, (is.na(a) | a == 1) & (is.na(b) | b == 1))

but when column names ‘a’ and ‘b’ become ‘d’ and ‘f’ during the operation of the function it breaks. Specifying by column index works more robustly:

subset(df, (is.na(df[,1]) | df[,1] == 1) & (is.na(df[,2]) | df[,2] == 1))

But is cumbersome, and if a previous processing step messes up and column ‘c’ ends up before ‘a’ or ‘b’ I end up subsetting by the wrong columns.

I also have another dataframe that specifies what the column names to subset by will be:

cro_df <- data.frame(pop = c('c1', 'c2'),
                     p1 = c('a', 'd'),
                     p2 = c('b', 'f'))
  pop p1 p2
1  c1  a  d
2  c2  b  f

I would like to be able to extract the column names from that dataframe to use in my subset function, e.g.:

col1 <- cro_df[cro_df[,'pop']=='c1', 'p1']
subset(df, is.na(col1) | col1 == 1)

This returns an empty dataframe. I have tried turning col1 into a symbol and a factor with no success:

subset(df, as.symbol(col1) == 1)
subset(df, sym(col1) == 1)
subset(df, as.factor(col1) == 1)

And they all return:

[1] a b c
<0 rows> (or 0-length row.names)

Is there a way I can specify my columns to subset using the second dataframe cro_df?

>Solution :

Perhaps this is a good start?

with(cro_df[cro_df$pop == "c1",],
  df[ (is.na(df[[p1]]) | df[[p1]] == 1) & (is.na(df[[p2]]) | df[[p2]] == 1), ]
)
#    a  b c
# 2  1  1 0
# 4 NA  1 0
# 6  1 NA 0

FYI, subset is intended for interactive use, its help page says

Warning:

     This is a convenience function intended for use interactively.
     For programming it is better to use the standard subsetting
     functions like [, and in particular the non-standard evaluation
     of argument ‘subset’ can have unanticipated consequences.