I have a dataframe that I want to subset inside a function so that only rows where two chosen columns (here a and b) are each either 1 or NA remain. For example, given df:
df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))
I want:
   a  b c
2  1  1 0
4 NA  1 0
6  1 NA 0
The problem I’m having is I have many columns with names that change. So this works well:
subset(df, (is.na(a) | a == 1) & (is.na(b) | b == 1))
but when column names ‘a’ and ‘b’ become ‘d’ and ‘f’ during the operation of the function it breaks. Specifying by column index works more robustly:
subset(df, (is.na(df[,1]) | df[,1] == 1) & (is.na(df[,2]) | df[,2] == 1))
But this is cumbersome, and if a previous processing step goes wrong and column ‘c’ ends up before ‘a’ or ‘b’, I end up subsetting by the wrong columns.
I also have another dataframe that specifies what the column names to subset by will be:
cro_df <- data.frame(pop = c('c1', 'c2'),
                     p1 = c('a', 'd'),
                     p2 = c('b', 'f'))
  pop p1 p2
1  c1  a  b
2  c2  d  f
I would like to be able to extract the column names from that dataframe to use in my subset function, e.g.:
col1 <- cro_df[cro_df[,'pop']=='c1', 'p1']
subset(df, is.na(col1) | col1 == 1)
This returns an empty dataframe. I have tried turning col1 into a symbol and a factor with no success:
subset(df, as.symbol(col1) == 1)
subset(df, sym(col1) == 1)
subset(df, as.factor(col1) == 1)
And they all return:
[1] a b c
<0 rows> (or 0-length row.names)
Is there a way I can specify my columns to subset using the second dataframe cro_df?
>Solution :
Perhaps this is a good start?
with(cro_df[cro_df$pop == "c1", ],
     df[(is.na(df[[p1]]) | df[[p1]] == 1) & (is.na(df[[p2]]) | df[[p2]] == 1), ]
)
#    a  b c
# 2  1  1 0
# 4 NA  1 0
# 6  1 NA 0
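If there are many columns, the same idea can be wrapped in a small function that takes the column names as a character vector. The following is only a sketch: `keep_one_or_na` is a hypothetical helper name (not from any package), and it assumes the names looked up in cro_df actually exist in df.

```r
# Example data from the question
df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))
cro_df <- data.frame(pop = c("c1", "c2"),
                     p1 = c("a", "d"),
                     p2 = c("b", "f"))

# Hypothetical helper: keep rows where every column named in `cols` is 1 or NA
keep_one_or_na <- function(data, cols) {
  # One logical vector per column: TRUE where the value is NA or equal to 1
  ok <- lapply(data[cols], function(x) is.na(x) | x == 1)
  # Combine the per-column vectors with element-wise AND
  data[Reduce(`&`, ok), , drop = FALSE]
}

# Look up the column names for population "c1" as a plain character vector
# (as.character() also covers the case where the columns are factors)
cols <- as.character(unlist(cro_df[cro_df$pop == "c1", c("p1", "p2")]))

res <- keep_one_or_na(df, cols)
res
```

Because the columns are selected by name with `data[cols]`, their order in df does not matter, and adding a p3 column to cro_df would only require extending the lookup.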
FYI, subset is intended for interactive use; its help page says:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like [, and in particular the non-standard evaluation
of argument ‘subset’ can have unanticipated consequences.
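One practical difference to keep in mind when switching from subset to `[` is how NA in the condition is handled. A short illustration, using the question's df and a column name held in a string (`col1` here is just an example variable):

```r
df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))

col1 <- "a"  # column name held as a string

# subset() treats an NA condition as FALSE, so the row with a = NA is dropped
n_subset <- nrow(subset(df, a == 1))

# `[` returns an all-NA row for each NA in the condition
n_bracket <- nrow(df[df[[col1]] == 1, ])

# Wrapping the condition in which() keeps only the positions that are TRUE
n_which <- nrow(df[which(df[[col1]] == 1), ])

c(subset = n_subset, bracket = n_bracket, which = n_which)
```

In the solution above this does not come up, because `is.na(x) | x == 1` never evaluates to NA.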