Identifying data frame rows in R with specific pairs of values in two columns

January 2, 2025

I would like to identify all rows in a data frame (or matrix) whose values in columns 1 and 2 match a specific pair. For example, given the matrix

testmat=rbind(c(1,1),c(1,2),c(1,4),c(2,1),c(2,4),c(3,4),c(3,10))

I would like to identify the rows that contain any of the following pairs, i.e. all rows whose first and second columns contain either 1,2 or 2,4:

of_interest = rbind(c(1,2),c(2,4))

The following does not work

which(testmat[,1] %in% of_interest[,1] & testmat[,2] %in% of_interest[,2])

because, as expected, it returns all combinations of 1,2 in the first column and 2,4 in the second (i.e. rows 2,3,5 rather than just rows 2 and 5 as desired), so that the row [1,4] is included even though this is not one of the pairs I’m querying for. There must be some simple way to use the which…%in%… to match specific pairs like this, but I haven’t been able to find an example of this that works.

Note that I need the positions/row numbers of the rows which match the desired condition.

Solution:

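Since you’re using which(), I assume you want the row positions rather than just whether there is a match. You can cbind() the row numbers onto testmat and then merge() the result with of_interest. Note that merge() sorts its output by the key columns by default, so the row numbers are not guaranteed to come back in their original order.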

merge(
    cbind(testmat, seq_len(nrow(testmat))),
    of_interest
) |> setNames(c("x", "y", "row_num"))

#   x y row_num
# 1 1 2       2
# 2 2 4       5
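
If you only need the positions and want to stay close to the which(... %in% ...) idiom from the question, one option is to collapse each row into a single key and match on that. This is a minimal sketch rather than part of the merge() approach above: the helper name row_key is hypothetical, and it assumes the values paste into unambiguous strings (true for whole numbers like these).

```r
testmat <- rbind(c(1,1), c(1,2), c(1,4), c(2,1), c(2,4), c(3,4), c(3,10))
of_interest <- rbind(c(1,2), c(2,4))

# Hypothetical helper: collapse the first two columns of each row into one string key
row_key <- function(m) paste(m[, 1], m[, 2], sep = "_")

# A row matches only if its whole (col1, col2) key appears in of_interest
matched_rows <- which(row_key(testmat) %in% row_key(of_interest))
matched_rows
# [1] 2 5
```

Unlike merge(), this returns the positions in their original row order, but it still builds one string per row, so it is unlikely to be the best choice for very large inputs.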

Rcpp approach for a very large matrix

You mention in your comments that you have 10^8 rows. This makes me think two things:

1. Don’t merge(), as this coerces to a data frame, which is very expensive.
2. You want to break out of the loop as soon as a match is found rather than continuing to iterate. See this question for the performance advantages of doing this.

Given this I would avoid using which() or other approaches which do not exit early. Here’s some Rcpp code that should be much faster than merge() with large datasets:

Rcpp::cppFunction("
IntegerVector get_row_position(NumericMatrix testmat, NumericMatrix of_interest) {
    const R_xlen_t nrow_testmat = testmat.nrow();
    const R_xlen_t nrow_of_interest = of_interest.nrow();

    IntegerVector result;

    // loop through the rows of testmat
    for (R_xlen_t i = 0; i < nrow_testmat; ++i) {
        NumericMatrix::Row test_row = testmat(i, _);

        for (R_xlen_t j = 0; j < nrow_of_interest; ++j) {
            NumericMatrix::Row interest_row = of_interest(j, _);

            if (is_true(all(test_row == interest_row))) {
                result.push_back(i + 1); // +1: C++ rows are 0-based, R rows are 1-based
                break; // leave inner loop early
            }
        }
    }
    return result;
}
")

get_row_position(testmat, of_interest)
# [1] 2 5

I think accessing whole rows (NumericMatrix::Row) is more idiomatic Rcpp than a double for-loop with element-wise indexing, but I don’t know which is faster, so if performance is your primary concern I’d try several approaches and benchmark them.
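
As a starting point for that kind of comparison, here is a rough base R sketch that checks two candidates agree and then times them with system.time(). The function names via_merge and via_keys, the 1e5 row count and the random test data are all assumptions made for illustration. The string-key candidate is an assumed alternative, not one of the approaches above, and get_row_position() could be added as a further candidate once it has been compiled.

```r
set.seed(1)
n <- 1e5  # assumed size for a quick trial; scale up toward the real data
big <- cbind(sample(1:10, n, replace = TRUE), sample(1:10, n, replace = TRUE))
of_interest <- rbind(c(1, 2), c(2, 4))

# Candidate 1: merge() with a row-number column, as in the first approach
via_merge <- function(m, pairs) {
    out <- merge(cbind(m, seq_len(nrow(m))), pairs)
    sort(out[[3]])  # third column holds the original row numbers
}

# Candidate 2 (assumed alternative): string keys matched with %in%
via_keys <- function(m, pairs) {
    which(paste(m[, 1], m[, 2]) %in% paste(pairs[, 1], pairs[, 2]))
}

# Confirm the candidates return the same rows before comparing timings
res_merge <- as.integer(via_merge(big, of_interest))
res_keys <- via_keys(big, of_interest)
stopifnot(identical(res_merge, res_keys))

system.time(via_merge(big, of_interest))
system.time(via_keys(big, of_interest))
```

Timings from a single system.time() call are noisy, so it is worth repeating each measurement several times and at a size close to the real data before drawing conclusions.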