I would like to identify all rows in a data frame (or matrix) whose values in columns 1 and 2 match a specific pair. For example, if I have a matrix
testmat=rbind(c(1,1),c(1,2),c(1,4),c(2,1),c(2,4),c(3,4),c(3,10))
I would like to identify the rows that contain any of the following pairs, i.e. all rows that contain a combination of either 1,2 or 2,4 in their first and second columns
of_interest = rbind(c(1,2),c(2,4))
The following does not work
which(testmat[,1] %in% of_interest[,1] & testmat[,2] %in% of_interest[,2])
because, as expected, it returns every row whose first column is 1 or 2 and whose second column is 2 or 4 (i.e. rows 2, 3 and 5 rather than just rows 2 and 5 as desired), so the row [1,4] is included even though it is not one of the pairs I’m querying for. There must be some simple way to use which() with %in% to match specific pairs like this, but I haven’t been able to find a working example.
Note that I need the positions/row numbers of the rows which match the desired condition.
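To make the over-matching concrete, here is a small sketch that evaluates the two column tests separately. The names col1_hit and col2_hit are only illustrative; the data are the same as above.

```r
testmat <- rbind(c(1, 1), c(1, 2), c(1, 4), c(2, 1), c(2, 4), c(3, 4), c(3, 10))
of_interest <- rbind(c(1, 2), c(2, 4))

# Each test looks at one column in isolation, so the pairing between
# the columns of of_interest is lost.
col1_hit <- testmat[, 1] %in% of_interest[, 1]  # first column is 1 or 2
col2_hit <- testmat[, 2] %in% of_interest[, 2]  # second column is 2 or 4

# Row 3 is c(1, 4): 1 comes from the pair (1, 2) and 4 from the pair (2, 4),
# so it passes both tests without being one of the wanted pairs.
which(col1_hit & col2_hit)
# [1] 2 3 5
```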
>Solution :
I assume as you’re using which() you want the position, rather than just whether there is a match. You can cbind() the row number to testmat and then merge() this with of_interest.
merge(
  cbind(testmat, seq_len(nrow(testmat))),
  of_interest
) |> setNames(c("x", "y", "row_num"))
#   x y row_num
# 1 1 2       2
# 2 2 4       5
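If you only need the row numbers as a vector, a self-contained variant of the same idea might look like the sketch below. It names the columns up front instead of relying on the default names, and the data frame names indexed and pairs are just placeholders. merge() orders its result by the key columns rather than by the original row order, so the row numbers are sorted at the end.

```r
testmat <- rbind(c(1, 1), c(1, 2), c(1, 4), c(2, 1), c(2, 4), c(3, 4), c(3, 10))
of_interest <- rbind(c(1, 2), c(2, 4))

# Attach the original row number so it survives the merge.
indexed <- data.frame(
  x = testmat[, 1],
  y = testmat[, 2],
  row_num = seq_len(nrow(testmat))
)
pairs <- data.frame(x = of_interest[, 1], y = of_interest[, 2])

# Inner join on both columns keeps only rows whose (x, y) pair is wanted.
matched <- merge(indexed, pairs, by = c("x", "y"))

# Sort to get the positions back in row order.
sort(matched$row_num)
# [1] 2 5
```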
Rcpp approach with very large matrix
You mention in your comments that you have 10^8 rows. This makes me think two things:
- Don’t merge(), as this will coerce to a data frame, which is very expensive.
- You want to break out of the loop as soon as a match is found rather than continuing to iterate. See this question for performance advantages of doing this.
Given this I would avoid using which() or other approaches which do not exit early. Here’s some Rcpp code that should be much faster than merge() with large datasets:
Rcpp::cppFunction("
IntegerVector get_row_position(NumericMatrix testmat, NumericMatrix of_interest) {
    const R_xlen_t nrow_testmat = testmat.nrow();
    const R_xlen_t nrow_of_interest = of_interest.nrow();
    IntegerVector result;
    // loop through the rows of testmat
    for (R_xlen_t i = 0; i < nrow_testmat; ++i) {
        NumericMatrix::Row test_row = testmat(i, _);
        for (R_xlen_t j = 0; j < nrow_of_interest; ++j) {
            NumericMatrix::Row interest_row = of_interest(j, _);
            if (is_true(all(test_row == interest_row))) {
                result.push_back(i + 1); // because of 1-indexing
                break; // leave inner loop early
            }
        }
    }
    return result;
}
")
get_row_position(testmat, of_interest)
# [1] 2 5
I think accessing rows as sub-matrices is more idiomatic Rcpp code than a double for-loop with array indexing, but I have no idea which is faster. If performance is your primary concern, I’d try various approaches and benchmark them.
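As a rough starting point for that kind of benchmark, here is a base R sketch that needs no compiler. It times the merge() approach against a pasted-key lookup on random data. The helper names by_merge and by_key are made up for this example, the pasted-key idea is my own assumption rather than something from the code above, and the matrix is kept far smaller than 10^8 rows so it runs quickly. Treat the timings as indicative only.

```r
set.seed(1)
n <- 1e5  # far smaller than 10^8, just to keep the sketch quick
big <- cbind(sample(1:10, n, replace = TRUE), sample(1:10, n, replace = TRUE))
of_interest <- rbind(c(1, 2), c(2, 4))

# merge() approach: the third column carries the original row number.
by_merge <- function(m, pairs) {
  hits <- merge(cbind(m, seq_len(nrow(m))), pairs)
  sort(as.integer(hits[[3]]))
}

# Pasted-key approach (an assumption, not from the answer): build one string
# key per row so a single %in% tests the whole pair at once. This relies on
# the values converting to identical strings, e.g. whole numbers.
by_key <- function(m, pairs) {
  which(paste(m[, 1], m[, 2]) %in% paste(pairs[, 1], pairs[, 2]))
}

system.time(res_merge <- by_merge(big, of_interest))
system.time(res_key <- by_key(big, of_interest))

# Both approaches should agree before the timings mean anything.
stopifnot(identical(res_merge, res_key))
```

The same pattern extends to get_row_position(): add another system.time() call and compare its result against the others.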