Calculating a similarity matrix in R counting only shared columns of binary data

I'm working in R, trying to calculate the similarity/distance of rows in a data.frame (each row is an item) from each other according to shared membership in groups (columns). However, I don't want 0 values (i.e. not being a member of a group) to contribute to the similarity. (What I want is something like Manhattan distance, but with different handling of 0s.)

For example, for this dataset:

Group1 Group2 Group3
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1

I want a similarity matrix that looks like this:

1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 1
0 0 1 0 1 0 1 1
0 0 0 1 0 1 1 1
0 1 1 0 2 1 1 2
0 1 0 1 1 2 1 2
0 0 1 1 1 1 2 2
0 1 1 1 2 2 2 3

Note that the diagonal values aren’t particularly important for my downstream applications, so alternative methods that give the same output as this but with a different diagonal are a fine solution for me.

Given the first matrix, some very very slow code that can calculate the second similarity matrix is:

calc_simil <- function(x) {
  out <- matrix(nrow = nrow(x), ncol = nrow(x))
  # All ordered pairs of row indices (both triangles, so each pair is visited twice)
  combos <- expand.grid(1:nrow(x), 1:nrow(x))
  for (myrow in 1:nrow(combos)) {
    i <- combos[myrow, 1]
    j <- combos[myrow, 2]
    temp <- x[c(i, j), ]
    # A column counts only when neither row is 0 there (both items are members)
    # and the two rows agree; for binary data this means both entries are 1
    out[i, j] <- out[j, i] <-
      sum((1 - apply(temp, MARGIN = 2, FUN = function(col) any(col == 0))) *
            (1 - abs(temp[1, ] - temp[2, ])))
  }
  return(out)
}

I know there must be a more efficient way to do this, probably using some matrix multiplication wizardry, but I can’t figure it out. I’ve also looked at various built-in methods to calculate distance, including some functions from R packages, but none seem to calculate this number of shared groups while ignoring shared absences from groups.

Anyone have any suggestions? Have I simply overlooked a common built-in distance method? Or is there some much faster way to calculate this distance/similarity?

Solution:

You can simply use tcrossprod(), which computes as.matrix(df) %*% t(as.matrix(df)). Because the data are binary, entry (i, j) counts the columns where both row i and row j are 1, so shared absences (0s) add nothing:

tcrossprod(as.matrix(df))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    0    0    0    0    0    0    0
[2,]    0    1    0    0    1    1    0    1
[3,]    0    0    1    0    1    0    1    1
[4,]    0    0    0    1    0    1    1    1
[5,]    0    1    1    0    2    1    1    2
[6,]    0    1    0    1    1    2    1    2
[7,]    0    0    1    1    1    1    2    2
[8,]    0    1    1    1    2    2    2    3
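To see why this works: for 0/1 data, tcrossprod(X)[i, j] is the dot product of rows i and j, which counts the columns where both rows are 1; a shared 0 contributes 0 × 0 = 0. A minimal self-contained check, rebuilding the question's example data as df (an assumed name):

```r
# Example membership matrix from the question (one row per item)
df <- data.frame(
  Group1 = c(0, 1, 0, 0, 1, 1, 0, 1),
  Group2 = c(0, 0, 1, 0, 1, 0, 1, 1),
  Group3 = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# tcrossprod(X) is X %*% t(X): entry (i, j) counts the columns where
# both row i and row j are 1, so shared absences never contribute
m <- tcrossprod(as.matrix(df))

m[5, 8]  # rows 5 and 8 share Group1 and Group2 -> 2
m[1, 1]  # row 1 belongs to no group -> 0
```

Note the diagonal here is each row's group count; since the diagonal doesn't matter for your downstream use, you could also zero it afterwards with diag(m) <- 0.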