Calculating a similarity matrix in R counting only shared columns of binary data

I'm working in R, trying to calculate the similarity/distance of rows in a data.frame (each row is an item) from each other according to shared membership in groups (columns). However, I don't want 0 values (i.e. not being a member of a group) to contribute to the similarity. (What I want is something like Manhattan distance, but with different handling of 0s.)

For example, for this dataset:

Group1 Group2 Group3
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1

I want a similarity matrix that looks like this:

1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 1
0 0 1 0 1 0 1 1
0 0 0 1 0 1 1 1
0 1 1 0 2 1 1 2
0 1 0 1 1 2 1 2
0 0 1 1 1 1 2 2
0 1 1 1 2 2 2 3

Note that the diagonal values aren’t particularly important for my downstream applications, so alternative methods that give the same output as this but with a different diagonal are a fine solution for me.

Given the first matrix, some very very slow code that can calculate the second similarity matrix is:

calc_simil <- function(x) {
  out <- matrix(nrow = nrow(x), ncol = nrow(x))
  # All ordered pairs of row indices (both triangles, so each pair is visited twice)
  combos <- expand.grid(1:nrow(x), 1:nrow(x))
  for (myrow in 1:nrow(combos)) {
    i <- combos[myrow, 1]
    j <- combos[myrow, 2]
    temp <- x[c(i, j), ]
    # A column counts only when neither row is 0 there (both items are members)
    # and the two rows agree; for binary data this means both entries are 1
    out[i, j] <- out[j, i] <-
      sum((1 - apply(temp, MARGIN = 2, FUN = function(col) any(col == 0))) *
            (1 - abs(temp[1, ] - temp[2, ])))
  }
  return(out)
}

I know there must be a more efficient way to do this, probably using some matrix multiplication wizardry, but I can’t figure it out. I’ve also looked at various built-in methods to calculate distance, including some functions from R packages, but none seem to calculate this number of shared groups while ignoring shared absences from groups.

Anyone have any suggestions? Have I simply overlooked a common built-in distance method? Or is there some much faster way to calculate this distance/similarity?

Solution:

You can simply use tcrossprod(), which computes as.matrix(df) %*% t(as.matrix(df)). Because the data are binary, entry (i, j) counts the columns where both row i and row j are 1, so shared absences (0s) add nothing:

tcrossprod(as.matrix(df))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    0    0    0    0    0    0    0
[2,]    0    1    0    0    1    1    0    1
[3,]    0    0    1    0    1    0    1    1
[4,]    0    0    0    1    0    1    1    1
[5,]    0    1    1    0    2    1    1    2
[6,]    0    1    0    1    1    2    1    2
[7,]    0    0    1    1    1    1    2    2
[8,]    0    1    1    1    2    2    2    3
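To see why this works: for 0/1 data, tcrossprod(X)[i, j] is the dot product of rows i and j, which counts the columns where both rows are 1; a shared 0 contributes 0 × 0 = 0. A minimal self-contained check, rebuilding the question's example data as df (an assumed name):

```r
# Example membership matrix from the question (one row per item)
df <- data.frame(
  Group1 = c(0, 1, 0, 0, 1, 1, 0, 1),
  Group2 = c(0, 0, 1, 0, 1, 0, 1, 1),
  Group3 = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# tcrossprod(X) is X %*% t(X): entry (i, j) counts the columns where
# both row i and row j are 1, so shared absences never contribute
m <- tcrossprod(as.matrix(df))

m[5, 8]  # rows 5 and 8 share Group1 and Group2 -> 2
m[1, 1]  # row 1 belongs to no group -> 0
```

Note the diagonal here is each row's group count; since the diagonal doesn't matter for your downstream use, you could also zero it afterwards with diag(m) <- 0.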