I have a huge dataset that looks like this.
To save some memory I want to calculate the pairwise distances but leave the
upper triangle of the matrix as NULL.
library(tidyverse)
library(stringdist)
#>
#> Attaching package: 'stringdist'
#> The following object is masked from 'package:tidyr':
#>
#> extract
df3 <- tibble(fruits=c("apple","banana","ananas","apple","ananas","apple","ananas"),
position=c("135","135","135","136","137","138","138"),
counts = c(100,200,100,30,40,50,100))
stringdistmatrix(df3$fruits, method=c("osa"), nthread = 4) %>%
as.matrix()
#> 1 2 3 4 5 6 7
#> 1 0 5 5 0 5 0 5
#> 2 5 0 2 5 2 5 2
#> 3 5 2 0 5 0 5 0
#> 4 0 5 5 0 5 0 5
#> 5 5 2 0 5 0 5 0
#> 6 0 5 5 0 5 0 5
#> 7 5 2 0 5 0 5 0
Created on 2022-03-01 by the reprex package (v2.0.1)
However, when I convert my stringdistmatrix to a matrix (this step is essential for me),
the upper triangle gets filled with numbers.
Is there any way to convert to a matrix but keep the upper triangle as NULL and save memory?
I want my data to look like this
1 2 3 4 5 6
2 5
3 5 2
4 0 5 5
5 5 2 0 5
6 0 5 5 0 5
7 5 2 0 5 0 5
>Solution :
I think you may need to use sparse matrices. The Matrix package provides them:
library(Matrix)
m <- sparseMatrix(i = c(1:3, 2:3, 3), j = c(1:3, 1:2, 1), x = 1, triangular = TRUE)
m
#> 3 x 3 sparse Matrix of class "dtCMatrix"
#>
#> [1,] 1 . .
#> [2,] 1 1 .
#> [3,] 1 1 1
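To connect this to the question, here is a minimal sketch of filling such a sparse matrix from the distances themselves. It assumes the `dist` object returned by `stringdistmatrix()` stores the lower triangle column by column (the usual `dist` layout); the names `d`, `i`, `j` and `m` are only for illustration. As far as I know `sparseMatrix()` keeps zero-valued entries passed in `x` as explicit entries, which matters here because a distance of 0 between identical strings should not be confused with an empty cell, but that is worth checking on your version.

```r
library(Matrix)
library(stringdist)

fruits <- c("apple", "banana", "ananas", "apple", "ananas", "apple", "ananas")

# 'dist' object: only the n * (n - 1) / 2 lower-triangle values are stored
d <- stringdistmatrix(fruits, method = "osa")
n <- attr(d, "Size")

# Row/column index of every stored value, built without a dense n x n helper.
# Column j holds rows (j + 1):n, so column indices repeat (n - 1), (n - 2), ... times.
j <- rep(seq_len(n - 1), times = (n - 1):1)
i <- sequence((n - 1):1) + j

# Sparse matrix with entries only below the diagonal
m <- sparseMatrix(i = i, j = j, x = as.vector(d), dims = c(n, n))
m
```

Note that a sparse matrix stores a row index next to every value, so when the whole lower triangle is filled the saving over a dense matrix is modest; the `dist` object itself is the most compact of the three.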
I suspect, however, that @Maël's solution may be the best for relatively small matrices:
library(tidyverse)
library(stringdist)
mat <- stringdistmatrix(df3$fruits, method=c("osa"), nthread = 4) %>%
as.matrix()
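# A chained assignment (mat2 <- mat[...] <- NA) would leave mat2 as a single NA,
# so copy first, then blank out the diagonal and the upper triangle.
mat2 <- mat
mat2[!lower.tri(mat2)] <- NA
object.size(mat)
#> 1792 bytes
# NA cells in a dense matrix still take 8 bytes each, so the size is unchanged
object.size(mat2)
#> 1792 bytes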
Anyway, @LDT, you can try declaring your matrices both ways and then use the function object.size to check which one consumes less memory.
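A rough sketch of such a comparison, under a few assumptions: the input is the question's seven fruits repeated 50 times (a made-up larger vector, so that per-object overhead does not dominate), and `sizes` is just an illustrative name. No output is shown because the exact numbers depend on your R and package versions.

```r
library(Matrix)
library(stringdist)

# Hypothetical larger input: the question's fruits repeated 50 times (n = 350)
fruits <- rep(c("apple", "banana", "ananas", "apple", "ananas", "apple", "ananas"), 50)

d     <- stringdistmatrix(fruits, method = "osa")  # 'dist' object, lower triangle only
dense <- as.matrix(d)                              # full n x n matrix

dense_na <- dense
dense_na[!lower.tri(dense_na)] <- NA               # masked, but still a dense matrix

# Sparse matrix holding only the entries below the diagonal
n <- attr(d, "Size")
j <- rep(seq_len(n - 1), times = (n - 1):1)
i <- sequence((n - 1):1) + j
sparse <- sparseMatrix(i = i, j = j, x = as.vector(d), dims = c(n, n))

# Size of each representation in bytes
sizes <- vapply(
  list(dist = d, dense = dense, dense_na = dense_na, sparse = sparse),
  function(x) as.numeric(object.size(x)),
  numeric(1)
)
print(sizes)
```

If the conversion to a matrix is only needed for lookups, keeping the `dist` object and converting small slices on demand may be worth considering, since it is the only one of these that stores each distance exactly once with no index overhead.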