Apologies if the title is confusing, but below is what I would like to accomplish. Let’s say I have a table as seen below.
df <- data.frame(
patient = paste0("patient",seq(1:6)),
gene_1 = c(10,5,0,0,1,0),
gene_2 = c(0,26,4,5,6,1),
gene_3 = c(1,3,5,12,44,1)
)
| patient | gene_1 | gene_2 | gene_3 |
|---|---|---|---|
| patient1 | 10 | 0 | 1 |
| patient2 | 5 | 26 | 3 |
| patient3 | 0 | 4 | 5 |
| patient4 | 0 | 5 | 12 |
| patient5 | 1 | 6 | 44 |
| patient6 | 0 | 1 | 1 |
What I want is another table that records the total number of pairs only if both values are non-zero. The table would look like so:
| col1 | col2 | number-of-pairs |
|---|---|---|
| gene1 | gene2 | 2 |
| gene1 | gene3 | 3 |
| gene2 | gene3 | 5 |
Any help is appreciated. Thank you.
>Solution :
We can do this by pivoting your data to a long format, doing a self-join, and then filtering:
library(tidyr)
library(dplyr)
## Long format, keep only non-zeros
long_data = pivot_longer(df, -patient) %>%
filter(value != 0) %>%
select(-value)
## Self join on patient,
## Remove exact matches (can't pair with yourself)
## And use < to remove doublecounts
long_data %>%
left_join(long_data, by = "patient") %>%
filter(name.x != name.y & name.x < name.y) %>%
count(name.x, name.y)
# # A tibble: 3 × 3
# name.x name.y n
# <chr> <chr> <int>
# 1 gene_1 gene_2 2
# 2 gene_1 gene_3 3
# 3 gene_2 gene_3 5