Find closest match in gene expression data

October 26, 2022

I am analyzing a dataset and need to find matching samples between two versions of the data.
They (should) contain the same expression data, but they have different sample identifiers. Let's say the first dataframe looks like this:

   gene sample expression
1     a      a          1
2     a      b          2
3     a      c          3
4     a      d          4
5     a      e          5
6     a      f          6
7     a      g          7
8     a      h          8
9     a      i          9
10    a      j         10
11    a      k         11
12    a      l         12
13    a      m         13
14    a      n         14

I made the dataframe for one gene, but you can imagine that this is a large dataset containing ~20k genes. What I need to do is find the closest match in gene expression so I know which samples correspond. The second dataframe might look like this:

   gene sample expression
1     a      z        1.5
2     a      y        2.5
3     a      x          3
4     a      w        4.5
5     a      v        5.7
6     a      u        6.2
7     a      t        7.8
8     a      s        8.1
9     a      r        9.8
10    a      q       10.5
11    a      p         11
12    a      o         12
13    a      2       13.3
14    a      4       14.4

What I need to do is write a function (or something like that) that tries to match the expression of the genes in one dataframe to the other as closely as possible (for all genes) and reports the sample identifiers with the closest match. I’m quite new to R and could use a little help.

I would like the output to look like this:

   gene sample expression sample2
1     a      z          1       z
2     a      y          2       y
3     a      x          3       x
4     a      w          4       w
5     a      v          5       v
6     a      u          6       u
7     a      t          7       t
8     a      s          8       s
9     a      r          9       r
10    a      q         10       q
11    a      p         11       p
12    a      o         12       o
13    a      2         13       2
14    a      4         14       4

That is, an extra column per sample that specifies the closest match in gene expression across all genes. The extra column must be based on all genes, not on a single gene.

>Solution :

Here are two options. Both match on the expression values of the single gene in your example. In that example there is always a whole-number match, so you could join on the whole number. Alternatively, you could pick the closest number. I use floor() because it looks like you want 1.5 to be matched to 1 and not 2.
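To run the snippets in this answer, df1 and df2 need to exist. Here is a minimal sketch, assuming both are plain data frames holding the two tables from the question (how the real data is loaded is not shown, so this construction is an assumption, and nearest() is a hypothetical helper used only for illustration). It also shows why floor() matters: without it, an expression of 2 is equally far from 1.5 and 2.5, and which.min() returns the first of the tied positions.

```r
# Assumed reconstruction of the question's two tables (one gene, 14 samples each).
df1 <- data.frame(
  gene       = "a",
  sample     = letters[1:14],
  expression = as.numeric(1:14)
)
df2 <- data.frame(
  gene       = "a",
  sample     = c("z", "y", "x", "w", "v", "u", "t", "s", "r", "q", "p", "o", "2", "4"),
  expression = c(1.5, 2.5, 3, 4.5, 5.7, 6.2, 7.8, 8.1, 9.8, 10.5, 11, 12, 13.3, 14.4)
)

# Hypothetical helper: the df2 sample whose candidate value is closest to x.
nearest <- function(x, candidates) df2$sample[which.min(abs(x - candidates))]

nearest(2, df2$expression)         # tie between 1.5 and 2.5, first wins: "z"
nearest(2, floor(df2$expression))  # floored values are 1, 2, 3, ...: "y"
```

If you prefer tibble printing as in the outputs shown here, tibble::as_tibble(df1) should give the same columns.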

library(tidyverse)


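# extract closest whole number: for each df1 value, take the df2 sample
# whose floored expression is nearest (ties go to the first such sample)
df1 |>
  mutate(sample2 = map_chr(expression,
                           \(x) df2$sample[which.min(abs(x - floor(df2$expression)))]))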
#> # A tibble: 14 x 4
#>    gene  sample expression sample2
#>    <chr> <chr>       <dbl> <chr>  
#>  1 a     a               1 z      
#>  2 a     b               2 y      
#>  3 a     c               3 x      
#>  4 a     d               4 w      
#>  5 a     e               5 v      
#>  6 a     f               6 u      
#>  7 a     g               7 t      
#>  8 a     h               8 s      
#>  9 a     i               9 r      
#> 10 a     j              10 q      
#> 11 a     k              11 p      
#> 12 a     l              12 o      
#> 13 a     m              13 2      
#> 14 a     n              14 4

# join by whole number: floor df2's expression, then join on it
left_join(df1,
          df2 |>
            mutate(expression = floor(expression)) |>
            select(sample2 = sample, expression),
          by = "expression")
#> # A tibble: 14 x 4
#>    gene  sample expression sample2
#>    <chr> <chr>       <dbl> <chr>  
#>  1 a     a               1 z      
#>  2 a     b               2 y      
#>  3 a     c               3 x      
#>  4 a     d               4 w      
#>  5 a     e               5 v      
#>  6 a     f               6 u      
#>  7 a     g               7 t      
#>  8 a     h               8 s      
#>  9 a     i               9 r      
#> 10 a     j              10 q      
#> 11 a     k              11 p      
#> 12 a     l              12 o      
#> 13 a     m              13 2      
#> 14 a     n              14 4
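Both options above match on a single gene. Since the question asks for a match based on all genes, here is a rough sketch of one possible way to do that in base R: reshape each data frame into a gene-by-sample matrix, compute the squared Euclidean distance between every pair of samples across all shared genes, and keep the nearest df2 sample for each df1 sample. The toy values and the names toy1, toy2, m1, m2, d and matches are invented for illustration, and the sketch assumes every gene is measured in every sample (no missing values).

```r
# Hypothetical toy data: 3 genes, 3 samples per version (values are made up).
toy1 <- data.frame(
  gene       = rep(c("a", "b", "c"), each = 3),
  sample     = rep(c("s1", "s2", "s3"), times = 3),
  expression = c(1, 5, 9,  2, 6, 7,  3, 4, 8)
)
toy2 <- data.frame(
  gene       = rep(c("a", "b", "c"), each = 3),
  sample     = rep(c("x", "y", "z"), times = 3),
  expression = c(9.2, 1.1, 5.3,  7.1, 2.2, 5.8,  8.3, 2.9, 4.1)
)

# Reshape long -> wide: rows are genes, columns are samples.
m1 <- tapply(toy1$expression, list(toy1$gene, toy1$sample), mean)
m2 <- tapply(toy2$expression, list(toy2$gene, toy2$sample), mean)

# Keep only genes present in both versions, in the same row order.
genes <- intersect(rownames(m1), rownames(m2))
m1 <- m1[genes, , drop = FALSE]
m2 <- m2[genes, , drop = FALSE]

# d[i, j] = squared Euclidean distance between toy1 sample i and toy2 sample j,
# summed over all shared genes.
d <- sapply(colnames(m2), function(j) colSums((m1 - m2[, j])^2))

# For each toy1 sample, the toy2 sample with the smallest distance.
matches <- data.frame(
  sample  = rownames(d),
  sample2 = colnames(d)[apply(d, 1, which.min)]
)

# Attach the match to every row of the long data frame.
toy1$sample2 <- matches$sample2[match(toy1$sample, matches$sample)]
matches
```

With these toy values, s1 pairs with y, s2 with z and s3 with x. If the two versions differ by scaling or normalisation rather than small noise, correlation may be a better similarity measure: cor(m1, m2) returns a sample-by-sample correlation matrix, and which.max would replace which.min. Note that nothing here forces a one-to-one pairing, so two samples can end up with the same sample2.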