find closest match in gene expression data

I am analyzing a dataset and need to find matching samples between two versions of the data.
They (should) contain the same expression data, but they have different sample identifiers. Let's say the first dataframe looks like this:

   gene sample expression
1     a      a          1
2     a      b          2
3     a      c          3
4     a      d          4
5     a      e          5
6     a      f          6
7     a      g          7
8     a      h          8
9     a      i          9
10    a      j         10
11    a      k         11
12    a      l         12
13    a      m         13
14    a      n         14

I made the dataframe for one gene, but you can imagine that this is a large dataset containing ~20k genes. What I need to do is find the closest match in gene expression so I know which samples correspond. The second dataframe might look like this:

   gene sample expression
1     a      z        1.5
2     a      y        2.5
3     a      x          3
4     a      w        4.5
5     a      v        5.7
6     a      u        6.2
7     a      t        7.8
8     a      s        8.1
9     a      r        9.8
10    a      q       10.5
11    a      p         11
12    a      o         12
13    a      2       13.3
14    a      4       14.4

What I need to do is write a function (or something like that) that tries to match the expression of genes in one dataframe as closely as possible (for all genes) to the other and reports the sample identifiers with the closest match. I'm quite new to R and could use a little help.

I would like the output to look like this:

   gene sample expression sample2
1     a      a          1       z
2     a      b          2       y
3     a      c          3       x
4     a      d          4       w
5     a      e          5       v
6     a      f          6       u
7     a      g          7       t
8     a      h          8       s
9     a      i          9       r
10    a      j         10       q
11    a      k         11       p
12    a      l         12       o
13    a      m         13       2
14    a      n         14       4

That is, an extra column per sample that specifies the closest match in gene expression across all genes. The extra column must be created based on all genes and not on one gene.

>Solution :

Here are two options. In your example, it looks like there are always whole number matches, so you could join by whole number. Alternatively, you could try to extract the closest number. I use floor because it looks like you want 1.5 to be joined to 1 and not 2.
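
Both snippets assume that `df1` and `df2` hold the two tables from the question. A minimal way to recreate them is sketched here with base data frames; this is an assumption about how the data is stored, and printing a plain data frame will not show the tibble header that appears in the output.

```r
# Recreate the two example tables from the question (one gene, 14 samples each).
df1 <- data.frame(
  gene = "a",
  sample = letters[1:14],            # "a" to "n"
  expression = as.numeric(1:14),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  gene = "a",
  sample = c("z", "y", "x", "w", "v", "u", "t", "s", "r", "q", "p", "o", "2", "4"),
  expression = c(1.5, 2.5, 3, 4.5, 5.7, 6.2, 7.8, 8.1, 9.8, 10.5, 11, 12, 13.3, 14.4),
  stringsAsFactors = FALSE
)
```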

library(tidyverse)


#extract closest whole number
df1 |>
  mutate(sample2 = map_chr(expression, 
                           \(x)df2$sample[which.min(abs(x - floor(df2$expression)))]))
#> # A tibble: 14 x 4
#>    gene  sample expression sample2
#>    <chr> <chr>       <dbl> <chr>  
#>  1 a     a               1 z      
#>  2 a     b               2 y      
#>  3 a     c               3 x      
#>  4 a     d               4 w      
#>  5 a     e               5 v      
#>  6 a     f               6 u      
#>  7 a     g               7 t      
#>  8 a     h               8 s      
#>  9 a     i               9 r      
#> 10 a     j              10 q      
#> 11 a     k              11 p      
#> 12 a     l              12 o      
#> 13 a     m              13 2      
#> 14 a     n              14 4

#join by whole number
left_join(df1, 
          df2 |>
            mutate(expression = floor(expression)) |>
            select(sample2 = sample, expression),
          by = "expression")
#> # A tibble: 14 x 4
#>    gene  sample expression sample2
#>    <chr> <chr>       <dbl> <chr>  
#>  1 a     a               1 z      
#>  2 a     b               2 y      
#>  3 a     c               3 x      
#>  4 a     d               4 w      
#>  5 a     e               5 v      
#>  6 a     f               6 u      
#>  7 a     g               7 t      
#>  8 a     h               8 s      
#>  9 a     i               9 r      
#> 10 a     j              10 q      
#> 11 a     k              11 p      
#> 12 a     l              12 o      
#> 13 a     m              13 2      
#> 14 a     n              14 4
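
Both options above match on a single gene. Since the question asks for a match based on all genes, one possible extension is to reshape each version into a gene-by-sample matrix and, for every sample in the first version, pick the sample in the second version with the smallest Euclidean distance over the shared genes. The sketch below uses base R only; the toy data `v1` and `v2` and the helpers `to_matrix()` and `match_samples()` are made up for illustration, and Euclidean distance is just one reasonable choice (correlation would be another).

```r
# Toy data: two genes, three samples per version (made-up values).
v1 <- data.frame(
  gene = rep(c("a", "b"), each = 3),
  sample = rep(c("s1", "s2", "s3"), times = 2),
  expression = c(1, 5, 9, 2, 7, 4),
  stringsAsFactors = FALSE
)
v2 <- data.frame(
  gene = rep(c("a", "b"), each = 3),
  sample = rep(c("x", "y", "z"), times = 2),
  expression = c(9.2, 1.1, 5.3, 4.1, 2.2, 6.8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: long format -> matrix with genes in rows, samples in columns.
to_matrix <- function(df) {
  tapply(df$expression, list(df$gene, df$sample), mean)
}

# Hypothetical helper: for each sample in d1, find the nearest sample in d2.
match_samples <- function(d1, d2) {
  m1 <- to_matrix(d1)
  m2 <- to_matrix(d2)
  # Compare only genes present in both versions, in the same row order.
  genes <- intersect(rownames(m1), rownames(m2))
  m1 <- m1[genes, , drop = FALSE]
  m2 <- m2[genes, , drop = FALSE]
  best <- vapply(colnames(m1), function(s) {
    # Subtracting a vector from a matrix recycles it down each column,
    # so this is the distance from sample s to every sample in d2.
    d <- sqrt(colSums((m2 - m1[, s])^2, na.rm = TRUE))
    colnames(m2)[which.min(d)]
  }, character(1))
  data.frame(sample = colnames(m1), sample2 = unname(best),
             stringsAsFactors = FALSE)
}

result <- match_samples(v1, v2)
print(result)  # s1 -> y, s2 -> z, s3 -> x
```

The result can be joined back onto the first dataframe by `sample` to get the extra `sample2` column. Note that this sketch does not force a one-to-one pairing, so two samples could in principle be assigned the same partner; if that happens, the distances themselves are worth inspecting.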