How to remove columns that contain all the same value

March 14, 2023

I have count data (columns) in the form of presence/absence (1/0) of various genes in different samples that belong to one of two categories. I am running Fisher's exact test (fisher.test) for each gene, but I get an error whenever a gene is present (1) in all samples or absent (0) from all samples. How can I remove or skip these columns, or have fisher.test skip these genes and keep going?

Here is my sample data:

mydata <- data.frame(sampleID = c("A", "B", "C", "D", "E", "F", "G"),
                     category = c("high", "low", "high", "high", "low", "high", "low"),
                     Gene1 = c(1, 1, 0, 0, 0, 1, 1),
                     Gene2 = c(0, 1, 1, 1, 1, 1, 0),
                     Gene3 = c(0, 0, 0, 1, 1, 1, 1),
                     Gene4 = c(1, 1, 1, 1, 1, 1, 1))

Here is the loop code that someone helped me design, which applies fisher.test to each gene:

library(dplyr)
library(tidyr)
library(broom)

mydata %>%
  select(-sampleID) %>%
  pivot_longer(cols = -category, names_to = "gene") %>%
  group_by(gene) %>%
  summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
  unnest(fisher_test) %>%
  # fisher.test() already reports the odds ratio as `estimate`, so no exp() is needed
  select(-method, -alternative)

The error message I get when it encounters a gene that is present or absent from all samples:

Caused by error in `fisher.test()`:
! 'x' must have at least 2 rows and columns
Run `rlang::last_error()` to see where the error occurred.

Where in the loop above can I insert a step that removes or skips these genes?

Note: It is not feasible to omit the genes manually, as there are hundreds of them.

>Solution :
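
The error comes from the shape of the contingency table: when a gene takes only one value, `table(category, value)` has a single column, and `fisher.test()` requires at least two rows and two columns. A minimal base R sketch of this, using the `category` and `Gene4` values from the question (the variable names `gene4`, `tab`, and `msg` are only illustrative):

```r
# Values copied from the question's sample data
category <- c("high", "low", "high", "high", "low", "high", "low")
gene4    <- c(1, 1, 1, 1, 1, 1, 1)

# A constant gene yields a 2 x 1 table instead of 2 x 2
tab <- table(category, gene4)
print(dim(tab))

# Capture the error message instead of stopping
msg <- tryCatch(
  {
    fisher.test(tab)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

So any fix only needs to make sure constant genes never reach `fisher.test()`.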

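We could drop the constant columns with select() before reshaping. Here where(~ is.numeric(.x) && n_distinct(.x) == 1) matches every numeric column that has a single distinct value, and ! negates that selection: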

library(dplyr)
library(tidyr)
library(broom)  # provides tidy()
mydata %>%
  # Drop sampleID and any numeric column with only one distinct value
  select(!where(~ is.numeric(.x) && n_distinct(.x) == 1), -sampleID) %>%
  pivot_longer(cols = -category, names_to = "gene") %>%
  group_by(gene) %>%
  summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
  unnest(fisher_test) %>%
  # fisher.test() already reports the odds ratio as `estimate`, so no exp() is needed
  select(-method, -alternative)

-output

# A tibble: 3 × 5
  gene  estimate p.value conf.low conf.high
  <chr>    <dbl>   <dbl>    <dbl>     <dbl>
1 Gene1    1.81        1  0.0469      176. 
2 Gene2    0.707       1  0.00640      78.2
3 Gene3    1.81        1  0.0469      176. 
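
Another option, sketched here as an assumption rather than as part of the answer above, is to keep every column and drop the constant genes inside the loop itself, with a grouped `filter()` placed right after `group_by(gene)`. The name `results` is only illustrative. The sketch leaves out the `exp()` step because `fisher.test()` documents its `estimate` as an odds ratio already.

```r
library(dplyr)
library(tidyr)
library(broom)

# Sample data from the question
mydata <- data.frame(sampleID = c("A", "B", "C", "D", "E", "F", "G"),
                     category = c("high", "low", "high", "high", "low", "high", "low"),
                     Gene1 = c(1, 1, 0, 0, 0, 1, 1),
                     Gene2 = c(0, 1, 1, 1, 1, 1, 0),
                     Gene3 = c(0, 0, 0, 1, 1, 1, 1),
                     Gene4 = c(1, 1, 1, 1, 1, 1, 1))

results <- mydata %>%
  select(-sampleID) %>%
  pivot_longer(cols = -category, names_to = "gene") %>%
  group_by(gene) %>%
  # Keep only genes whose values vary across samples (not all 0 or all 1)
  filter(n_distinct(value) > 1) %>%
  summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
  unnest(fisher_test) %>%
  select(-method, -alternative)

print(results)
```

Filtering after the reshape works on the long-format `value` column, so it does not depend on the gene columns being numeric or on their names.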