Recursive method to calculate the percentage of repeated values for each column of my df in R

I need to use lapply/sapply, or another method that iterates over the columns of my real df, to calculate how many repeated values each column/variable has.

Here is a small example that reproduces my case:

library(dplyr)

df <- data.frame(
var1 = c(1,2,3,4,5,6,7,8,9,10 ),
var2 = c(1,1,2,3,4,5,6,7,9,10 ),
var3 = c(1,1,1,2,3,4,5,6,7,8 ),
var4 = c(2,2,1,1,2,1,1,2,1,2 ),
var5 = c(1,1,1,1,1,4,5,5,6,7 ),
var6 = c(4,4,4,5,5,5,5,5,5,5 )   
)

My dataset has nrow(df) rows, and I need to obtain the percentage of repeated values for each column. My real df has many columns, so I need to do this for every column automatically. I tried to use lapply/sapply, but it didn't work:


# create function that is used in lapply
perc_repeated <- function(variables){
  
  paste(round((sum(table(df$variables)-1) / nrow(df))*100,2),"%")
  
}

perce_repeated_values <- lapply(df, perc_repeated) 
perce_repeated_values

How can I do this efficiently if my data frame grows to something like 700 columns, applying a function to each column and collecting the results in a data frame ordered from largest to smallest (e.g. from a variable with 100% repeated values down to one with 0%)? Something like:

df_repeated

variable      perc_repeated_values
var6                    80%
var4                    80%
var5                    50%
var3                    20%
var2                    10%
var1                     0%

>Solution :

This can be done with dplyr::summarise() and across(), followed by tidyr::pivot_longer() to get one row per variable:

library(tidyverse)

df <- data.frame(
  var1 = c(1,2,3,4,5,6,7,8,9,10 ),
  var2 = c(1,1,2,3,4,5,6,7,9,10 ),
  var3 = c(1,1,1,2,3,4,5,6,7,8 ),
  var4 = c(2,2,1,1,2,1,1,2,1,2 ),
  var5 = c(1,1,1,1,1,4,5,5,6,7 ),
  var6 = c(4,4,4,5,5,5,5,5,5,5 )   
)

df %>% 
  summarise(across(everything(),
                   ~100 * (1 - length(unique(.x))/length(.x)))) %>% 
  pivot_longer(everything(), 
               names_to = "var", 
               values_to = "percent_repeated") %>% 
  arrange(desc(percent_repeated))
#> # A tibble: 6 x 2
#>   var   percent_repeated
#>   <chr>            <dbl>
#> 1 var4                80
#> 2 var6                80
#> 3 var5                50
#> 4 var3                20
#> 5 var2                10
#> 6 var1                 0

Created on 2022-01-09 by the reprex package (v2.0.1)
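
For readers who prefer to stay with sapply, as the question originally attempted, here is a minimal base R sketch of the same calculation. The likely reason the attempt in the question fails is that lapply passes each column to the function as a vector, while `df$variables` looks for a column literally named "variables" and returns NULL. The sketch below works on the function argument directly. The helper name `perc_repeated` and the output column names are taken from the question; rounding to two decimals is an assumption made for tidy output.

```r
# Example data from the question
df <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  var2 = c(1, 1, 2, 3, 4, 5, 6, 7, 9, 10),
  var3 = c(1, 1, 1, 2, 3, 4, 5, 6, 7, 8),
  var4 = c(2, 2, 1, 1, 2, 1, 1, 2, 1, 2),
  var5 = c(1, 1, 1, 1, 1, 4, 5, 5, 6, 7),
  var6 = c(4, 4, 4, 5, 5, 5, 5, 5, 5, 5)
)

# Percentage of values in x that repeat an earlier value.
# x is the column itself, which is what lapply()/sapply() pass in,
# so there is no need to look it up in df again.
perc_repeated <- function(x) {
  round(100 * (1 - length(unique(x)) / length(x)), 2)
}

# sapply() returns a named numeric vector with one entry per column
perc <- sapply(df, perc_repeated)

# Collect the results in a data frame, sorted from most to least repeated
df_repeated <- data.frame(
  variable = names(perc),
  perc_repeated_values = unname(perc)
)
df_repeated <- df_repeated[order(-df_repeated$perc_repeated_values), ]
rownames(df_repeated) <- NULL

df_repeated
```

The function never refers to `df`, so the same call should work unchanged on a data frame with 700 columns. The values are kept numeric rather than pasted with "%" so that the result can be sorted correctly.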
