Using a loop to repeat the same function for different datasets

I used a list to hold 4 datasets. Now I want to list all potential ID variables in each dataset. My criteria are: 1) more than 80% of the variable's observations are unique; 2) no more than 30% of its values are missing.

To get those statistics, I first use the skim() function from the skimr package in R to get a tibble containing all the information, then I use filter() to keep the variables that meet the two criteria above. Here is my code:

library(dplyr)
library(skimr)

dfa <- dflist[[1]] %>%
      mutate_if(is.numeric, as.character) %>%   # so numeric columns also get character.n_unique
      skim() %>%
      as_tibble() %>%
      filter(character.n_unique >= nrow(dflist[[1]]) * 0.01) %>%
      filter(n_missing <= nrow(dflist[[1]]) * 0.30)

This code works fine and returns the expected variables for dataset 1. However, I have 4 datasets of different sizes, so I would like to put it into a loop. Here is my attempt.
First, I create a list, dfid, to hold the new results, since I do not want dflist to be modified. Then I changed the 1 in dflist[[1]] in the previous code to i. But this code does not work; R reports:
"Error in filter(., dflist[[i]][, character.n_unique] >= nrow(dflist[[1]]) * :
Caused by error in [.data.frame:
! undefined columns selected".

Here is my code:

dfid<-list()
for (i in 1:4){
    dfid[[i]]<-dflist[[i]]%>%
            mutate_if(is.numeric,as.character)%>%
            skim()%>%
            as_tibble()%>%
            filter(dflist[[i]][,character.n_unique] >=nrow(dflist[[i]])*0.01)%>%
            filter(dflist[[i]][,n_missing]<=nrow(dflist[[i]])*0.30)
}

So my questions are:

  1. How can I fix this error so that the loop works?
  2. Once dfid[[i]] holds the desired variables from the 4 datasets, what code should I add to the loop to combine the 4 results, remove duplicate variable names, and finally get a vector of variable names from the combined result?

Thanks a lot for your help in advance~~!

>Solution :

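With [, a column name must be a quoted string unless it is an object holding the name. Here, though, character.n_unique and n_missing are columns of the skim() summary coming through the pipe, not of dflist[[i]], so filter() can refer to them directly. It may be easier to loop with map/lapply: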

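library(purrr)
library(dplyr)
library(skimr)
# nrow(.x) is the row count of the original dataset; inside filter(), n() would
# instead count the rows of the skim() summary (one row per variable)
dfid <- map(dflist, ~ .x %>%
      mutate(across(where(is.numeric), as.character)) %>%
      skim() %>%
      as_tibble() %>%
      # 0.01 as in the question's code; 0.80 would match the stated 80% criterion
      filter(character.n_unique >= nrow(.x) * 0.01) %>%
      filter(n_missing <= nrow(.x) * 0.30))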
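
To see why the original loop fails: a bare name such as character.n_unique is evaluated to that column's values, and [ then treats those values as column positions in the raw dataset. The following is a minimal base R sketch of that failure, not the asker's actual data; the toy data frame `df` and the hand-made `summary_tbl` (standing in for the skim() output) are assumptions for illustration.

```r
# Hypothetical toy data; `summary_tbl` stands in for the skim() output
df <- data.frame(
  id  = 1:5,
  grp = c("a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)
summary_tbl <- data.frame(
  variable = names(df),
  n_unique = c(5L, 2L),
  stringsAsFactors = FALSE
)

# with() evaluates n_unique to c(5, 2), so this asks for columns 5 and 2 of df;
# df has only 2 columns, so `[` raises an error, which is captured as text here
err <- tryCatch(
  with(summary_tbl, df[, n_unique]),
  error = function(e) conditionMessage(e)
)
print(err)

# Referring to the summary column directly works; nrow(df) is the raw row count
keep <- summary_tbl$variable[summary_tbl$n_unique >= nrow(df) * 0.80]
print(keep)
```

Only `id` passes the 80% threshold in this toy case, since `grp` has 2 unique values out of 5 rows.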

The same fix works in a for loop: because the skim() result is piped into filter(), its columns can be referenced without [:

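dfid <- vector('list', length(dflist))
for (i in seq_along(dflist)) {
    tmp <- dflist[[i]]
    # nrow(tmp) is the row count of the original dataset; n() inside filter()
    # would count the rows of the skim() summary instead
    dfid[[i]] <- tmp %>%
            mutate_if(is.numeric, as.character) %>%
            skim() %>%
            as_tibble() %>%
            filter(character.n_unique >= nrow(tmp) * 0.01) %>%
            filter(n_missing <= nrow(tmp) * 0.30)
}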
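
For the second question, each element of dfid is a skim() summary whose skim_variable column holds the variable names, so the names can be collected and de-duplicated after the loop rather than inside it. Here is a minimal base R sketch, assuming hand-made stand-ins for the four results; `dfid_demo` and the variable names in it are hypothetical.

```r
# Hypothetical stand-ins for the four filtered skim() results;
# real skim() output stores the variable names in `skim_variable`
dfid_demo <- list(
  data.frame(skim_variable = c("patient_id", "visit_id"), stringsAsFactors = FALSE),
  data.frame(skim_variable = c("patient_id", "sample_id"), stringsAsFactors = FALSE),
  data.frame(skim_variable = character(0), stringsAsFactors = FALSE),
  data.frame(skim_variable = "visit_id", stringsAsFactors = FALSE)
)

# Pull the name column from each result, flatten to one vector, drop duplicates
id_vars <- unique(unlist(lapply(dfid_demo, function(d) d$skim_variable)))
print(id_vars)
```

Applied to the real list, the same line with dfid in place of dfid_demo should give the vector of candidate ID variables. If it matters which dataset each name came from, dplyr's bind_rows(dfid, .id = "dataset") keeps all rows in one tibble with a column recording the source.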