loop over factors and numerics to calculate their means

I am trying to create a function that loops over my entire data frame. If a column is numeric, it should return the mean; if the column is a factor, it will have to do a little more to get the overall mean. At the moment, I am less concerned about the frequencies of the categories in the factor (I have research reasons for this). So far, I have been able to cobble some of this together, but I know it is nowhere near where it needs to be. Here is my code so far:

#basic data frame 3 variables
dat = data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22), 
                     "sex" = c(0,1,1,0,0))

mean(dat$sex)
mean(dat$age)

#converting sex into a factor
dat[,3] = as.factor(dat[,3]) 

#working on the if structure to calculate the mean for all of the variables

me_func = function(x){
for (i in seq_along(x)){
if (is.factor(x)==TRUE){
  return(mean(as.numeric(as.character(x), na.rm=TRUE)))
} else {
  return(mean(x), na.rm=TRUE)
}
}
}
me_func(dat)

Because I am still learning to code in R, I know I am missing a lot. My intent is for the function call to take the data frame name as its input. When I use this in my research, I will have much larger data frames, so listing out the column names would be cumbersome. This also complicates things, because the id variable will have to be ignored to get this correct.

Ultimately, I need the function to return the proper means: 0.40 for the factor variable and 33.8 for the numeric variable. I need to learn this process, as it appears to be important for the data analyses I will be doing in the foreseeable future. I thought about colMeans, but that does not get me out of a loop or some type of apply. The factors would have to be coerced to numeric first, and the coercion may produce nonsensical means, because R seems to change the values when a factor is coerced (a 0 became a 2, for example), or at least it does in my extremely limited experience.

All I really want is the mean of every non-id column in the data frame. Does anyone have any ideas on how to make this work? If I have missed a post that already covers this, please feel free to point me in that direction. Thank you.

>Solution :

You can write me_func as a function that returns the mean of a single vector (remove the for loop), and then apply it to every column except the id column using sapply.

me_func = function(x){
  if (is.factor(x)==TRUE){
    return(mean(as.numeric(as.character(x)), na.rm=TRUE))
  } else {
    return(mean(x, na.rm=TRUE))
  }
}
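
The as.character step matters because calling as.numeric directly on a factor returns its internal level codes (1, 2, ...) rather than the labels, which is most likely the changed values described in the question. A minimal sketch of the difference, using a standalone vector that mirrors the sex column (the names codes and values are only for illustration):

```r
# Factor with levels "0" and "1", as in the question's sex column
sex <- factor(c(0, 1, 1, 0, 0))

# Direct coercion returns the internal level codes (1 and 2), not the labels
codes <- as.numeric(sex)

# Going through character first recovers the original values
values <- as.numeric(as.character(sex))

mean(codes)   # 1.4
mean(values)  # 0.4
```

This only works when the factor labels are themselves numbers; labels such as "male" or "female" would become NA with a warning.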

> sapply(dat[,-1], me_func)
 age  sex 
33.8  0.4 
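
For larger data frames, a possible variant is to drop the id column by name instead of by position and to use vapply, which checks that every column yields exactly one number. This is only a sketch under the question's setup; col_mean and id_cols are hypothetical names, not part of the original code:

```r
# Same data as in the question, with sex already stored as a factor
dat <- data.frame(index = 1:5,
                  age   = c(24, 25, 42, 56, 22),
                  sex   = factor(c(0, 1, 1, 0, 0)))

# Mean of one column; factors are converted via their labels first
col_mean <- function(x) {
  if (is.factor(x)) x <- as.numeric(as.character(x))
  mean(x, na.rm = TRUE)
}

# Columns to skip (assumed here to be just the id column)
id_cols <- "index"
keep <- setdiff(names(dat), id_cols)

# vapply errors out if any column does not return a single numeric value
means <- vapply(dat[keep], col_mean, numeric(1))
means
```

Excluding by name keeps the call correct even if the id column is not the first one.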