Error looping the ANOVA test in R in dataframe

I have the dataframe:

sub3 <- df1[, c("Attrition", "Age", "DistanceFromHome", "MonthlyIncome", "NumCompaniesWorked", "PercentSalaryHike", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany", "YearsSinceLastPromotion", "YearsWithCurrManager")]
sub3

Where Attrition is the response variable.

I am trying to run ANOVA tests in a loop in R to check the relationship between my response variable and each of the other variables. My code is:

df_num <- function(x) {
  aov <- aov(as.numeric(sub3$Attrition) ~ sub3[, x], data = sub3)

  res <- data.frame('row' = 'Attrition'
                , 'column' = colnames(sub3)[x]
                ,  "p.value" = summary(aov)[[1]][["Pr(>F)"]]
                )
  return(res)
}
num_df <- do.call(rbind, lapply(seq_along(sub3)[-1], df_num))
head(num_df)

But my result is:

        row           column      p.value
1 Attrition              Age 1.996802e-26
2 Attrition              Age           NA
3 Attrition DistanceFromHome 5.182860e-01
4 Attrition DistanceFromHome           NA
5 Attrition    MonthlyIncome 3.842748e-02
6 Attrition    MonthlyIncome           NA

I do not understand why the code is not running for all the variables in the dataset, or why Age, DistanceFromHome and MonthlyIncome are duplicated.

>Solution :

Your code probably does run for all the variables, but head() only displays the first 6 rows. Because each variable currently takes up two rows, those 6 rows cover just three variables. Run print(num_df) instead, which will display all entries.
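As a quick illustration of the head() behaviour, here is a minimal sketch on a made-up data frame (demo_rows is a hypothetical name, not part of the question):

```r
# Hypothetical 20-row data frame, only used to show how head() truncates.
demo_rows <- data.frame(id = 1:20)

# head() keeps the first 6 rows unless told otherwise.
default_rows <- nrow(head(demo_rows))

# Passing n asks for a different number of rows.
ten_rows <- nrow(head(demo_rows, n = 10))

print(c(default_rows, ten_rows))
```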

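The duplicated rows in num_df come from the ANOVA summary table, which has 2 rows: one for the model term and one for the residuals. Subsetting the Pr(>F) column therefore returns two values, the p-value and an NA for the residuals row, and data.frame() recycles 'Attrition' and the column name to fill both rows. You can test this yourself with the following, which computes the ANOVA for the pair of Attrition and Age: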

aov <- aov(as.numeric(sub3$Attrition) ~ sub3[, 2], data = sub3)
summary(aov)[[1]][["Pr(>F)"]]  # this will report the p-value, and a NA value
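If you do not have sub3 at hand, the same behaviour can be seen on R's built-in mtcars data. This is only a sketch: am and mpg are stand-ins chosen for illustration, not the columns from the question.

```r
# Illustration on the built-in mtcars data; `am` and `mpg` are stand-ins
# for Attrition and Age, not the asker's actual columns.
fit <- aov(am ~ mpg, data = mtcars)

# summary() returns a list; its first element is the ANOVA table.
tab <- summary(fit)[[1]]
print(tab)

# The table has one row for the term (mpg) and one for Residuals.
n_rows <- nrow(tab)

# The Residuals row has no F test, so its entry in Pr(>F) is NA.
p_values <- tab[["Pr(>F)"]]
print(p_values)
```

The second element of p_values is the NA that ends up as the extra row in num_df.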

To fix the duplication, you need to extract the first value from the Pr(>F) column, like so:

df_num <- function(x) {
  aov <- aov(as.numeric(sub3$Attrition) ~ sub3[, x], data = sub3)

  res <- data.frame('row' = 'Attrition'
                , 'column' = colnames(sub3)[x]
                ,  "p.value" = summary(aov)[[1]][["Pr(>F)"]][1]  # use only the first value of the p-value column
                )
  return(res)
}
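
For a version you can run end to end, here is a minimal sketch on simulated data. The demo data frame and the p_for_column name are assumptions made up for this example; only the structure (a response in the first column, numeric predictors after it) mirrors sub3.

```r
# Hypothetical stand-in for sub3: a two-level response plus two numeric columns.
set.seed(1)
demo <- data.frame(
  Attrition = factor(sample(c("No", "Yes"), 100, replace = TRUE)),
  Age = rnorm(100, mean = 40, sd = 8),
  MonthlyIncome = rnorm(100, mean = 5000, sd = 1500)
)

# Fit a one-way ANOVA of the response on column x and return one result row.
p_for_column <- function(x) {
  fit <- aov(as.numeric(demo$Attrition) ~ demo[[x]])
  data.frame(
    row = "Attrition",
    column = colnames(demo)[x],
    p.value = summary(fit)[[1]][["Pr(>F)"]][1]  # term row only, skip Residuals
  )
}

# Loop over every column except the first (the response) and stack the rows.
result <- do.call(rbind, lapply(seq_along(demo)[-1], p_for_column))
print(result)
```

With the [1] in place, result has exactly one row per predictor, so head() on it would show six different variables rather than three duplicated ones.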