Error when looping an ANOVA test over dataframe columns in R

I have the dataframe:

sub3 <- df1[, c("Attrition", "Age", "DistanceFromHome", "MonthlyIncome", "NumCompaniesWorked", "PercentSalaryHike", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany", "YearsSinceLastPromotion", "YearsWithCurrManager")]
sub3

Where Attrition is the response variable.

I am trying to run ANOVA tests in a loop in R to check the relationship between my response variable and each of the other columns. My code is:

df_num <- function(x) {
  aov <- aov(as.numeric(sub3$Attrition) ~ sub3[, x], data = sub3)

  res <- data.frame('row' = 'Attrition'
                , 'column' = colnames(sub3)[x]
                ,  "p.value" = summary(aov)[[1]][["Pr(>F)"]]
                )
  return(res)
}
num_df <- do.call(rbind, lapply(seq_along(sub3)[-1], df_num))
head(num_df)

But my result is:

        row           column      p.value
1 Attrition              Age 1.996802e-26
2 Attrition              Age           NA
3 Attrition DistanceFromHome 5.182860e-01
4 Attrition DistanceFromHome           NA
5 Attrition    MonthlyIncome 3.842748e-02
6 Attrition    MonthlyIncome           NA

I do not understand why the code does not run for all the variables in the dataset, or why Age, DistanceFromHome and MonthlyIncome are duplicated.

>Solution :

Your code probably runs for all the variables, but you’re only displaying the first 6 rows because you call head! Since num_df is a plain data.frame, running print(num_df) (or just num_df) will display all entries.

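The reason for the duplicated values in num_df is that the ANOVA table returned by summary() has 2 rows, one for the predictor and one for the residuals, so subsetting the column Pr(>F) returns two values: the p-value and an NA for the residuals row. data.frame() then recycles the row and column names to match, which gives two rows per variable. You can test this for yourself by trying the following, which computes the ANOVA for the pair of Attrition and Age: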

aov <- aov(as.numeric(sub3$Attrition) ~ sub3[, 2], data = sub3)
summary(aov)[[1]][["Pr(>F)"]]  # this will report the p-value, and a NA value
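
If you don't have sub3 at hand, the same two-row structure can be seen on any dataset. Below is a minimal sketch that uses the built-in mtcars data as a stand-in (the mpg ~ wt model is only an illustration, not part of the original question):

```r
# Illustration only: mtcars stands in for sub3, and mpg ~ wt for Attrition ~ Age.
fit <- aov(mpg ~ wt, data = mtcars)

# summary() returns a list; its first element is the ANOVA table.
tab <- summary(fit)[[1]]
print(tab)  # one row for the predictor, one row for Residuals

# The Pr(>F) column has one entry per row of that table.
# The second entry belongs to the Residuals row and is NA.
p_values <- tab[["Pr(>F)"]]
print(p_values)
```

The Residuals row has no F statistic of its own, which is why its p-value is NA.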

To fix the duplication, you need to extract the first value from the Pr(>F) column, like so:

df_num <- function(x) {
  aov <- aov(as.numeric(sub3$Attrition) ~ sub3[, x], data = sub3)

  res <- data.frame('row' = 'Attrition'
                , 'column' = colnames(sub3)[x]
                ,  "p.value" = summary(aov)[[1]][["Pr(>F)"]][1]  # use only the first value of the p-value column
                )
  return(res)
}
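
For a version you can run end to end without sub3, here is a rough sketch of the same idea on the built-in mtcars data. The names dat and p_for_column are made up for this example, and it builds the formula from the column name with reformulate() instead of indexing with sub3[, x]; that is a stylistic swap, not something the fix above requires.

```r
# Hypothetical stand-in for sub3: a few mtcars columns, with `am` as the response.
dat <- mtcars[, c("am", "mpg", "hp", "wt")]

# One-way ANOVA of `am` against a single column, returning one row of results.
p_for_column <- function(col_name) {
  # reformulate() builds the formula am ~ <col_name> from strings
  fit <- aov(reformulate(col_name, response = "am"), data = dat)
  data.frame(row = "am",
             column = col_name,
             # keep only the first entry; the second is the NA from the Residuals row
             p.value = summary(fit)[[1]][["Pr(>F)"]][1])
}

# Apply to every column except the response and stack the one-row results.
result <- do.call(rbind, lapply(names(dat)[-1], p_for_column))
print(result)  # prints every row, unlike head()
```

Each predictor now contributes exactly one row to the result.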
