TukeyHSD returns too many values in R

I’m very new to R (and statistics) and I searched a lot for a possible solution, but couldn’t find any.

I have a data set with around 18000 entries and two columns: "rentals" and "season". I want to test whether the mean number of rentals differs between seasons, using a one-way ANOVA.

My data looks like this:

rentals season
23 1
12 1
17 2
16 2
44 3
22 3
2 4
14 4

First I calculate the SD and MEAN of the groups (season):

library(dplyr)

anova %>%
    group_by(season) %>%
    summarise(
        count_season = n(),
        mean_rentals = mean(rentals, na.rm = TRUE),
        sd_rentals = sd(rentals, na.rm = TRUE))

This is the result:

[Image: table of count, mean and standard deviation of rentals per season]

Then I perform the one-way ANOVA:

anova_one_way <- aov(season~as.factor(rentals), data = anova)
summary(anova_one_way)
# I use "as.factor" on rentals, because otherwise I'm getting an error with TukeyHSD

Result:
[Image: output of summary(anova_one_way)]

Here comes the tricky part. I perform a TukeyHSD test:

TukeyHSD(anova_one_way) 

And the results are very disappointing. TukeyHSD returns 376896 rows, while I expect it to return just a few, comparing the seasons with each other. It looks like every single "rentals" value is being handled as its own group. This seems very wrong, but I can’t find the cause. Is this normal TukeyHSD behaviour for a big data set, or is there an error in my code or logic that causes this enormous, unreadable output?

Here is a small excerpt of what it looks like (it goes on until row 376896):
[Image: excerpt of the TukeyHSD output]

Solution:

The terms are the wrong way around in your aov() call. Rentals is the outcome (dependent) variable, season is the predictor (independent) variable.

So you want:

anova_one_way <- aov(rentals ~ factor(season), data = anova)
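
To see the effect of the corrected formula end to end, here is a minimal sketch on simulated data. The data frame name `rentals_df` and all values in it are made up for illustration and are not the asker's data. With season as the grouping factor, TukeyHSD compares each pair of the four seasons, so it should return choose(4, 2) = 6 rows rather than one row per pair of distinct rentals values.

```r
# Simulated stand-in for the real data (hypothetical values, for illustration only)
set.seed(42)
rentals_df <- data.frame(
  season  = rep(1:4, each = 50),
  rentals = rpois(200, lambda = rep(c(15, 20, 30, 12), each = 50))
)

# Outcome (rentals) on the left of ~, grouping factor (season) on the right
fit <- aov(rentals ~ factor(season), data = rentals_df)
summary(fit)

# Pairwise comparisons between seasons
tukey <- TukeyHSD(fit)
tukey

# One row per pair of seasons
n_pairs <- nrow(tukey[["factor(season)"]])
n_pairs
```

The result of TukeyHSD is a list with one matrix per factor term, named after the term as written in the formula, which is why the element is accessed as "factor(season)" here.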