Calculating MAD in two different ways in R returns different results

(I have posted a similar question at Cross Validated, but I believe this is more fitting for Stack Overflow).


I have a large dataframe data with the following columns:

date        time        orig      new
2001-01-01  00:30:00    345       856
2001-01-01  00:32:43    4575      9261
2001-01-01  00:51:07    6453      2352
...
2001-01-01  23:57:51    421       168
2001-01-02  00:06:14    5612      3462
...
2001-01-31  23:49:11    14420     8992
2001-02-01  00:04:32    213       521
...

I want to calculate the monthly aggregated MAD, which can be calculated as mean(abs(orig - new)) when grouped by month. Ideally, at the end, I want the result (a dataframe) in the following form:

month       mad
2001-01-01  7452.124
2001-02-01  3946.734
2001-03-01  995.938
...

I calculated the monthly MAD in two different ways.

Approach 1

I grouped data by month and took the mean of the absolute differences (which is the "mathematical" way to do it, as I explained above):

data %>%
   group_by(
       month = lubridate::floor_date(date, 'month')
   ) %>%
   summarise(mad = mean(abs(orig - new)))

Approach 2

I grouped data by hour and got the MAD grouped by hour, and then re-grouped it by month and took an average. This is counter-intuitive, but I used the hourly grouped dataframe for other analyses and tried computing the monthly MAD from this dataframe directly.

data_grouped_by_hour <- data %>%
   group_by(
        day = lubridate::floor_date(date, 'day'),
        hour = as.POSIXlt(time)$hour
   ) %>%
   summarise(mad = mean(abs(orig - new)))

# data_grouped_by_hour has no date column any more, so group on day instead
data_grouped_by_hour %>%
   group_by(
       month = lubridate::floor_date(day, 'month')
   ) %>%
   summarise(mad = mean(mad))

As the title suggests, these approaches return different values. I assume my first approach is correct, as it is more concise and follows the definition directly, but I wonder why the second approach does not return the same value.

I want to note that I would prefer Approach 2 so that I don’t have to make a separate table for every analysis with a different time unit. Any insights are appreciated.

>Solution :

Because an average of averages is not the same as the overall average, unless every group contains the same number of values.

This is a common misconception. Let’s work through an example:

Consider a list with 2 elements a and b

x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))

Now, as in your question, we will take the average in two ways:

  1. Average of all the values of x
res1 <- mean(unlist(x))
res1
#[1] 4.25
  2. Average of each element separately, and then the average of those averages.
sapply(x, mean)
#       a        b 
#3.833333 5.500000 

res2 <- mean(sapply(x, mean))
res2
#[1] 4.666667

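Notice that res1 and res2 have different values because the second case is an average of averages: a has six values and b has only two, yet mean(sapply(x, mean)) gives both groups equal weight.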
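
As a small sketch of where the gap comes from: if each group mean is weighted by the number of values in that group, the overall mean comes back. The block redefines x so that it runs on its own; the names group_means, group_sizes, res_unweighted and res_weighted are only illustrative.

```r
# Same toy list as above: group a has 6 values, group b has 2
x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))

group_means <- sapply(x, mean)   # mean of each group
group_sizes <- lengths(x)        # number of values in each group

# Unweighted mean of means: every group counts equally
res_unweighted <- mean(group_means)

# Weighted by group size: every original value counts equally
res_weighted <- weighted.mean(group_means, w = group_sizes)

res_weighted
#[1] 4.25
```

With the weights in place, the result matches mean(unlist(x)).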

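The same logic applies in your case: Approach 2 takes the hourly average first and then the monthly average of those values, which is an average of averages. An hour with a single row counts as much as an hour with hundreds of rows.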
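
If you still want to work from the hourly table, one possible approach is to keep the row count of each hour and use it as a weight when aggregating to months. This is a minimal sketch on made-up data, not your actual dataframe: it assumes the timestamp lives in a single POSIXct column called datetime (your data has separate date and time columns), and the column n and the result names are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; a single POSIXct column is an assumption
data <- tibble(
  datetime = as.POSIXct(c("2001-01-01 00:30:00", "2001-01-01 00:32:43",
                          "2001-01-01 00:51:07", "2001-01-01 23:57:51",
                          "2001-01-02 00:06:14"), tz = "UTC"),
  orig = c(345, 4575, 6453, 421, 5612),
  new  = c(856, 9261, 2352, 168, 3462)
)

# Hourly table: keep the number of rows per hour next to the hourly MAD
data_grouped_by_hour <- data %>%
  group_by(hour = floor_date(datetime, "hour")) %>%
  summarise(mad = mean(abs(orig - new)), n = n())

# Monthly MAD from the hourly table, weighting each hour by its row count
monthly_from_hourly <- data_grouped_by_hour %>%
  group_by(month = floor_date(hour, "month")) %>%
  summarise(mad = weighted.mean(mad, n))

# Direct monthly MAD (Approach 1) for comparison
monthly_direct <- data %>%
  group_by(month = floor_date(datetime, "month")) %>%
  summarise(mad = mean(abs(orig - new)))
```

On this toy data monthly_from_hourly and monthly_direct hold the same mad value, so the hourly table can be reused for other time units as long as n is carried along.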
