Home Calculating MAD in two different ways in R return different results

Questions

Calculating MAD in two different ways in R return different results

December 15, 2022

(I have posted a similar question at Cross Validated, but I believe this is more fitting for Stack Overflow).

I have a large dataframe data with following columns:

date        time        orig      new
2001-01-01  00:30:00    345       856
2001-01-01  00:32:43    4575      9261
2001-01-01  00:51:07    6453      2352
...
2001-01-01  23:57:51    421       168
2001-01-02  00:06:14    5612      3462
...
2001-01-31  23:49:11    14420     8992
2001-02-01  00:04:32    213       521
...

I want to calculate the monthly aggregated MAD, which can be calculated by mean(abs(orig - new)) when grouped by month. Ideally, at the end, I want the solutions (dataframe) in a following form:

month       mad
2001-01-01  7452.124
2001-02-01  3946.734
2001-03-01  995.938
...

I calculated the monthly MAD in two different ways.

Approach 1

I grouped data by month and took an average of the summed absolute differences (which is a "mathematical" way to do it, as I explained):

data %>%
   group_by(
       month = lubridate::floor_date(date, 'month')
   ) %>%
   summarise(mad = mean(abs(orig - new)))

Approach 2

I grouped data by hour and got the MAD grouped by hour, and then re-grouped it by month and took an average. This is counter-intuitive, but I used the hourly grouped dataframe for other analyses and tried computing the monthly MAD from this dataframe directly.

data_grouped_by_hour <- data %>%
   group_by(
        day = lubridate::floor_date(date, 'day'),
        hour = as.POSIXlt(time)$hour
   ) %>%
   summarise(mad = mean(abs(orig - new)))

data_grouped_by_hour %>%
   group_by(
       month = lubridate::floor_date(date, 'month')
   ) %>%
   summarise(mad = mean(mad))

As hinted from the post title, these approaches return different values. I assume my first approach is correct, as it is more concise and follows the accurate concept, but I wonder why the second approach does not return the same value.

I want to note that I would prefer Approach 2 so that I don’t have to make separate tables for every analysis with different time unit. Any insights are appreciated.

>Solution :

Because average of average is not the same as complete average.

This is a common misconception. Let’s try to understand with the help of an example –

Consider a list with 2 elements a and b

x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))

Now, similar to your question we will take average in 2 ways –

Average of all the values of x

res1 <- mean(unlist(x))
res1
#[1] 4.25

Average of each element separately and then complete average.

sapply(x, mean)
#       a        b 
#3.833333 5.500000 

res2 <- mean(sapply(x, mean))
res2
#[1] 4.666667

Notice that res1 and res2 has different values because the 2nd case is average of averages.

The same logic applies in your case as well when you take daily average and then monthly which is average of averages.

mean

byMR

Published December 15, 2022

Add a comment

How "2" < "123" is false in Javascript but not 2 < "123"?

byMR

December 15, 2022

Questions

Cant initialize a 2d array/matrix to 0

byMR

December 15, 2022

Questions

Django Haystack update index only for 1 model

byMR

December 15, 2022

Questions

Get value from asymmetrically nested dict with unique keys

byMR

December 15, 2022

Questions

What do the third and fourth arguments of the mat-palette function do (not the first/second) and how do I use those for my own components

byMR

December 15, 2022

Questions

dplyr filter based on conditions across and within column

byMR

December 15, 2022

Calculating MAD in two different ways in R return different results

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Like this:

Leave a ReplyCancel reply

Read more

How "2" < "123" is false in Javascript but not 2 < "123"?

Cant initialize a 2d array/matrix to 0

Django Haystack update index only for 1 model

Get value from asymmetrically nested dict with unique keys

What do the third and fourth arguments of the mat-palette function do (not the first/second) and how do I use those for my own components

dplyr filter based on conditions across and within column

Keep Up to Date with the Most Important News

Calculating MAD in two different ways in R return different results

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Share this:

Like this:

Leave a ReplyCancel reply

Keep Up to Date with the Most Important News

Read more

How "2" < "123" is false in Javascript but not 2 < "123"?

Cant initialize a 2d array/matrix to 0

Django Haystack update index only for 1 model

Get value from asymmetrically nested dict with unique keys

What do the third and fourth arguments of the mat-palette function do (not the first/second) and how do I use those for my own components

dplyr filter based on conditions across and within column

Discover more from Dev solutions