aggregating and counting data table column based on condition

September 27, 2023

I have a dataframe which looks like

structure(list(treaty = c("abc", "de", "abc", "de", "de"), SEX1 = c("M", 
"F", "F", "M", "F"), age = c(20, 30, 40, 50, 70)), class = "data.frame", row.names = c(NA, 
-5L))

I convert this into a datatable as the size is huge and perform operations on it. i want to be able to aggregate and calculate to get the following output on the data table

Output:

treaty	Prop (M)	Prop (F)	Avg Age
abc	1/2	1/2	30
de	1/3	2/3	50

basically 1. group by treaty id 2. show proportion of M and F based on each treaty total for the datatable and average age.

Thanks in advance

>Solution :

Assigning your dput structure as df, using data.table you can do:

dt <- data.table::data.table(df)

dt[, .(propM = mean(SEX1 == "M"), 
       propF = mean(SEX1 == "F"), 
       AveAge = mean(age)),
   .(treaty)]

Output

   treaty     propM     propF AveAge
1:    abc 0.5000000 0.5000000     30
2:     de 0.3333333 0.6666667     50

Or using dplyr you could simply summarize all of these operations by treaty:

df %>%
  summarize(propM = mean(SEX1 == "M"),
            propM = mean(SEX1 == "F"),
            AveAge = mean(age),
            .by = treaty)

Output:

  treaty     propM     propF AveAge
1    abc 0.5000000 0.5000000     30
2     de 0.3333333 0.6666667     50

It looks like in this case, dplyr may be a bit faster:

n <- 1e6
set.seed(123)

test_df <- data.frame(treaty = sample(c("abc", "de"), n, replace = TRUE),
                      SEX1 = sample(c("M", "F"), n, replace = TRUE),
                      age = sample(10:50, n, replace = TRUE))
test_dt <- data.table::data.table(test_df)

microbenchmark::microbenchmark(
  datatable = test_dt[, .(propM = mean(SEX1 == "M"), 
                     propF = mean(SEX1 == "F"), 
                     AveAge = mean(age)),
                 .(treaty)],
  dplyr = test_df %>%
    summarize(propM = mean(SEX1 == "M"),
              propF = mean(SEX1 == "F"),
              AveAge = mean(age),
              .by = treaty)
)

#Unit: milliseconds
#      expr      min       lq     mean   median       uq      max neval
# datatable 52.12843 59.04811 76.12773 64.93694 77.99761 364.6247   100
#     dplyr 40.47459 50.06535 62.74962 57.95659 65.37164 313.1679   100