Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

aggregating and counting data table column based on condition

I have a dataframe which looks like

structure(list(treaty = c("abc", "de", "abc", "de", "de"), SEX1 = c("M", 
"F", "F", "M", "F"), age = c(20, 30, 40, 50, 70)), class = "data.frame", row.names = c(NA, 
-5L))

I convert this into a datatable as the size is huge and perform operations on it. i want to be able to aggregate and calculate to get the following output on the data table

Output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

treaty Prop (M) Prop (F) Avg Age
abc 1/2 1/2 30
de 1/3 2/3 50

basically 1. group by treaty id 2. show proportion of M and F based on each treaty total for the datatable and average age.

Thanks in advance

>Solution :

Assigning your dput structure as df, using data.table you can do:

dt <- data.table::data.table(df)

dt[, .(propM = mean(SEX1 == "M"), 
       propF = mean(SEX1 == "F"), 
       AveAge = mean(age)),
   .(treaty)]

Output

   treaty     propM     propF AveAge
1:    abc 0.5000000 0.5000000     30
2:     de 0.3333333 0.6666667     50

Or using dplyr you could simply summarize all of these operations by treaty:

df %>%
  summarize(propM = mean(SEX1 == "M"),
            propM = mean(SEX1 == "F"),
            AveAge = mean(age),
            .by = treaty)

Output:

  treaty     propM     propF AveAge
1    abc 0.5000000 0.5000000     30
2     de 0.3333333 0.6666667     50

It looks like in this case, dplyr may be a bit faster:

n <- 1e6
set.seed(123)

test_df <- data.frame(treaty = sample(c("abc", "de"), n, replace = TRUE),
                      SEX1 = sample(c("M", "F"), n, replace = TRUE),
                      age = sample(10:50, n, replace = TRUE))
test_dt <- data.table::data.table(test_df)

microbenchmark::microbenchmark(
  datatable = test_dt[, .(propM = mean(SEX1 == "M"), 
                     propF = mean(SEX1 == "F"), 
                     AveAge = mean(age)),
                 .(treaty)],
  dplyr = test_df %>%
    summarize(propM = mean(SEX1 == "M"),
              propF = mean(SEX1 == "F"),
              AveAge = mean(age),
              .by = treaty)
)

#Unit: milliseconds
#      expr      min       lq     mean   median       uq      max neval
# datatable 52.12843 59.04811 76.12773 64.93694 77.99761 364.6247   100
#     dplyr 40.47459 50.06535 62.74962 57.95659 65.37164 313.1679   100
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading