How to summarize by group while retrieving values from columns that weren't summarized

I’m trying to summarize a data frame while grouping by a variable. My problem is that summarizing drops the other columns, which I still need.

Consider the following data:

df <- 
  tibble::tribble(
    ~id, ~year, ~my_value,
    1,   2010,  2,
    1,   2013,  2,
    1,   2014,  2,
    2,   2010,  4,
    2,   2012,  3,
    2,   2014,  4,
    2,   2015,  2,
    3,   2015,  3,
    3,   2010,  3,
    3,   2011,  3
  )

I want to group by id in order to collapse my_value to a single value. I use the following algorithm:

  1. IF all values of my_value are identical, then simply return the first value, i.e., my_value[1].
  2. ELSE return the smallest value, i.e., min(my_value).

So I wrote a small function that does it:

my_func <- function(x) {
  if (var(x) == 0) {
    return(x[1])
  }
  # else:
  min(x)
}

And now I can use either dplyr or data.table to summarize by id:

library(dplyr)
library(data.table)

# dplyr
df %>%
  group_by(id) %>%
  summarise(my_min_val = my_func(my_value))
#> # A tibble: 3 x 2
#>      id my_min_val
#>   <dbl>      <dbl>
#> 1     1          2
#> 2     2          2
#> 3     3          3

# data.table
setDT(df)[, .(my_min_val = my_func(my_value)), by = "id"]
#>    id my_min_val
#> 1:  1          2
#> 2:  2          2
#> 3:  3          3

So far so good. My problem is that I lost the year value. I want the respective year value for each chosen my_value.

My desired output should look like:

# desired output
desired_output <- 
  tribble(~id, ~my_min_val, ~year,
          1,   2,           2010,  # because for id 1, var(my_value) is 0, and hence my_value[1] corresponds to year 2010
          2,   2,           2015,  # because for id 2, var(my_value) is not 0, and hence min(my_value) (which is 2) corresponds to year 2015
          3,   3,           2015)  # because for id 3, var(my_value) is 0, hence my_value[1] corresponds to year 2015

I am especially interested in a data.table solution because my real data is very large (over 1 million rows) and has many groups, so efficiency is important. Thanks!

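Solution:

Instead of returning the value, have the function return the index of the row to keep and pass that index to slice, which keeps every column of the selected row: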

library(dplyr)
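# Returns a row index, not a value: 1 if all values are equal, else the position of the first minimum
my_func <- function(x) if (var(x) == 0) 1 else which.min(x)
df %>%
  group_by(id) %>%
  slice(my_func(my_value)) %>%
  ungroup()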

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3
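
One caveat with my_func above: var() returns NA for a group with a single row, so the if() condition fails with an error on such groups. The following is a minimal sketch of an alternative that avoids var(); the name pick_row is hypothetical, and the sketch assumes my_value contains no NA values.

```r
# Hypothetical helper (not part of the original answer): returns the index of the row to keep.
# Assumes x has at least one element and contains no NA values.
pick_row <- function(x) {
  # Single-row groups and groups whose values are all equal keep their first row.
  if (length(x) == 1L || all(x == x[1])) {
    return(1L)
  }
  # Otherwise keep the first row holding the minimum.
  which.min(x)
}

idx_equal  <- pick_row(c(2, 2, 2))     # all values identical
idx_mixed  <- pick_row(c(4, 3, 4, 2))  # minimum sits in the 4th position
idx_single <- pick_row(5)              # single-row group, where var(5) would be NA
```

It can be dropped into the same place as my_func, e.g. slice(pick_row(my_value)).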

Or using data.table, with the index-returning my_func defined above:

library(data.table)
setDT(df)[df[, .I[my_func(my_value)], id]$V1]
   id year my_value
1:  1 2010        2
2:  2 2015        2
3:  3 2015        3
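
To make the one-liner above easier to follow, here is a sketch that splits it into two steps. It rebuilds the question's data as a data.table called dt (a name chosen for this illustration) and uses which.min() directly, on the assumption that my_value has no NA values.

```r
library(data.table)

# Same data as in the question, built directly as a data.table.
dt <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  year     = c(2010, 2013, 2014, 2010, 2012, 2014, 2015, 2015, 2010, 2011),
  my_value = c(2, 2, 2, 4, 3, 4, 2, 3, 3, 3)
)

# Step 1: within each id, .I holds the row numbers in the full table;
# which.min() selects the position of the first minimum in that group.
rows <- dt[, .I[which.min(my_value)], by = id]$V1

# Step 2: subset the full table by those row numbers, keeping every column.
result <- dt[rows]
```

Collecting row numbers first and subsetting once is a common data.table idiom for large tables, since it avoids materializing a per-group subset of all columns.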

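Or with slice_min and with_ties = FALSE. Both branches of the rule select the first row holding the group's minimum (when all values are equal, that is simply the first row), so no helper function is needed: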

df %>%
  group_by(id) %>%
  slice_min(n = 1, order_by = my_value, with_ties = FALSE) %>%
  ungroup()

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3
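
The results above keep the original column name and order (id, year, my_value), while the desired output in the question uses id, my_min_val, year. As a small sketch, a select() at the end of the pipeline can rename and reorder in one step; the object name out is only for illustration.

```r
library(dplyr)

# Same data as in the question.
df <- tibble::tribble(
  ~id, ~year, ~my_value,
  1,   2010,  2,
  1,   2013,  2,
  1,   2014,  2,
  2,   2010,  4,
  2,   2012,  3,
  2,   2014,  4,
  2,   2015,  2,
  3,   2015,  3,
  3,   2010,  3,
  3,   2011,  3
)

out <- df %>%
  group_by(id) %>%
  slice_min(order_by = my_value, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # Rename my_value and put the columns in the order of the desired output.
  select(id, my_min_val = my_value, year)
```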