How to summarize by group while retrieving values from columns that weren't summarized

December 6, 2021

I’m trying to summarize a data frame while grouping by a variable. My problem is that summarizing drops the other columns, which I still need.

Consider the following data:

df <- 
  tibble::tribble(
    ~id, ~year, ~my_value,
    1,   2010,  2,
    1,   2013,  2,
    1,   2014,  2,
    2,   2010,  4,
    2,   2012,  3,
    2,   2014,  4,
    2,   2015,  2,
    3,   2015,  3,
    3,   2010,  3,
    3,   2011,  3
  )

I want to group by id and collapse my_value to a single value per group, using the following rule:

IF all values of my_value are identical, return the first value, i.e., my_value[1].
ELSE return the smallest value, i.e., min(my_value).

So I wrote a small function that does it:

my_func <- function(x) {
  if (var(x) == 0) {
    return(x[1])
  }
  # else:
  min(x)
}

And now I can use either dplyr or data.table to summarize by id:

library(dplyr)
library(data.table)

# dplyr
df %>%
  group_by(id) %>%
  summarise(my_min_val = my_func(my_value))
#> # A tibble: 3 x 2
#>      id my_min_val
#>   <dbl>      <dbl>
#> 1     1          2
#> 2     2          2
#> 3     3          3

# data.table
setDT(df)[, .(my_min_val = my_func(my_value)), by = "id"]
#>    id my_min_val
#> 1:  1          2
#> 2:  2          2
#> 3:  3          3

So far so good. My problem is that I lost the year column. I want the year that corresponds to each chosen my_value.

My desired output should look like:

# desired output
desired_output <- 
  tribble(~id, ~my_min_val, ~year,
          1,   2,           2010,  # because for id 1, var(my_value) is 0, and hence my_value[1] corresponds to year 2010
          2,   2,           2015,  # because for id 2, var(my_value) is not 0, and hence min(my_value) (which is 2) corresponds to year 2015
          3,   3,           2015)  # because for id 3, var(my_value) is 0, hence my_value[1] corresponds to year 2015

I'm especially looking for a data.table solution because my real data is large (over 1 million rows) with many groups, so efficiency matters. Thanks!

Solution:

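We can have the function return a row position instead of a value, and pass that position to slice:

library(dplyr)
# Return a row position: 1 if all values are identical,
# otherwise the position of the first minimum
my_func <- function(x) if (var(x) == 0) 1 else which.min(x)
df %>% 
   group_by(id) %>% 
   slice(my_func(my_value)) %>%
   ungroup()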

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3
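
One caveat with the var(x) == 0 test, offered as a hedged side note: for a group with a single row, var() returns NA, so if() stops with an error. Below is a minimal sketch of a variant that avoids this, assuming my_value contains no NA values. The name pick_row is hypothetical and not part of the original answer.

```r
# Hypothetical helper: return the position of the row to keep within one group.
pick_row <- function(x) {
  # All values identical (including a single-row group): keep the first row.
  if (length(unique(x)) == 1L) {
    return(1L)
  }
  # Otherwise keep the first row that holds the smallest value.
  which.min(x)
}

pick_row(c(2, 2, 2))     # all identical
pick_row(c(4, 3, 4, 2))  # minimum is in position 4
pick_row(5)              # single-row group, where var(5) would be NA
```

It could be passed to slice() or to .I[...] in the same way as my_func.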

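Or using data.table. Here .I holds the row numbers within the full table, so the inner call collects one row number per id and the outer call subsets df by them:

library(data.table)
# uses the index-returning my_func defined above
setDT(df)[df[, .I[my_func(my_value)], id]$V1]
#>    id year my_value
#> 1:  1 2010        2
#> 2:  2 2015        2
#> 3:  3 2015        3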

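Or with slice_min and with_ties = FALSE, which keeps only the first row among tied minima, so a group whose values are all identical returns its first row: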

df %>%
    group_by(id) %>% 
    slice_min(n = 1, order_by = my_value, with_ties = FALSE)  %>%
    ungroup

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3
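
The slice_min() variant works because the rule in the question amounts to "take the first row holding the minimum". The same observation seems to carry over to data.table: which.min() returns the position of the first minimum, so it already returns 1 when every value in a group is identical, and the var() check may not be needed. A sketch under that assumption, rebuilding the example data from the question so it runs on its own; the names dt, keep and result are illustrative, and the rename to my_min_val is only there to match the desired output:

```r
library(data.table)

dt <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  year     = c(2010, 2013, 2014, 2010, 2012, 2014, 2015, 2015, 2010, 2011),
  my_value = c(2, 2, 2, 4, 3, 4, 2, 3, 3, 3)
)

# .I holds row numbers in the full table; which.min() picks the first
# minimum within each id, giving one global row number per group.
keep <- dt[, .I[which.min(my_value)], by = id]$V1

# Subset to those rows and rename my_value to match the desired output.
result <- dt[keep, .(id, my_min_val = my_value, year)]
result
```

This also sidesteps the single-row-group issue, since which.min() on a length-one vector returns 1.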

by MR