Home Is there an R function to do grouped operations on a data frame without collapsing it?

Questions

Is there an R function to do grouped operations on a data frame without collapsing it?

February 28, 2023

Apologies if the question isn’t formulated correctly in the title, I am fairly new to this and still not used to the exact terminology.

I need to add a column to a data frame that contains the result of operations (e.g., mean, sum, etc.) grouped by values in other columns. At the same time, I would not like to lose any of the data in the data frame. However, I haven’t been able to find a function that does that, or a solution of any kind.

Let me illustrate with an example:

df

    Name Round Points
1   John     1      6
2   John     2      4
3   John     3      6
4   John     4      8
5  Annie     1      7
6  Annie     2      4
7  Annie     3      3
8  Annie     4      6
9   Bill     1      4
10  Bill     2      5
11  Bill     3      8
12  Bill     4      5

I would like to add a column at to the dataframe that contains the average points of each player displayed in each row:

    Name Round Points Avg.points
1   John     1      6        6.0
2   John     2      4        6.0
3   John     3      6        6.0
4   John     4      8        6.0
5  Annie     1      7        5.0
6  Annie     2      4        5.0
7  Annie     3      3        5.0
8  Annie     4      6        5.0
9   Bill     1      4        5.5
10  Bill     2      5        5.5
11  Bill     3      8        5.5
12  Bill     4      5        5.5

However, if I use group_by(Name) I lose part of the data:

df <- df %>% 
  group_by(Name) %>% 
  summarise(Avg.points = mean(Points, na.rm = TRUE))

df

# A tibble: 3 × 2
  Name  Avg.points
  <chr>      <dbl>
1 Annie        5  
2 Bill         5.5
3 John         6

If I use group_by(Name, Round, Points) I don’t get the correct average:

df <- df %>% 
  group_by(Name, Round, Points) %>% 
  summarise(Avg.points = mean(Points, na.rm = TRUE))

df

# A tibble: 12 × 4
# Groups:   Name, Round [12]
   Name  Round Points Avg.points
   <chr> <dbl>  <dbl>      <dbl>
 1 Annie     1      7          7
 2 Annie     2      4          4
 3 Annie     3      3          3
 4 Annie     4      6          6
 5 Bill      1      4          4
 6 Bill      2      5          5
 7 Bill      3      8          8
 8 Bill      4      5          5
 9 John      1      6          6
10 John      2      4          4
11 John      3      6          6
12 John      4      8          8

Thanks to anyone who could help me out with this.

>Solution :

You can use the across function within summarize to easily keep any columns that you want to keep. Using across(everything()) will maintain everything in your dataframe.

Try:

library(dplyr)
df <- df %>% 
  group_by(Name) %>% 
  summarise(across(everything()), Avg.points = mean(Points, na.rm = TRUE))

df

#> # A tibble: 12 × 4
#> # Groups:   Name [3]
#>    Name  Round Points Avg.points
#>    <chr> <dbl>  <dbl>      <dbl>
#>  1 Annie     1      7        5  
#>  2 Annie     2      4        5  
#>  3 Annie     3      3        5  
#>  4 Annie     4      6        5  
#>  5 Bill      1      4        5.5
#>  6 Bill      2      5        5.5
#>  7 Bill      3      8        5.5
#>  8 Bill      4      5        5.5
#>  9 John      1      6        6  
#> 10 John      2      4        6  
#> 11 John      3      6        6  
#> 12 John      4      8        6

If you want to order the dataframe by average number of points, add the arrange function:

library(dplyr)
df <- df %>% 
  group_by(Name) %>% 
  summarise(across(everything()), Avg.points = mean(Points, na.rm = TRUE)) %>%
  arrange(-Avg.points)

df

#> # A tibble: 12 × 4
#> # Groups:   Name [3]
#>    Name  Round Points Avg.points
#>    <chr> <dbl>  <dbl>      <dbl>
#>  1 John      1      6        6  
#>  2 John      2      4        6  
#>  3 John      3      6        6  
#>  4 John      4      8        6  
#>  5 Bill      1      4        5.5
#>  6 Bill      2      5        5.5
#>  7 Bill      3      8        5.5
#>  8 Bill      4      5        5.5
#>  9 Annie     1      7        5  
#> 10 Annie     2      4        5  
#> 11 Annie     3      3        5  
#> 12 Annie     4      6        5

Check out the documentation for more info about across, or see this release note from the creators of dplyr.