Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Is there an R function to do grouped operations on a data frame without collapsing it?

Apologies if the question isn’t formulated correctly in the title, I am fairly new to this and still not used to the exact terminology.

I need to add a column to a data frame that contains the result of operations (e.g., mean, sum, etc.) grouped by values in other columns. At the same time, I would not like to lose any of the data in the data frame. However, I haven’t been able to find a function that does that, or a solution of any kind.

Let me illustrate with an example:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df

    Name Round Points
1   John     1      6
2   John     2      4
3   John     3      6
4   John     4      8
5  Annie     1      7
6  Annie     2      4
7  Annie     3      3
8  Annie     4      6
9   Bill     1      4
10  Bill     2      5
11  Bill     3      8
12  Bill     4      5

I would like to add a column at to the dataframe that contains the average points of each player displayed in each row:

    Name Round Points Avg.points
1   John     1      6        6.0
2   John     2      4        6.0
3   John     3      6        6.0
4   John     4      8        6.0
5  Annie     1      7        5.0
6  Annie     2      4        5.0
7  Annie     3      3        5.0
8  Annie     4      6        5.0
9   Bill     1      4        5.5
10  Bill     2      5        5.5
11  Bill     3      8        5.5
12  Bill     4      5        5.5

However, if I use group_by(Name) I lose part of the data:

df <- df %>% 
  group_by(Name) %>% 
  summarise(Avg.points = mean(Points, na.rm = TRUE))

df

# A tibble: 3 × 2
  Name  Avg.points
  <chr>      <dbl>
1 Annie        5  
2 Bill         5.5
3 John         6  

If I use group_by(Name, Round, Points) I don’t get the correct average:

df <- df %>% 
  group_by(Name, Round, Points) %>% 
  summarise(Avg.points = mean(Points, na.rm = TRUE))

df

# A tibble: 12 × 4
# Groups:   Name, Round [12]
   Name  Round Points Avg.points
   <chr> <dbl>  <dbl>      <dbl>
 1 Annie     1      7          7
 2 Annie     2      4          4
 3 Annie     3      3          3
 4 Annie     4      6          6
 5 Bill      1      4          4
 6 Bill      2      5          5
 7 Bill      3      8          8
 8 Bill      4      5          5
 9 John      1      6          6
10 John      2      4          4
11 John      3      6          6
12 John      4      8          8

Thanks to anyone who could help me out with this.

>Solution :

You can use the across function within summarize to easily keep any columns that you want to keep. Using across(everything()) will maintain everything in your dataframe.

Try:

library(dplyr)
df <- df %>% 
  group_by(Name) %>% 
  summarise(across(everything()), Avg.points = mean(Points, na.rm = TRUE))

df

#> # A tibble: 12 × 4
#> # Groups:   Name [3]
#>    Name  Round Points Avg.points
#>    <chr> <dbl>  <dbl>      <dbl>
#>  1 Annie     1      7        5  
#>  2 Annie     2      4        5  
#>  3 Annie     3      3        5  
#>  4 Annie     4      6        5  
#>  5 Bill      1      4        5.5
#>  6 Bill      2      5        5.5
#>  7 Bill      3      8        5.5
#>  8 Bill      4      5        5.5
#>  9 John      1      6        6  
#> 10 John      2      4        6  
#> 11 John      3      6        6  
#> 12 John      4      8        6  

If you want to order the dataframe by average number of points, add the arrange function:

library(dplyr)
df <- df %>% 
  group_by(Name) %>% 
  summarise(across(everything()), Avg.points = mean(Points, na.rm = TRUE)) %>%
  arrange(-Avg.points)

df

#> # A tibble: 12 × 4
#> # Groups:   Name [3]
#>    Name  Round Points Avg.points
#>    <chr> <dbl>  <dbl>      <dbl>
#>  1 John      1      6        6  
#>  2 John      2      4        6  
#>  3 John      3      6        6  
#>  4 John      4      8        6  
#>  5 Bill      1      4        5.5
#>  6 Bill      2      5        5.5
#>  7 Bill      3      8        5.5
#>  8 Bill      4      5        5.5
#>  9 Annie     1      7        5  
#> 10 Annie     2      4        5  
#> 11 Annie     3      3        5  
#> 12 Annie     4      6        5 

Check out the documentation for more info about across, or see this release note from the creators of dplyr.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading