Home Using aggregate/group_by in R to group data and give a count for each factor variable?

Questions

Using aggregate/group_by in R to group data and give a count for each factor variable?

December 8, 2021

I have a dataframe that looks like this. I am showing the first 6 rows out of simplicity but the total number of rows is 8236. The grades range from 0-2. I have just shown grade 0 and 1 in the example below:

 Telangiectasia_time      grade
  <chr>                    <int>
1 telangiectasia_tumour_0      0
2 telangiectasia_tumour_1      0
3 telangiectasia_tumour_12     0
4 telangiectasia_tumour_24     0
5 telangiectasia_tumour_0      1
6 telangiectasia_tumour_1      1

I want to group by Telangiectasia_Time(first column) and then count the number of grades for each group. Therefore, for the first 6 rows as an example, it should look like this:

       Telangiectasia_time grade0    grade1    grade2 
1  telangiectasia_tumour_0    1      1          0
2  telangiectasia_tumour_1    1      1          0
3 telangiectasia_tumour_12    1      0          0
4 telangiectasia_tumour_24    1      0          0

There are three columns at the end for the respective grades and a count of each grade for each variable. I tried to use the aggregate function:

**aggregate(grade ~ Telangiectasia_time, telangiectasia_tumour_data, *sum*)**

but I am not sure what to put in the last bit in the brackets so that the sum of each grade is returned. When I put sum, it simply adds the numbers and does not treat the variables as seperate (0,1 and 2). Using my full dataset I get the output which is wrong:

      Telangiectasia_time grade
1  telangiectasia_tumour_0    18
2  telangiectasia_tumour_1    11
3 telangiectasia_tumour_12    38
4 telangiectasia_tumour_24    87

I have also tried group_by() but this just gives me a total count

telangiectasia_tumour_data %>% group_by(Telangiectasia_time) %>% summarize(count =n())

  Telangiectasia_time      count
* <chr>                    <int>
1 telangiectasia_tumour_0   2059
2 telangiectasia_tumour_1   2059
3 telangiectasia_tumour_12  2059
4 telangiectasia_tumour_24  2059

>Solution :

With dpylr::count and tidyr::pivot_wider you could do:

library(dplyr)
library(tidyr)

telangiectasia_tumour_data %>% 
  count(Telangiectasia_time, grade) %>% 
  pivot_wider(names_from = grade, values_from = n, names_prefix = "grade", values_fill = 0)
#> # A tibble: 4 × 3
#>   Telangiectasia_time      grade0 grade1
#>   <chr>                     <int>  <int>
#> 1 telangiectasia_tumour_0       1      1
#> 2 telangiectasia_tumour_1       1      1
#> 3 telangiectasia_tumour_12      1      0
#> 4 telangiectasia_tumour_24      1      0

DATA

telangiectasia_tumour_data <- structure(list(Telangiectasia_time = c(
  "telangiectasia_tumour_0",
  "telangiectasia_tumour_1", "telangiectasia_tumour_12", "telangiectasia_tumour_24",
  "telangiectasia_tumour_0", "telangiectasia_tumour_1"
), grade = c(
  0L,
  0L, 0L, 0L, 1L, 1L
)), class = "data.frame", row.names = c(
  "1",
  "2", "3", "4", "5", "6"
))