Why are there duplicate rows after using group_by and mutate?

February 21, 2023

The sample data is shown below:

n	period	age
15	1991	5
20	1991	5
16	1991	15
29	1991	15
77	1991	25
44	1991	25

I use the following code to get the sum from the data grouped by period and age:

# The dataset is named a.
a %>% group_by(period, age) %>%
      mutate(n = sum(n))

But the result is:

n	period	age
35	1991	5
35	1991	5
45	1991	15
45	1991	15
121	1991	25
121	1991	25

Why are there duplicate rows? Is it because it sums every element in each group?

>Solution :

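You need to use the summarize() function instead. mutate() adds or overwrites a column but keeps every input row, so each row in a group receives that group's sum and the rows look duplicated. summarize() collapses each group into a single row. Here’s a reproducible example: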

##Load dplyr, installing it first if it is missing##
if(!require(dplyr)){
  install.packages("dplyr")
  library(dplyr)
}

##Creating the data##
n<-c(15,20,16,29,77,44)
period<-rep(1991, 6)
age<-c(5,5,15,15,25,25)

a<-data.frame(n=n, period=period, age=age)

##Calculation with summarize()##
a %>% group_by(period, age) %>% summarize(n= sum(n))
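
To make the difference concrete, here is a minimal sketch that runs both verbs on the same data and compares the row counts. The object names with_mutate, with_summarize, and deduplicated are made up for illustration, and the .groups argument assumes dplyr 1.0.0 or later. It also shows distinct() as one possible way to drop the repeated rows if you would rather keep mutate().

```r
library(dplyr)

# Same sample data as in the question
a <- data.frame(
  n      = c(15, 20, 16, 29, 77, 44),
  period = rep(1991, 6),
  age    = c(5, 5, 15, 15, 25, 25)
)

# mutate() keeps all six rows; every row carries its group's total
with_mutate <- a %>%
  group_by(period, age) %>%
  mutate(n = sum(n)) %>%
  ungroup()

# summarize() returns one row per (period, age) group
# (.groups = "drop" assumes dplyr >= 1.0.0)
with_summarize <- a %>%
  group_by(period, age) %>%
  summarize(n = sum(n), .groups = "drop")

# If mutate() is kept, distinct() removes the repeated rows afterwards
deduplicated <- with_mutate %>% distinct()

nrow(with_mutate)     # 6
nrow(with_summarize)  # 3
nrow(deduplicated)    # 3
```

In this case summarize() is the more direct choice. The mutate() plus distinct() route mainly makes sense when the group total is needed alongside other per-row columns before the rows are collapsed.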