I have a dataset with two columns. The first column indicates the group, and each group has exactly two rows. The second column holds the category. I would like to compute the percentage of groups whose two rows do not share the same category. For example, in rows 1 and 2 the Category differs, while in rows 3 and 4 it is the same. For the data below, I would expect 66.67%, because the Category differs in four of the six groups and stays the same in the other two.
This is my data:
structure(list(Group = c("A", "A", "B", "B", "C", "C", "D", "D",
"E", "E", "F", "F"), Category = c(1L, 2L, 3L, 3L, 5L, 6L, 7L,
7L, 7L, 6L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-12L))
I have tried the following so far:
Data <- Data %>%
group_by(Group) %>%
count(n())
But I don’t know how to write the last line to get my desired percentage. Could someone help me here?
>Solution :
A base R solution with tapply(), where df is the data frame from the question:
mean(with(df, tapply(Category, Group, \(x) length(unique(x)))) > 1)
# [1] 0.6666667
With dplyr, you could use n_distinct() to count the number of unique Category values in each group, then take the share of groups where that count exceeds 1.
library(dplyr)
df %>%
group_by(Group) %>%
summarise(N = n_distinct(Category)) %>%
summarise(Percent = mean(N > 1))
# # A tibble: 1 × 1
# Percent
# <dbl>
# 1 0.667
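Both approaches return a proportion rather than a percentage. As a minimal, self-contained sketch (the names differs and pct are only illustrative, and the data frame is rebuilt here under the name df, which the question does not assign), the result could be converted to the 66.67% mentioned in the question like this:

```r
# Rebuild the example data from the question (named df, as in the answer)
df <- data.frame(
  Group    = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Category = c(1L, 2L, 3L, 3L, 5L, 6L, 7L, 7L, 7L, 6L, 5L, 4L)
)

# TRUE for each group whose rows contain more than one distinct Category
differs <- tapply(df$Category, df$Group, function(x) length(unique(x)) > 1)

# Share of groups with differing categories, expressed as a percentage
pct <- round(100 * mean(differs), 2)
pct
# [1] 66.67
```

Taking the mean of a logical vector gives the fraction of TRUE values, which is why no explicit counting or division by the number of groups is needed.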