I have a dataframe that has about 300,000 rows so I’m trying summarise some of it.
It also has about 48 different cols, but for now I’m only interested in 2 of them.
Here’s an example of what the data look like
dput(df)
structure(list(pnum = c("1951bbn", "1951bbn", "1951bbn", "1951bbn",
"1951bbn", "1951bbn", "1951bbn", "1951ggl", "1951ggl", "1951ggl",
"1951ggl", "1951ggl", "1951ggl", "1951ggl", "1951ggl", "1952bbn",
"1952bbn", "1952bbn", "1952bbn", "1952bbn", "1952bbn", "1952bbn",
"1952bbn", "1952ggl", "1952ggl", "1952ggl", "1952ggl", "1952ggl",
"1952ggl", "1952ggl", "1952ggl", "1952kkl", "1952kkl", "1952kkl",
"1952kkl", "1952kkl", "1952kkl", "1952kkl", "1952kkl", "1952kkl",
"1952llk", "1952llk", "1952llk", "1952llk", "1952llk", "1952llk",
"1952llk", "1952llk", "1952llk", "1953hhj", "1953hhj", "1953hhj",
"1953hhj", "1953hhj", "1953hhj", "1953hhj", "1953hhj", "1953hhj",
"1953ggf", "1953ggf", "1953ggf", "1953ggf", "1953ggf", "1953ggf",
"1953ggf", "1953ffl", "1953ffl", "1953ffl", "1953ffl", "1953ffl",
"1953ffl", "1953ffl", "1953ffl"), year = c(1951L, 1951L, 1951L,
1951L, 1951L, 1951L, 1951L, 1951L, 1951L, 1951L, 1951L, 1951L,
1951L, 1951L, 1951L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L,
1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L,
1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L,
1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L, 1952L,
1952L, 1953L, 1953L, 1953L, 1953L, 1953L, 1953L, 1953L, 1953L,
1953L, 1953L, 1953L, 1953L, 1953L, 1953L, 1953L, 1953L, 1953L,
1953L, 1953L, 1953L, 1953L, 1953L, 1953L, 1953L)), class = "data.frame", row.names = c(NA,
-73L))
What I would like to do is simply count the number of unique pnum per year
I’ve been trying variations on
#iinstall.packages("tidyverse")
library(tidyverse)
df %>%
group_by(year) %>%
count(pnum)
A tibble: 9 × 3
# Groups: year [3]
year pnum n
<int> <chr> <int>
1 1951 1951bbn 7
2 1951 1951ggl 8
3 1952 1952bbn 8
4 1952 1952ggl 8
5 1952 1952kkl 9
6 1952 1952llk 9
7 1953 1953ffl 8
8 1953 1953ggf 7
9 1953 1953hhj 9
This isn’t what I need as it counts the number of occurences of any pnum per year
If I try use n_distinct
df %>%
group_by(year) %>%
count(n_distinct(pnum))
# A tibble: 3 × 3
# Groups: year [3]
year `n_distinct(pnum)` n
<int> <int> <int>
1 1951 9 15
2 1952 9 34
3 1953 9 24
This isn’t what I need either
What I need is only how many different distinct pnums are in each year.
year n
1951 2
1952 4
1953 3
>Solution :
You could count the unique values by grouping on the year and summarise the results with n_distinct like this:
library(dplyr)
df %>%
group_by(year) %>%
summarise(n = n_distinct(pnum))
#> # A tibble: 3 × 2
#> year n
#> <int> <int>
#> 1 1951 2
#> 2 1952 4
#> 3 1953 3
Created on 2023-08-07 with reprex v2.0.2