I have a simple data frame consisting of (account) IDs and default rates for five years. Many default rates are missing. The data can be generated as follows:
ID = rep(1:50, each = 5)
def= rnorm(n=250, mean=0.5, sd=0.2)
ind= which(def %in% sample(def, 100))
def[ind] = NA
df = data.frame(ID = ID, Def = def)
And looks like this:
> head(df, 20)
ID Def
1 1 0.39506938
2 1 NA
3 1 0.42946603
4 1 NA
5 1 NA
6 2 0.45125199
7 2 0.40519126
8 2 NA
9 2 0.65082718
10 2 NA
11 3 NA
12 3 0.46132736
13 3 0.06324983
14 3 0.72630862
15 3 0.63996092
16 4 0.72093890
17 4 NA
18 4 NA
19 4 0.61471461
20 4 0.51788498
How can show the ID numbers for which at least 4 of the 5 default rates are not NAs?
>Solution :
You may try
library(dplyr)
df %>%
group_by(ID) %>%
dplyr::summarize(p = sum(!is.na(Def))/n()) %>%
filter(p >= 0.8) %>% # or > 0.8?
pull(ID)
[1] 2 8 9 11 13 17 20 23 25 27 28 29 33 38 44 45 47 49