How to find out which situation has the most missing data

October 24, 2022

I have a dataset like this (df)

ID	Community	Time	BIP	EXP
1	1	BF	12000	500
1	1	DF	NA	NA
1	1	AF	8000	NA
2	1	BF	13000	300
2	1	DF	12000	200
2	1	AF	11000	120

This df has 40’000 observations. I would like to find out which of these situation in the time column (BF= Before financial crisis; DF = during finance crisis; AF= After Finance Crisis) has the most missing data in all the columns BIP and EXP?

I gave the following code to find out how many missing data are in BIP and EXP

sapply(df, function(x) sum(is.na(x)))

It shows that BIP has 55 missing data and EXP has 34 missing data. But no information for the time situation.

Could someone please help?

>Solution :

You can use aggregate:

aggregate(cbind(BIP = is.na(BIP), EXP = is.na(EXP)) ~ Time, data = dat, sum)
#Also works
#with(dat, aggregate(list(BIP = BIP, EXP = EXP), list(Time = Time), function(x) sum(is.na(x))))

  Time BIP EXP
1   AF   0   1
2   BF   0   0
3   DF   1   1

Or in dplyr:

library(dplyr)
dat %>% 
  group_by(Time) %>% 
  summarise(across(BIP:EXP, ~ sum(is.na(.x))))

  Time    BIP   EXP
1 AF        0     1
2 BF        0     0
3 DF        1     1