Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to find out which situation has the most missing data

I have a dataset like this (df)

ID Community Time BIP EXP
1 1 BF 12000 500
1 1 DF NA NA
1 1 AF 8000 NA
2 1 BF 13000 300
2 1 DF 12000 200
2 1 AF 11000 120

This df has 40’000 observations. I would like to find out which of these situation in the time column (BF= Before financial crisis; DF = during finance crisis; AF= After Finance Crisis) has the most missing data in all the columns BIP and EXP?

I gave the following code to find out how many missing data are in BIP and EXP

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

sapply(df, function(x) sum(is.na(x)))

It shows that BIP has 55 missing data and EXP has 34 missing data. But no information for the time situation.

Could someone please help?

>Solution :

You can use aggregate:

aggregate(cbind(BIP = is.na(BIP), EXP = is.na(EXP)) ~ Time, data = dat, sum)
#Also works
#with(dat, aggregate(list(BIP = BIP, EXP = EXP), list(Time = Time), function(x) sum(is.na(x))))

  Time BIP EXP
1   AF   0   1
2   BF   0   0
3   DF   1   1

Or in dplyr:

library(dplyr)
dat %>% 
  group_by(Time) %>% 
  summarise(across(BIP:EXP, ~ sum(is.na(.x))))

  Time    BIP   EXP
1 AF        0     1
2 BF        0     0
3 DF        1     1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading