Is there a way to get outliers out of a column in R?

I’m trying to remove outliers from one column of my data set in R, but the code my professor gave me is causing problems. When I run it, it returns NA for every observation in every column.

Here is the line of code:

MainData <- MainData[MainData$GDP_2006 < mean(MainData$GDP_2006) + sd(MainData$GDP_2006)*2, ]

Any suggestions or solutions would be greatly appreciated!

Solution:

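I strongly suspect the problem is caused by missing data. mean() and sd() return NA when their input contains an NA, so the comparison is NA for every row and the subset comes back as rows of NA. There are two ways to deal with this: filter out the observations with missing data first, or add na.rm = TRUE to your mean() and sd() calls. This recreates your problem: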

# Create demo data
df1 <- mtcars
df1[1, "mpg"] <- NA

# Problem:
df1[df1$mpg < mean(df1$mpg) + sd(df1$mpg) * 2, ]
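To see why a single missing value is enough, here is a minimal sketch on a tiny made-up vector (the names x, toy, keep_bad and keep_narm are illustrative only, not from the question). It also shows that in base R, na.rm = TRUE on its own still leaves one all-NA row behind, because the comparison for the missing value is itself NA; wrapping the condition in which() is one way to drop it.

```r
# Toy data, invented for illustration
x   <- c(10, 12, NA, 50)
toy <- data.frame(x = x, label = letters[1:4])

# Without na.rm, mean() and sd() are NA, so every comparison is NA
keep_bad <- x < mean(x) + 2 * sd(x)
bad <- toy[keep_bad, ]          # one all-NA row per NA in the index

# With na.rm = TRUE the threshold is a real number,
# but the comparison for the missing value is still NA
keep_narm <- x < mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE)
still_na  <- toy[keep_narm, ]       # keeps an all-NA row
clean     <- toy[which(keep_narm), ] # which() drops NA positions
```

dplyr's filter() and data.table's i argument both drop rows where the condition is NA, so this extra step only matters for base R bracket indexing.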

There are three common ways to approach this in R: base R, tidyverse and data.table. All three are shown below; my personal preference is data.table, but tidyverse is extremely popular.

# Base R way ===========================================================
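# Solution 1 (use na.rm; which() drops the row where the comparison is NA):
df1[which(df1$mpg < mean(df1$mpg, na.rm=TRUE) + sd(df1$mpg, na.rm=TRUE) * 2), ]
# Solution 2 (filter out NAs first, into a copy so df1 keeps its NA for the examples below):
df1_complete <- df1[!is.na(df1$mpg), ]
df1_complete[df1_complete$mpg < mean(df1_complete$mpg) + sd(df1_complete$mpg) * 2, ]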


# Tidyverse way ========================================================
# Set up:
library(dplyr)

# Solution 1 (use na.rm):
df1 %>% 
  filter(mpg < mean(mpg, na.rm = TRUE) + sd(mpg, na.rm = TRUE)*2)

# Solution 2 (filter out NAs first):
df1 %>% 
  filter(!is.na(mpg)) %>% 
  filter(mpg < mean(mpg) + sd(mpg)*2)


# Data.table way =======================================================
# Set up:
library(data.table)
setDT(df1, keep.rownames = TRUE)

# Solution 1 (use na.rm):
df1[mpg < mean(mpg, na.rm=TRUE) + sd(mpg, na.rm=TRUE) * 2]

# Solution 2 (filter out NAs first):
df1[!is.na(mpg)][mpg < mean(mpg) + sd(mpg) * 2]
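
Applied to the column from the question, the base R version might look like the sketch below. The MainData frame here is a hypothetical stand-in with invented values (the real data set was not posted); only the column name GDP_2006 comes from the question, and cutoff and MainData_trimmed are illustrative names.

```r
# Hypothetical stand-in for the asker's data; all values are invented
MainData <- data.frame(
  Country  = paste0("C", 1:11),
  GDP_2006 = c(1.0, 1.2, 0.9, 1.5, 1.1, NA, 1.3, 0.8, 1.4, 1.0, 40)
)

# Upper cutoff: mean plus two standard deviations, ignoring missing values
cutoff <- mean(MainData$GDP_2006, na.rm = TRUE) +
  2 * sd(MainData$GDP_2006, na.rm = TRUE)

# which() keeps only rows where the comparison is TRUE (NA rows are dropped)
MainData_trimmed <- MainData[which(MainData$GDP_2006 < cutoff), ]
```

Note that, like the original line, this only trims values above the cutoff, and it also drops rows where GDP_2006 is missing.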