I’ve got a large dataframe that I would like to apply a filter to. The condition is that if the difference between an element and it’s previous value is greater than 40, I would like to set that element to NA.
Mini-data sample:
mydataframe <- data.frame(matrix(runif(n=150, min=1, max=100), nrow=10))
I was able to create a matrix based on the difference between the element and its previous element fairly easily, but using that to filter the original data frame is proving tricky. This is the calculation I used to generate the difference matrix:
calculatedifference <- function(a) {
difference = a - lag(a)
difference
}
differencematrix <- sapply(mydataframe, calculatedifference)
I tried the following, but it returned a dataframe filled with NA:
differencefilter <- function(a){
difference = a - lag(a)
case_when(
difference > 40 ~ NA
)
}
Probably because I am feeding case_when() an entire vector and it’s setting the whole thing to NA rather than whichever elements fulfill the condition. Not sure how to sequence the command for each individual element within each vector. My original plan was to build a function for a single vector, and then sapply it to the original dataframe, but if there’s a more efficient way of accomplishing the goal, that would be great too. Again, I want elements in the original mydataframe set to NA, not the values in differencematrix. Thanks for any help!
>Solution :
Your method is good, the problem is that you don’t set a default value for case_when when your condition isn’t met, and case_when‘s default default is NA. If you change your code to
case_when(
difference > 40 ~ NA, .default = a
)
I think it will work. I would go a little more directly, like this:
set.seed(47)
mydataframe <- data.frame(matrix(runif(n=50, min=1, max=100), nrow=5))
library(dplyr)
mydataframe
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 97.71924 69.44983 14.74096 90.295379 18.63639 38.622468 75.13312 44.52890 6.806878 21.08084
# 2 38.01769 39.51712 70.49673 40.649777 56.12284 4.342001 37.52843 24.84979 37.672119 66.51133
# 3 76.38870 47.42565 17.05717 4.063552 52.06370 4.652532 47.30130 97.66172 60.239314 96.34999
# 4 82.42667 54.78766 60.33139 8.064458 14.20961 54.107717 20.16720 48.89779 93.813872 37.83152
# 5 57.78090 92.56431 51.09757 47.363336 69.23906 80.155587 89.79590 68.70130 37.310393 85.85034
mydataframe |>
mutate(across(everything(), \(x) abs(ifelse(abs(x - lag(x, default = first(x))) > 40, NA, x))))
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 97.71924 69.44983 14.74096 90.295379 18.63639 38.622468 75.13312 44.52890 6.806878 21.08084
# 2 NA 39.51712 NA NA 56.12284 4.342001 37.52843 24.84979 37.672119 NA
# 3 76.38870 47.42565 NA 4.063552 52.06370 4.652532 47.30130 NA 60.239314 96.34999
# 4 82.42667 54.78766 NA 8.064458 14.20961 NA 20.16720 NA 93.813872 NA
# 5 57.78090 92.56431 51.09757 47.363336 NA 80.155587 NA 68.70130 NA NA
## or using case_when in case you have more conditions
mydataframe |>
mutate(across(everything(), \(x)
case_when(
abs(x - lag(x)) > 40 ~ NA,
.default = x
)
))
## same result as above