I am creating a master dataset. I need to calculate a new variable which returns the column position of the maximum value in each row. It is real life data so there are missing observations throughout the master dataset.
data.frame(v1 = c(3,2,NA,6),
v2 = c(NA,1,6,7),
v3 = c(1,1,NA,1)) -> df1
df1
# output
#
# v1 v2 v3
# 1 3 NA 1
# 2 2 1 1
# 3 NA 6 NA
# 4 6 7 1
What I need can be calculated easily using which.max
within mutate()
in dplyr.
library(dplyr)
df1 %>%
rowwise %>%
mutate(maxCol = which.max(c_across(everything())))
# Ouput
#
# A tibble: 4 × 4
# Rowwise:
# v1 v2 v3 maxCol
# <dbl> <dbl> <dbl> <int>
# 1 3 NA 1 1
# 2 2 1 1 1
# 3 NA 6 NA 2
# 4 6 7 1 2
If there are NAs in a row which.max simply ignores them. However if a row has ALL NAs…
data.frame(v1 = c(3,2,NA,6),
v2 = c(NA,1,NA,7),
v3 = c(1,1,NA,1)) -> df2
df2
# output
#
# v1 v2 v3
# 1 3 NA 1
# 2 2 1 1
# 3 NA NA NA
# 4 6 7 1
…we run into a problem
df2 %>%
rowwise %>%
mutate(maxCol = which.max(c_across(everything())))
# Error in `mutate()`:
# ℹ In argument: `maxCol = which.max(c_across(everything()))`.
# ℹ In row 3.
# Caused by error:
# ! `maxCol` must be size 1, not 0.
# ℹ Did you mean: `maxCol = list(which.max(c_across(everything())))` ?
# Run `rlang::last_trace()` to see where the error occurred.
If we follow the advice in the error message we CAN calculate a variable by putting everything into a list
df2 %>%
rowwise %>%
mutate(maxCol = list(which.max(c_across(everything())))) -> dfList
dfList
# output
#
# A tibble: 4 × 4
# Rowwise:
# v1 v2 v3 maxCol
# <dbl> <dbl> <dbl> <list>
# 1 3 NA 1 <int [1]>
# 2 2 1 1 <int [1]>
# 3 NA NA NA <int [0]>
# 4 6 7 1 <int [1]>
However this creates all sorts of other problems. For example if you unlist()
the maxCol
variable it returns 3 values instead of 4.
Can anyone suggest a way around this using dplyr, whereby which.max
returns NA
when it encounters a row with all NA
rather than returning integer(0)
? Or maybe an alternative approach within dplyr using another function?
I can’t simply remove the rows with all NA for the variables in question from the master dataset because the rows in question are only all NA for the variables used to calculate the new variable, i.e. they have valid observations for other variables that prevent me from deleting them listwise.
>Solution :
We can use an ifelse
to catch all rows with NA
(set these rows to NA
, if not, proceed with your command).
library(dplyr)
df2 %>%
rowwise() %>%
mutate(which.max = ifelse(all(is.na(c_across(everything()))), NA, which.max(c_across(everything()))))
# A tibble: 4 × 4
# Rowwise:
v1 v2 v3 which.max
<dbl> <dbl> <dbl> <int>
1 3 NA 1 1
2 2 1 1 1
3 NA NA NA NA
4 6 7 1 2