Problem for mutate in dplyr: which.max() returns integer(0) instead of NA

Advertisements

I am creating a master dataset. I need to calculate a new variable which returns the column position of the maximum value in each row. It is real life data so there are missing observations throughout the master dataset.

data.frame(v1 = c(3,2,NA,6),
           v2 = c(NA,1,6,7),
           v3 = c(1,1,NA,1)) -> df1

df1

# output
#
#   v1 v2 v3
# 1  3 NA  1
# 2  2  1  1
# 3 NA  6 NA
# 4  6  7  1

What I need can be calculated easily using which.max within mutate() in dplyr.

library(dplyr)
df1 %>%
  rowwise %>%
    mutate(maxCol = which.max(c_across(everything())))

# Ouput
#
# A tibble: 4 × 4
# Rowwise: 
#      v1    v2    v3 maxCol
#   <dbl> <dbl> <dbl>  <int>
# 1     3    NA     1      1
# 2     2     1     1      1
# 3    NA     6    NA      2
# 4     6     7     1      2

If there are NAs in a row which.max simply ignores them. However if a row has ALL NAs…

data.frame(v1 = c(3,2,NA,6),
           v2 = c(NA,1,NA,7),
           v3 = c(1,1,NA,1)) -> df2

df2

# output
#
#   v1 v2 v3
# 1  3 NA  1
# 2  2  1  1
# 3 NA NA NA
# 4  6  7  1

…we run into a problem

df2 %>%
  rowwise %>%
    mutate(maxCol = which.max(c_across(everything())))

# Error in `mutate()`:
#   ℹ In argument: `maxCol = which.max(c_across(everything()))`.
# ℹ In row 3.
# Caused by error:
#   ! `maxCol` must be size 1, not 0.
# ℹ Did you mean: `maxCol = list(which.max(c_across(everything())))` ?
#   Run `rlang::last_trace()` to see where the error occurred.

If we follow the advice in the error message we CAN calculate a variable by putting everything into a list

df2 %>%
  rowwise %>%
    mutate(maxCol = list(which.max(c_across(everything())))) -> dfList

dfList

# output
#
# A tibble: 4 × 4
# Rowwise: 
#      v1    v2    v3 maxCol   
#   <dbl> <dbl> <dbl> <list>   
# 1     3    NA     1 <int [1]>
# 2     2     1     1 <int [1]>
# 3    NA    NA    NA <int [0]>
# 4     6     7     1 <int [1]>

However this creates all sorts of other problems. For example if you unlist() the maxCol variable it returns 3 values instead of 4.

Can anyone suggest a way around this using dplyr, whereby which.max returns NA when it encounters a row with all NA rather than returning integer(0)? Or maybe an alternative approach within dplyr using another function?

I can’t simply remove the rows with all NA for the variables in question from the master dataset because the rows in question are only all NA for the variables used to calculate the new variable, i.e. they have valid observations for other variables that prevent me from deleting them listwise.

>Solution :

We can use an ifelse to catch all rows with NA (set these rows to NA, if not, proceed with your command).

library(dplyr)

df2 %>% 
  rowwise() %>% 
  mutate(which.max = ifelse(all(is.na(c_across(everything()))), NA, which.max(c_across(everything()))))

# A tibble: 4 × 4
# Rowwise: 
     v1    v2    v3 which.max
  <dbl> <dbl> <dbl>     <int>
1     3    NA     1         1
2     2     1     1         1
3    NA    NA    NA        NA
4     6     7     1         2

Leave a ReplyCancel reply