Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

ifelse statement to assign values to a new column, working with lists of numeric values

I have a dataframe that looks something like this:

# Minimal example dataframe

identifier <- c(
  "A",
  "B",
  "C",
  "D",
  "E",
  "F"
)

value_1 <- c(
  "1231811, 1231877",
  "1231911, 1233069, 1232767",
  "1231919",
  NA,
  "1232135, 1233145",
  NA
)

value_2 <- c(
  1231811,
  190477,
  922661,
  950711,
  992647,
  NA
  
)

value_3 <- c(
  1231877,
  1233069,
  9774041,
  9774041,
  1314063,
  1231379
  
)

test_df <- data.frame(identifier, value_1, value_2, value_3)

  identifier                   value_1 value_2 value_3
1          A          1231811, 1231877 1231811 1231877
2          B 1231911, 1233069, 1232767  190477 1233069
3          C                   1231919  922661 9774041
4          D                      <NA>  950711 9774041
5          E          1232135, 1233145  992647 1314063
6          F                      <NA>    <NA> 1231379

I want to create a new column, "final_value," and fill it in with a single value from value_1, value_2, or value_3 following a hierarchy that prioritizes value_1 values that match values in value_2 followed by value_3. If value_1 is not NA and does not have values that match anything in value_2 or value_3, I want to fill final_value with the first value in the comma-separated value_1 string. If value_1 is NULL, fill final_value with value_2 or, if that is also null, fill in with value_3. The final dataframe would look like this:

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1231877 1231811 # 1231811 from value_1 matches value_2 (preferred match)
2          B 1231911, 1233069, 1232767  190477 1233069 1233069 # no values from value_1 match value_2; however, 1233069 from value_1 matches value_3
3          C                   1231919  922661 9774041 1231919 # no values from value_1 match other columns; just fill with value_1
4          D                      <NA>  950711 9774041 950711  # value_1 is NA, so fill in with value_2
5          E          1232135, 1233145  992647 1314063 1232135 # no values from value_1 match other columns, fill with first item from value_1 list
6          F                      <NA>    <NA> 1231379 1231379 # value_1 and value_2 are NA, so fill in with value_3

Here’s my approach so far…

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

library(purrr)
library(dplyr)

# change value_1 column into a list of numeric values 
test_df <- test_df%>% mutate(value_1 = map(value_1,function(x) (as.numeric(unlist(str_split(x,","))))))

# create a new column to hold the final selected value
test_df$final_value <- NA

# ifelse statement
test_df$final_value <- 
  
  # if any of the elements in value_1 match the value_2 value, fill the new column with value_2
  ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_2, test_df$value_2,
         
         # otherwise, if a value in value_1 matches value_3, fill in with value_3
         ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_3, test_df$value_3,
                
                # if none of the values in value_1 match the other columns, fill in with the first value_1 list value
                ifelse(!is.na(test_df$value_1) & !(test_df$value_1 %in% test_df$value_2) & !(test_df$value_1 %in% test_df$value_3), test_df$value_1, #NOTE: have tried test_df$value_1[1] and test_df$value_1[[1]] without success to get the first list item returned
                       
                       # if value_1 is NA, fill in with value_2
                       ifelse(is.na(test_df$value_1) & !is.na(test_df$value_2), test_df$value_2,
                              
                              # if value_1 is NA and value_2 is NA, fill in with value_3
                              ifelse(is.na(test_df$value_1) & is.na(test_df$value_2) & !is.na(test_df$value_3), test_df$value_3, NA
         
         
  )))))

There are a few problems with the result:

  identifier                   value_1 value_2 value_3               final_value
1          A          1231811, 1231877 1231811 1314063          1231811, 1231877
2          B 1231911, 1233069, 1232767  190477 1233069 1231911, 1233069, 1232767
3          C                   1231919  922661 9774041                   1231919
4          D                        NA  950711 9774041                    950711
5          E          1232135, 1233145  992647 1314063          1232135, 1233145
6          F                        NA      NA 1231379                   1231379

The first three lines of the ifelse are not working as anticipated. It is failing to return the matching value_2 or value_3 value in final_value and I also cannot get it to return the first list item from value_1 where there aren’t any matching value_2 or value_3 values. For the latter, I’ve tried specifying test_df$value_1[[1]][1] (and similar) but this only returns the first item in the identifer A value_1 list:

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1314063     1231811
2          B 1231911, 1233069, 1232767  190477 1233069     1231811
3          C                   1231919  922661 9774041     1231811
4          D                        NA  950711 9774041      950711
5          E          1232135, 1233145  992647 1314063     1231811
6          F                        NA      NA 1231379     1231379

Any help would be greatly appreciated.

>Solution :

First, nesting ifelse beyond 2-deep generally leads me to suggest case_when. However, in this case I think there is a much better solution without that:

func func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

library(dplyr)
test_df %>%
  mutate(
    final_value = mapply(func, strsplit(value_1, "[, ]+"), value_2, value_3)
  )
#   identifier                   value_1 value_2 value_3 final_value
# 1          A          1231811, 1231877 1231811 1231877     1231811
# 2          B 1231911, 1233069, 1232767  190477 1233069     1233069
# 3          C                   1231919  922661 9774041     1231919
# 4          D                      <NA>  950711 9774041      950711
# 5          E          1232135, 1233145  992647 1314063     1232135
# 6          F                      <NA>      NA 1231379     1231379

Because I use ... in func, this handles "0 or more" other value_* variables as you want; if you have 3 or 30 more, it will apply the same logic. Further, the order within ... matters, those listed earlier will be prioritized higher for matches.

The c(L[lengths(L) > 0], A) ensures (1) we only consider value_* that have non-empty intersections (first portion), and if all of those are empty, we use what is found in A. (In the unlikely event that A is NA and all value_* are empty, then … you get NA.)

FYI, one inner-step of this is to split your strings of comma-separated numbers into a list-column using strsplit. If you’re going to do more and similar operations that need to work on individual components within, you may prefer to keep it as such using mutate(value_1 = strsplit(value_1, "[ ,]+")) (or similar).

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading