Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.

In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id‘s.

Some reproducible code:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

library(dplyr)

df1 = data.frame(ids = c(1:5),
                 var = c("a","a","c","b","b"))

df2 = data.frame(var = c('c','a','b','b','d'),
                 links = 0)

Ideally, I would like a resulting dataframe that looks like:

  var    links
1   c        3
2   a   1 or 2
3   b   4 or 5
4   b   4 or 5
5   d        0

where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".

Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".

Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.

So far, I’ve gone this route, but I’m unsure about what to put after the ~

linked_df = df2 %>%
     mutate(links=case_when(links==0 & var %in% df1$var ~ 
                            sample(c(df1$ids),n(),replace=T) # unsure about this line
                            TRUE ~ links)

>Solution :

I think this is what you want. I’ve left the ids column in the result, but
it can be removed when the sampling is complete.

library(dplyr)
library(tidyr)

df1_nest = df1 %>%
  group_by(var) %>% 
  summarize(ids = list(ids))

safe_sample = function(x, ...) {
  if(length(x) == 1) return(x)
  sample(x, ...)
}

set.seed(47)
df2 %>%
  left_join(df1_nest) %>%
  mutate(
    links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
  )
# Joining, by = "var"
#   var links  ids
# 1   c     3    3
# 2   a     1 1, 2
# 3   b     4 4, 5
# 4   b     5 4, 5
# 5   d     0 NULL
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading