In R, how do you find the number of different values between 2 character strings?

August 2, 2024

I’m trying to see the number of new employees a manager got between time 1 and time 2. I have a string of all employee ids that roll up under that manager.

My code below always says there is 1 new employee, but as you can see, there are 2. How do I find out how many new employees there are? The ids aren’t guaranteed to always be in the same order, but they will always be separated by a ", ".

library(dplyr)
library(stringr)

#First data set
mydata_q2 <- tibble(
  leader = 1,
  reports_q2 = "2222, 3333, 4444"
) 

#Second dataset
mydata_q3 <- tibble(
  leader = 1,
  reports_q3 = "2222, 3333, 4444, 55555, 66666" 
) 

#Function to count number of new employees
calculate_number_new_emps <- function(reports_time1, reports_time2) {
  time_1_reports <- ifelse(is.na(reports_time1), character(0), str_split(reports_time1, " ,\\s*")[[1]])
  time_2_reports <- str_split(reports_time2, " ,\\s*")[[1]]
  num_new_employees <- length(setdiff(time_1_reports, time_2_reports))
  num_new_employees
}

#Join data and count number of new staff--get wrong answer
mydata_q2 %>%
  left_join(mydata_q3) %>%
  mutate(new_staff_count = calculate_number_new_emps(reports_q2, reports_q3))

EDIT:

The output that I want is for new_staff_count = 2 for this example.

That’s because there are 2 new employees (55555 and 66666) in q3 that weren’t in q2.

Solution:

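The separator pattern in your str_split call is not correct: " ,\\s*" requires a space before each comma, and your strings have none, so nothing is split and the whole string comes back as a single element. Just split on ", ". Then take the difference in length between the two vectors. Note that a length difference is the net change in headcount, so it equals the number of new employees only when nobody left between the two periods.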
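To see why the original pattern fails, here is a minimal sketch using base R's strsplit rather than stringr's str_split (an assumption made only to keep the example dependency-free; both treat this pattern the same way). The variable names are illustrative.

```r
ids <- "2222, 3333, 4444"

# Pattern from the question: a space, then a comma. The string never
# contains " ," so nothing is split and one element comes back.
bad <- strsplit(ids, " ,\\s*")[[1]]
length(bad)   # 1

# Comma followed by optional whitespace splits as intended.
good <- strsplit(ids, ",\\s*")[[1]]
good          # "2222" "3333" "4444"
```

The function below uses the simpler ", " separator, which is enough because the ids are always separated that way.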

calculate_number_new_emps <- function(reports_time1, reports_time2) {
  # Treat a missing first period as having no reports
  if (is.na(reports_time1)) {
    time_1_reports <- character(0)
  } else {
    time_1_reports <- str_split(reports_time1, ", ")[[1]]
  }
  time_2_reports <- str_split(reports_time2, ", ")[[1]]
  # Net change in headcount between the two periods
  num_new_employees <- length(time_2_reports) - length(time_1_reports)
  num_new_employees
}

# Join data and count number of new staff (new_staff_count is 2 for this example)
mydata_q2 %>%
  left_join(mydata_q3, by = "leader") %>%
  mutate(new_staff_count = calculate_number_new_emps(reports_q2, reports_q3))
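If employees can also leave between the two periods, or if the data has more than one leader, a set difference computed row by row may be a better fit. The sketch below is one possible approach in base R, not the answer's method. The names count_new_ids and split_ids are hypothetical, and it assumes both inputs are character vectors of equal length. It is needed for multi-row data because the if() in the function above expects a single value.

```r
# Hypothetical helper: for each row, count ids that appear in the later
# string but not in the earlier one.
count_new_ids <- function(reports_time1, reports_time2) {
  split_ids <- function(x) {
    # Treat a missing value as an empty set of reports
    if (is.na(x)) character(0) else strsplit(x, ", ", fixed = TRUE)[[1]]
  }
  # Walk both vectors in parallel so the function works inside mutate()
  vapply(
    seq_along(reports_time1),
    function(i) {
      length(setdiff(split_ids(reports_time2[i]), split_ids(reports_time1[i])))
    },
    integer(1)
  )
}

count_new_ids("2222, 3333, 4444", "2222, 3333, 4444, 55555, 66666")  # 2
count_new_ids("2222, 3333", "3333, 4444")                            # 1
count_new_ids(NA, "2222, 3333")                                      # 2
```

In the second call the headcount is unchanged, so a length difference would report 0, while the set difference still finds the one new id. Because setdiff compares values rather than positions, the order of the ids does not matter.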