In R, how do you find the number of different values between 2 character strings?

I’m trying to count the new employees a manager gained between time 1 and time 2. For each time, I have a string of all employee ids that roll up under that manager.

The code below always says there is 1 new employee, but as you can see, there are 2. How do I find out how many new employees there are? The ids aren’t guaranteed to always be in the same order, but they will always be separated by ", ".

library(dplyr)
library(stringr)

#First data set
mydata_q2 <- tibble(
  leader = 1,
  reports_q2 = "2222, 3333, 4444"
) 

#Second dataset
mydata_q3 <- tibble(
  leader = 1,
  reports_q3 = "2222, 3333, 4444, 55555, 66666" 
) 

#Function to count number of new employees
calculate_number_new_emps <- function(reports_time1, reports_time2) {
  time_1_reports <- ifelse(is.na(reports_time1), character(0), str_split(reports_time1, " ,\\s*")[[1]])
  time_2_reports <- str_split(reports_time2, " ,\\s*")[[1]]
  num_new_employees <- length(setdiff(time_1_reports, time_2_reports))
  num_new_employees
}

#Join data and count number of new staff--get wrong answer
mydata_q2 %>%
  left_join(mydata_q3) %>%
  mutate(new_staff_count = calculate_number_new_emps(reports_q2, reports_q3))

EDIT:

The output I want for this example is new_staff_count = 2.

That’s because there are 2 new employees (55555 and 66666) in q3 that weren’t in q2.

>Solution :

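The pattern in your str_split call is not correct: " ,\\s*" expects a space before the comma, so it never matches and each string comes back unsplit. Just split on ", ". Also use if/else rather than ifelse(), because ifelse() returns a result the same length as its test (here 1) and would keep only the first id. Then take the difference in length between the two vectors, which equals the number of new employees as long as nobody left in between.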

calculate_number_new_emps <- function(reports_time1, reports_time2) {
   # A missing value at time 1 means the leader had no reports yet
   if (is.na(reports_time1)) {
      time_1_reports <- character(0)
   } else {
      time_1_reports <- str_split(reports_time1, ", ")[[1]]
   }

   time_2_reports <- str_split(reports_time2, ", ")[[1]]
   # Net change in the number of reports between the two periods
   num_new_employees <- length(time_2_reports) - length(time_1_reports)
   num_new_employees
}

#Join data and count number of new staff--now returns 2
mydata_q2 %>%
   left_join(mydata_q3, by = "leader") %>%
   mutate(new_staff_count = calculate_number_new_emps(reports_q2, reports_q3))
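
If reports can also leave between the two periods, or if the data has more than one leader, a set difference applied row by row may be safer than a length difference. The following is only a sketch and is not part of the original answer. The helper name count_new_reports is hypothetical, it uses base R's strsplit() instead of stringr so that it has no dependencies, and it assumes the ids are always separated by ", " as stated in the question.

```r
# Hypothetical helper (not from the original post): counts ids that are
# present at time 2 but absent at time 1, returning one count per row.
count_new_reports <- function(reports_time1, reports_time2) {
  # Split one comma-separated string into ids; NA is treated as "no reports".
  split_ids <- function(x) {
    if (is.na(x)) character(0) else strsplit(x, ", ", fixed = TRUE)[[1]]
  }
  # mapply walks both vectors in parallel, so the function also works
  # inside mutate() on data with several rows.
  mapply(
    function(t1, t2) length(setdiff(split_ids(t2), split_ids(t1))),
    reports_time1, reports_time2,
    USE.NAMES = FALSE
  )
}

count_new_reports("2222, 3333, 4444", "2222, 3333, 4444, 55555, 66666")
# 2
count_new_reports("2222, 3333", "3333, 7777")
# 1 (a length difference would give 0 here)
```

It should drop into the same pipeline, for example mutate(new_staff_count = count_new_reports(reports_q2, reports_q3)). Because setdiff() compares values rather than positions, the order of the ids within each string does not matter.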