Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

R, use newer data to update list

This question is very similar to this question that I created previously which has an answer, however I’ve come to realize the problem I’m trying to solve has evolved and I figured I should start fresh.

I have two data frames like so:

df1<-structure(list(protocol_no = c("study1", "study2", "study3", 
"study4", "study5", "study6", "study7"), status = c("New", "Open", 
"Closed", "New", "PI signoff", "Closed", "Open")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

df2<-structure(list(record_id = c(11, 12, 13, 14, 15, 16), protocol_no = c("study1", 
"study2", "study3", "study4", "study5", "study6"), status = c("New", 
"Closed", "Closed", "New", "PI signoff", "Closed"), form_1_complete = c(0, 
0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")

They pretty much reference the same data, but df1 will always be newer and have more rows, whereas df2 is older and has more columns. Also, they will have 20,000+ rows in real life.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I need to update df2 with the new information from df1, this might mean new rows that will be need to be numbered (the record_id column), and it might mean updating the "status" column if its changed.

For instance in this example, the row for study7 is new and needs to be added and given record_id = 17 (because 16 is where that list left off). Additionally the status of study2 changed from Closed to Open (its ‘open’ in df1) so that needs to be changed.

Things that wouldn’t work:

In the previous solution it used binding rows and distinct, but in this scenario since study2 has changed and needs to be updated, that would bind two copies of study2 and have trouble distinguishing which to get rid of.

Output I’m looking for:

A dataframe with all 4 columns, with record_id for everything, one row per protocol (‘protocol_no’), and any status’s that have changed updated to reflect df1. Like so:

enter image description here

>Solution :

Here, a join would be enough

library(data.table)
setDT(df2)[as.data.table(df1), status := i.status, on = .(protocol_no)]

Or use rows_upsert and use the same code in the other post

library(dplyr)
library(tidyr)
rows_upsert(df2, df1) %>%
  fill(record_id) %>% 
  mutate(record_id = record_id +  (rowid(record_id) - 1))

-output

  record_id protocol_no     status form_1_complete
1        11      study1        New               0
2        12      study2       Open               0
3        13      study3     Closed               0
4        14      study4        New               0
5        15      study5 PI signoff               0
6        16      study6     Closed               0
7        17      study7       Open              NA
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading