rbind 2 data frames by the elements in 2 columns, avoiding nested loops

I have two data frames like the following:

df1 <- data.frame(Marker1=c('+','+','+','-','-'), Marker2=c('+','+','+','+','-'), Marker3=c('+','-','+','-','+'),
                  Sample=c(1,1,2,3,3), Population_ID=c(1,2,1,5,6), Cells_in_Sample=c(443,23,567,98,3))
df2 <- data.frame(Population_ID=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,5,5,5,5,5,5,5,6,6,6,6,6,6,6),
                  Marker1=c('+','+','+','+',NA,NA,NA,'+','+','+','+',NA,NA,NA,'-','-','-','-',NA,NA,NA,'-','-','-','-',NA,NA,NA),
                  Marker2=c('+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'-','-',NA,NA,'-','-',NA),
                  Marker3=c('+',NA,'+',NA,'+',NA,'+','-',NA,'-',NA,'-',NA,'-','-',NA,'-',NA,'-',NA,'-','+',NA,'+',NA,'+',NA,'+'))

They look like this:

> df1
  Marker1 Marker2 Marker3 Sample Population_ID Cells_in_Sample
1       +       +       +      1             1             443
2       +       +       -      1             2              23
3       +       +       +      2             1             567
4       -       +       -      3             5              98
5       -       -       +      3             6               3
> head(df2)
  Population_ID Marker1 Marker2 Marker3
1             1       +       +       +
2             1       +       +    <NA>
3             1       +    <NA>       +
4             1       +    <NA>    <NA>
5             1    <NA>       +       +
6             1    <NA>       +    <NA>

df1 contains my "base" populations with combinations of 3 markers (all 3 present), plus the counts (Cells_in_Sample) of each population per Sample.

df2 takes the unique combinations of 3 markers and makes all the possible combinations of 1 and 2 out of them. Note that df2 already includes the "base" populations from df1.
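For readers who want to reproduce this step, the following is a minimal sketch of one way such a table could be derived from df1. It is an assumption about the construction, not the code originally used to build df2: the names markers, base, masks and df2_sketch are hypothetical, and the row order within each population differs from the df2 shown above.

```r
# Base table copied from the question
df1 <- data.frame(Marker1=c('+','+','+','-','-'), Marker2=c('+','+','+','+','-'), Marker3=c('+','-','+','-','+'),
                  Sample=c(1,1,2,3,3), Population_ID=c(1,2,1,5,6), Cells_in_Sample=c(443,23,567,98,3))

markers <- c("Marker1", "Marker2", "Marker3")   # hypothetical helper vector

# one row per distinct "base" population
base <- unique(df1[c("Population_ID", markers)])

# every keep/drop pattern over the markers, except "drop all of them"
masks <- expand.grid(rep(list(c(TRUE, FALSE)), length(markers)))
masks <- masks[rowSums(masks) > 0, ]

# repeat each base row once per pattern and blank out the dropped markers
df2_sketch <- do.call(rbind, lapply(seq_len(nrow(base)), function(i) {
  out <- base[rep(i, nrow(masks)), ]
  for (j in seq_along(markers)) {
    out[[markers[j]]][!masks[[j]]] <- NA
  }
  out
}))
row.names(df2_sketch) <- NULL

nrow(df2_sketch)   # 4 populations x 7 patterns = 28 rows
```

With 3 markers there are 2^3 - 1 = 7 non-empty patterns per population, which matches the 7 rows per Population_ID in df2.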

What I want to do here is just produce a final_df combining both, in an efficient and elegant way, avoiding nested loops if possible.

The final_df should preserve the Sample and counts values for each "base" 3-marker combination in df1, extended to all the "sub-combinations" in df2. Thus, I should rbind them by Sample and Population_ID.

Now I managed to do this, using nested for loops, but I am wondering whether there is a better solution.

This is what I have done:

final_df <- NULL
for (s in unique(df1$Sample)){
  df1_sub <- subset(df1, Sample==s)
  for (p in df1_sub$Population_ID){
    df1_sub_sub <- subset(df1_sub, Population_ID==p)
    df2_sub <- subset(df2, Population_ID==p)
    df2_sub$Sample <- s
    df2_sub$Cells_in_Sample <- df1_sub_sub$Cells_in_Sample
    df2_sub <- df2_sub[,c(2,3,4,5,1,6)]
    #note there is no need to rbind df1_sub_sub and df2_sub
    #cause df2 already contains the populations from df1
    final_df <- rbind(final_df, df2_sub)
  }
}

final_df looks exactly the way I want. I paste it in full below for reference:

> final_df
    Marker1 Marker2 Marker3 Sample Population_ID Cells_in_Sample
1         +       +       +      1             1             443
2         +       +    <NA>      1             1             443
3         +    <NA>       +      1             1             443
4         +    <NA>    <NA>      1             1             443
5      <NA>       +       +      1             1             443
6      <NA>       +    <NA>      1             1             443
7      <NA>    <NA>       +      1             1             443
8         +       +       -      1             2              23
9         +       +    <NA>      1             2              23
10        +    <NA>       -      1             2              23
11        +    <NA>    <NA>      1             2              23
12     <NA>       +       -      1             2              23
13     <NA>       +    <NA>      1             2              23
14     <NA>    <NA>       -      1             2              23
15        +       +       +      2             1             567
16        +       +    <NA>      2             1             567
17        +    <NA>       +      2             1             567
18        +    <NA>    <NA>      2             1             567
19     <NA>       +       +      2             1             567
20     <NA>       +    <NA>      2             1             567
21     <NA>    <NA>       +      2             1             567
151       -       +       -      3             5              98
161       -       +    <NA>      3             5              98
171       -    <NA>       -      3             5              98
181       -    <NA>    <NA>      3             5              98
191    <NA>       +       -      3             5              98
201    <NA>       +    <NA>      3             5              98
211    <NA>    <NA>       -      3             5              98
22        -       -       +      3             6               3
23        -       -    <NA>      3             6               3
24        -    <NA>       +      3             6               3
25        -    <NA>    <NA>      3             6               3
26     <NA>       -       +      3             6               3
27     <NA>       -    <NA>      3             6               3
28     <NA>    <NA>       +      3             6               3

Is there a straightforward and efficient way to do this without nested loops? My actual data is many times bigger than this.

Thanks!

Solution:

You don’t need loops at all; this is a case for merge().
In the code below I make a copy of final_df just for test purposes, to keep the expected result as it was.

df3 <- final_df
row.names(df3) <- NULL

# join removing the marker columns from df1
mrg <- merge(df2, df1[4:6], by = "Population_ID")[c(2:5, 1, 6)]
mrg <- mrg[order(mrg$Sample), ]
row.names(mrg) <- NULL

identical(df3, mrg)
#> [1] TRUE

Created on 2023-09-29 with reprex v2.0.2
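The merge above picks columns by position (df1[4:6] and c(2:5, 1, 6)), which silently breaks if the column order of the real data differs from the toy example. Below is a hedged variant of the same merge that selects columns by name instead. The names counts and mrg2 are hypothetical, and sorting by Sample and then Population_ID is an assumption about the desired row order.

```r
# Toy inputs copied from the question
df1 <- data.frame(Marker1=c('+','+','+','-','-'), Marker2=c('+','+','+','+','-'), Marker3=c('+','-','+','-','+'),
                  Sample=c(1,1,2,3,3), Population_ID=c(1,2,1,5,6), Cells_in_Sample=c(443,23,567,98,3))
df2 <- data.frame(Population_ID=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,5,5,5,5,5,5,5,6,6,6,6,6,6,6),
                  Marker1=c('+','+','+','+',NA,NA,NA,'+','+','+','+',NA,NA,NA,'-','-','-','-',NA,NA,NA,'-','-','-','-',NA,NA,NA),
                  Marker2=c('+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'-','-',NA,NA,'-','-',NA),
                  Marker3=c('+',NA,'+',NA,'+',NA,'+','-',NA,'-',NA,'-',NA,'-','-',NA,'-',NA,'-',NA,'-','+',NA,'+',NA,'+',NA,'+'))

# keep only the join key and the per-sample columns from df1, by name
counts <- df1[c("Population_ID", "Sample", "Cells_in_Sample")]

# each df2 row is repeated once per sample in which its population occurs
mrg2 <- merge(df2, counts, by = "Population_ID")

# restore df1's column order, then sort by sample and population
mrg2 <- mrg2[names(df1)]
mrg2 <- mrg2[order(mrg2$Sample, mrg2$Population_ID), ]
row.names(mrg2) <- NULL

nrow(mrg2)   # 5 (Sample, Population_ID) pairs x 7 sub-combinations = 35 rows
```

Because merge() matches on Population_ID only, a population that appears in several samples (here population 1, in samples 1 and 2) gets its df2 rows duplicated once per sample, which is exactly what the nested loops did.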
