Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Perform row-wise operation in datatable with multiple elements

I have the following data table:

library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))

Now the problem I have is that I need to take every value in column "x" (that is, each given ID), and add all the existing values in column "y" to it. I also need to do it separately per column "group". Let’s assume we start with ID = 1. This element has the value: x_1 = 68, and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 + y_1, x_1 + y2, x_1 + y_3, etc. But not only for x_1, but also for x_2, x_3, etc. So for x_2 it would look like: x_2 + y_1, x_2 + y_2, x_2 + y_3, etc. This should also be done separately per column "group" (in this regard the dataset should simple be split by group).

Edit: Exemplary code to do this only for X_1 and group A:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X + y]
DT

One apparent issue with this approach is that if it were applied to all x, then a lot of columns would be added to the final DT. So I am not sure if it is the best approach. In the end, I am just looking for the lowest sum (per element x) with each element y, and per group.

I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.

Thanks for any hints how to approach this.

>Solution :

You can do this:

DT[, .(.BY$x+DT[group==.BY$group,y]), by=.(x,group)]

This returns N rows per x, where N is the size of x’s group. We leverage the special (.BY), which is available in j when utilizing by. Basically, .BY is a named list, containing the values of the grouping variables. Here, I’m adding the value of x (.BY$x) to the vector of y values from the subset of DT where the group is equal to the current group value (.BY$group)

Output:

          x  group    V1
      <int> <char> <int>
   1:    68      A   144
   2:    68      A   107
   3:    68      A    92
   4:    68      A   121
   5:    68      A   160
  ---                   
4996:     4      B    25
4997:     4      B    66
4998:     4      B    83
4999:     4      B    27
5000:     4      B    68

You can also accomplish this via a join:

DT[,!c("y")][DT[, .(y,group)], on=.(group), allow.cartesian=T][, total:=x+y][order(ind)]

Output:

        ind     x  group     y total
      <int> <int> <char> <int> <int>
   1:     1    68      A    76   144
   2:     1    68      A    39   107
   3:     1    68      A    24    92
   4:     1    68      A    53   121
   5:     1    68      A    92   160
  ---                               
4996:   100     4      B    21    25
4997:   100     4      B    62    66
4998:   100     4      B    79    83
4999:   100     4      B    23    27
5000:   100     4      B    64    68
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading