Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Subset data table by all unique entries in columns whose name contain a certain substring, fill with NA for other entries

I have a data table and I’d like to copy or subset to all the unique observations in variables whose variable name contains a certain substring. I can conceive of a way to do it clunkily with for loops, but that seems to defeat the utility and speed of a data table.

For example:

library(data.table)
library(dplyr)
dt <- setDT(tibble(a_ID = c(1, 1, 2, 11, 2),
                   b = c('x','y','z','z', 'x'),
                   b_ID = c('XY','XY','XY','XY', 'XY'),
                   d = 1:5))

What I would like to do from here is subset by all columns whose name contains ‘ID’, returning only the unique entries from each column, and since that will result in columns with differing amounts of observations for each, fill with NA for the rest.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

That is, I’d like to return the following:

subset_dt <- setDT(tibble(a_ID = c(1,2,11),
                          b_ID = c('XY', NA,NA)))

Is this possible with data.table functionality?

>Solution :

We may get the unique elements and then replace the duplicated with NA

library(data.table)
dt[, lapply(.SD, unique), .SDcols = patterns("_ID$")][,
    lapply(.SD, \(x) replace(x, duplicated(x), NA))]

-output

    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

Or another option with unique

unique(dt[, .(a_ID, b_ID)])[, lapply(.SD, \(x) fcase(!duplicated(x), x))]
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>

Or another option is to block the code, check for the lengths after the unique step and append NA to fix the length

dt[, {lst1 <- lapply(.SD, unique)
     mx <- max(lengths(lst1))
    lapply(lst1, `length<-`, mx)}, .SDcols = patterns("_ID$")]
    a_ID   b_ID
   <num> <char>
1:     1     XY
2:     2   <NA>
3:    11   <NA>
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading