Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Map values based on distribution from one dataset to another in R (data.table)

How can I map values from one data.table to another, based on their distribution?

I created some sample data below.
In my small_data, I have an overview with dates and how often they occur (expressed in count [N] and percentage [perc]). Next to that, I have a big_data table, which untill now only consists of ID values.

My goal is to map the distribution of dates from the small_data to big_data, such that 2.3% of my IDs will get date 2017-04-01, 2.3% will get date 2019-01-01 and 2.6% will get date 2019-01-01 etc…
The percentages do not have to be this exact, if it’s roughly the same, that is fine.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

set.seed(123)

#create small data (source)
small_data <- data.table(
  date = sample(seq(as.Date('2016-01-01'), as.Date('2019-12-31'), by="quarter"), replace=TRUE, size = 20),
  N = sample(c(1:30), replace=TRUE, size=20)
)[order(N)]

#calculate percentage
small_data[, perc := 100* N/small_data[, sum(N)]]

> small_data
          date  N     perc
 1: 2017-04-01  8 2.318841
 2: 2019-01-01  8 2.318841
 3: 2019-01-01  9 2.608696
 4: 2018-01-01  9 2.608696
 5: 2016-07-01 10 2.898551
 6: 2017-10-01 12 3.478261
 7: 2018-01-01 13 3.768116
 8: 2017-01-01 14 4.057971
 9: 2018-10-01 14 4.057971
10: 2019-07-01 14 4.057971
11: 2016-10-01 14 4.057971
12: 2019-04-01 15 4.347826
13: 2016-10-01 20 5.797101
14: 2019-07-01 22 6.376812
15: 2018-10-01 24 6.956522
16: 2017-07-01 27 7.826087
17: 2016-01-01 27 7.826087
18: 2017-10-01 27 7.826087
19: 2019-10-01 29 8.405797
20: 2016-10-01 29 8.405797


#create big data (target)
big_data <-  data.table(
  ID = c(1:100)
)

> head(big_data)
   ID
1:  1
2:  2
3:  3
4:  4
5:  5
6:  6

>Solution :

You can do this quite simply with sample, as @Roland has pointed out in his comment. Here is a reproducible example using your data:

library(data.table)
## Data set up ----------------------------------------------------------------
set.seed(123)
small_data <- data.table(
  date = sample(seq(as.Date('2016-01-01'), as.Date('2019-12-31'), by="quarter"), replace=TRUE, size = 20),
  N = sample(c(1:30), replace=TRUE, size=20)
)[order(N)]
small_data[, perc := 100* N/small_data[, sum(N)]]

#create big data (target)
big_data <-  data.table(
  ID = c(1:100)
)

## Generate dates for big_data based on % occurence in small_data -------------
n_dates <- nrow(big_data)

### Use small_dates[["perc"]] as the probability vector in sample()
sampled_dates <- 
  sample(small_data[["date"]], size = n_dates,
         prob = small_data[["perc"]], replace = TRUE)

big_data[["date"]] <- sampled_dates
head(big_data)
#>    ID       date
#> 1:  1 2016-07-01
#> 2:  2 2019-07-01
#> 3:  3 2019-07-01
#> 4:  4 2019-07-01
#> 5:  5 2016-07-01
#> 6:  6 2016-07-01

Created on 2022-12-14 with reprex v2.0.2

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading