Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

De-aggregate a data frame

There have been many similar questions (e.g. Repeat each row of data.frame the number of times specified in a column, De-aggregate / reverse-summarise / expand a dataset in R, Repeating rows of data.frame in dplyr), but my data set is of a different structure than the answers to these questions assume.

I have a data frame with the frequencies of measurements within each group and the total number of observations for each outcome per group total_N:

tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3))
# A tibble: 2 x 4
  group total_N outcome_A outcome_B
  <chr>   <dbl>     <dbl>     <dbl>
1 A           4         1         2
2 B           5         4         3

I want to de-aggregate the data, so that the data frame has as many rows as total observations and each outcome has a 1 for all observations with the outcome and a 0 for all observations without the outcome. Thus the final result should be a data frame like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

# A tibble: 9 x 3
  group outcome_A outcome_B
  <chr>     <dbl>     <dbl>
1 A             1         1
2 A             0         1
3 A             0         0
4 A             0         0
5 B             1         1
6 B             1         1
7 B             1         1
8 B             1         0
9 B             0         0

As the aggregated data does not contain any information about the frequency of combinations (i.e., the correlation) of outcome_A and outcome_B, this can be ignored.

>Solution :

Here’s a tidyverse solution.

As you say, it’s easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it’s easy to convert grouped counts to presence/absence flags. across gives you a way to succinctly convert multiple count columns.

library(tidyverse)

tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3)) %>% 
  uncount(total_N) %>% 
  group_by(group) %>% 
  mutate(
    across(
      starts_with("measure"), 
      function(x) as.numeric(row_number() <= x)
    )
  ) %>%
  ungroup()
# A tibble: 9 × 3
  group measure_A measure_B
  <chr>     <dbl>     <dbl>
1 A             1         1
2 A             0         1
3 A             0         0
4 A             0         0
5 B             1         1
6 B             1         1
7 B             1         1
8 B             1         0
9 B             0         0

As you say, this approach takes no account of correlations between the outcome columns, as this cannot be deduced from the grouped data.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading