Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Collapsing multiple observations based on specific parameters in R

I am quite new to R. I have a dataset with 8081 observations for 113 variables. The data was collected in 4 waves (panels), with some individuals being interviewed multiple times. They were sometimes asked the same questions, but some questions were only asked during one wave. Most answers were on a scale (e.g. how much do you agree) or a binary. Each individual has a unique numeric ID so I know which rows to collapse. My dependent variable was only surveyed in wave 4.

The data looks something like this:

df <- data.frame (ID  = c(1, 1, 2, 2, 2, 2, 3, 4, 4), PANEL = c(1, 4, 1, 2, 3, 4, 2, 3, 4),
AGE = c(68, 68, 52, 52, 52, 52, 43, 33, 33), Q4 = c(2, 2, 1, 1, 1, 1, 2, 2, 1),
Q4_1 = c(2, 2, 1, 1, 1, 1, 2, 2, 1), Q4_1 = c(2, NA, NA, 3, NA, 3, 2, 3, NA),
Q5 = c(10, 10, 8, 9, 8, 7, 6, 6, 5))

df

  ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1  1     1  68  2    2      2 10
2  1     4  68  2    2     NA 10
3  2     1  52  1    1     NA  8
4  2     2  52  1    1      3  9
5  2     3  52  1    1     NA  8
6  2     4  52  1    1      3  7
7  3     2  43  2    2      2  6
8  4     3  33  2    2      3  6
9  4     4  33  1    1     NA  5

etc...

Sometimes each individual’s answers are the same across waves, but not always. I do not need to know how the answers vary in time and the waves were relatively close in time. I just need a profile of each individual surveyed.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Ideally, for each individual I’d want to keep the answers given in wave 4 (if they took part in it), but substituting the NA answers with what they answered in previous waves. I was wondering if there is any way to do this without going through every individual one by one, given the amount of data. I’ll also have to remove data for individuals who did not take part in wave 4 at all.

If successful, the chunk of data above would end up looking something like this (+ I’ll remove the panel column later):

   ID  PANEL  AGE  Q4  Q4_1 Q5
1   1     4    68   2    2  10 
2   2     4    52   1    3   7 
3   4     4    33   1    3   5

etc...

I’ve been looking into dplyr’s summarise() function but it doesn’t seem like I can be that specific with what I need to merge and not merge. It wouldn’t be a problem if some of the answers were merged by getting an average of the individual’s responses across waves, but that would not work for the binary answers if the individual changed their mind in between waves.

>Solution :

You may be looking for tidyr::fill().. You can do the following on your provided example dataset:

df %>% group_by(ID) %>% fill(., starts_with("Q")) %>% filter(PANEL == 4)

Output:

  ID PANEL AGE Q4 Q4_1 Q4_1.1 Q5
1  1     4  68  2    2      2 10
2  2     4  52  1    1      3  7
3  4     4  33  1    1      3  5
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading