Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Backfilling Rows Based on Max Conditions in R

I have a dataset that looks like this in R:

name = c("john", "john", "john", "alex", "alex", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2015, 2014, 2016, 2010, 2011, 2012, 2013)
age = c(21, 21, 21, 55, 55, 61, 61, 61, 61)

problem_data = data.frame(name, year, age)

   name year age
1  john 2010  21
2  john 2011  21
3  john 2015  21
4  alex 2014  55
5  alex 2016  55
6 peter 2010  61
7 peter 2011  61
8 peter 2012  61
9 peter 2013  61

In this dataset, the age of each person at the last recorded year has been erroneously inserted at each row. For example – in reality:

  • Peter was 61 in 2013
  • Peter was 60 in 2012
  • Peter was 59 in 2011
  • Peter was 58 in 2010

Sometimes years are missing – as a result:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  • Alex was 55 in 2016
  • Alex was 53 in 2014

I am trying to research a way in R that can handle such a task. I have been trying to combine "cumulative group differences" and "max row conditions" – but I am not sure how these concepts can be combined together to achieve this:

# https://stackoverflow.com/questions/39237345/subtract-value-in-previous-row-for-each-section-of-a-data-frame-that-corresponds

library(dplyr)

new_data = problem_data %>% 
    group_by(name) %>% 
    mutate(real_age= age - lag(age, default = age[1]))

Bur this approach has made everyone’s age as 0!

# A tibble: 9 x 4
# Groups:   name [3]
  name   year   age real_age
  <chr> <dbl> <dbl>    <dbl>
1 john   2010    21        0
2 john   2011    21        0
3 john   2015    21        0
4 alex   2014    55        0
5 alex   2016    55        0
6 peter  2010    61        0
7 peter  2011    61        0
8 peter  2012    61        0
9 peter  2013    61        0

Can someone please show me how to fix this problem?

Thank you!

>Solution :

Group by ‘name’, subtract the ‘age’ from the sequence of group index

library(dplyr)
library(tidyr)
problem_data %>% 
    group_by(name) %>% 
    complete(year = full_seq(year, period = 1)) %>% 
    fill(year, age, .direction = "downup") %>%
    mutate(real_age= age - (row_number() - 1)) %>%
   ungroup
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading