Backfilling Rows Based on Max Conditions in R

November 8, 2022

I have a dataset that looks like this in R:

name = c("john", "john", "john", "alex", "alex", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2015, 2014, 2016, 2010, 2011, 2012, 2013)
age = c(21, 21, 21, 55, 55, 61, 61, 61, 61)

problem_data = data.frame(name, year, age)

   name year age
1  john 2010  21
2  john 2011  21
3  john 2015  21
4  alex 2014  55
5  alex 2016  55
6 peter 2010  61
7 peter 2011  61
8 peter 2012  61
9 peter 2013  61

In this dataset, the age of each person at the last recorded year has been erroneously inserted at each row. For example – in reality:

Peter was 61 in 2013
Peter was 60 in 2012
Peter was 59 in 2011
Peter was 58 in 2010

Sometimes years are missing – as a result:

Alex was 55 in 2016
Alex was 53 in 2014

I am trying to research a way in R that can handle such a task. I have been trying to combine "cumulative group differences" and "max row conditions" – but I am not sure how these concepts can be combined together to achieve this:

# https://stackoverflow.com/questions/39237345/subtract-value-in-previous-row-for-each-section-of-a-data-frame-that-corresponds

library(dplyr)

new_data = problem_data %>% 
    group_by(name) %>% 
    mutate(real_age= age - lag(age, default = age[1]))

Bur this approach has made everyone’s age as 0!

# A tibble: 9 x 4
# Groups:   name [3]
  name   year   age real_age
  <chr> <dbl> <dbl>    <dbl>
1 john   2010    21        0
2 john   2011    21        0
3 john   2015    21        0
4 alex   2014    55        0
5 alex   2016    55        0
6 peter  2010    61        0
7 peter  2011    61        0
8 peter  2012    61        0
9 peter  2013    61        0

Can someone please show me how to fix this problem?

Thank you!

>Solution :

Group by ‘name’, subtract the ‘age’ from the sequence of group index

library(dplyr)
library(tidyr)
problem_data %>% 
    group_by(name) %>% 
    complete(year = full_seq(year, period = 1)) %>% 
    fill(year, age, .direction = "downup") %>%
    mutate(real_age= age - (row_number() - 1)) %>%
   ungroup