I am doing survival analysis in R and need to repeat rows until a new value is seen.
Here is my data frame:
df <- data.frame(
  province  = c(10,10,10,10,10,10,10,10,12,12,12,12,12,12,12,12),
  year      = c(2000,2000,2001,2001,2001,2002,2002,2002,2000,2000,2000,2001,2001,2002,2002,2002),
  residence = c(1,1,1,1,2,1,1,2,1,2,1,1,2,1,2,1),
  edu       = c(1,2,1,2,3,1,2,3,2,1,3,2,1,2,1,3),
  pro       = c(0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0))
What I want is to repeat rows, grouped by province, residence and edu, until pro reaches 1. For groups in which pro never reaches 1, the rows should be repeated for all years (in my case 2000 to 2002). It seems I could do this with a while loop, but I do not know how.
My expected output would look like this:
province residence edu pro year
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 1 1 0 2000
2 10 1 1 0 2001
3 10 1 1 0 2002
4 10 1 2 0 2000
5 10 1 2 0 2001
6 10 1 2 1 2002
7 10 2 3 1 2001
8 12 1 2 1 2000
9 12 2 1 0 2000
10 12 2 1 0 2001
11 12 2 1 1 2002
12 12 1 3 0 2000
13 12 1 3 0 2001
14 12 1 3 0 2002
Thank you in advance.
>Solution :
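Perhaps I’m misinterpreting. If your first frame with 16 rows is truly the original data, and you’re trying to get to the second frame with 14 rows, then this method works. It adds a row for any year missing within a group, carries pro forward into the added rows, and then drops every row that comes after the first pro == 1 in each group.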
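library(dplyr)

df %>%
  select(-pro) %>%
  group_by(province, residence, edu) %>%
  # years missing between each group's first and last observed year
  summarize(year = setdiff(min(year):max(year), year)) %>%
  bind_rows(df) %>%
  arrange(province, residence, edu, year) %>%
  # carry the previous pro value into the added rows
  tidyr::fill(pro) %>%
  # drop every row after the first pro == 1 in each group
  filter(!cumany(lag(pro == 1, default = FALSE))) %>%
  ungroup()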
# # A tibble: 14 x 5
# province residence edu year pro
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 10 1 1 2000 0
# 2 10 1 1 2001 0
# 3 10 1 1 2002 0
# 4 10 1 2 2000 0
# 5 10 1 2 2001 0
# 6 10 1 2 2002 1
# 7 10 2 3 2001 1
# 8 12 1 2 2000 1
# 9 12 1 3 2000 0
# 10 12 1 3 2001 0
# 11 12 1 3 2002 0
# 12 12 2 1 2000 0
# 13 12 2 1 2001 0
# 14 12 2 1 2002 1
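The filter() step is the part that stops each group at its first pro == 1. As a rough illustration, here it is applied to a single group's pro vector; keep_until_event is a hypothetical helper name used only for this sketch, not something from dplyr.

```r
library(dplyr)

# Hypothetical helper: TRUE for every row up to and including the
# first pro == 1, FALSE for every row after it.
keep_until_event <- function(pro) {
  # lag() shifts the event flag down one row, so the event row itself is kept;
  # cumany() then stays TRUE for all later rows, and ! inverts that.
  !cumany(lag(pro == 1, default = FALSE))
}

keep_until_event(c(0, 0, 1))  # TRUE TRUE TRUE
keep_until_event(c(1, 0, 1))  # TRUE FALSE FALSE
keep_until_event(c(0, 0, 0))  # TRUE TRUE TRUE
```

Because the pipeline is still grouped by province, residence and edu at that point, filter() evaluates this expression separately for each group.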
Data
df <- structure(list(province = c(10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12), year = c(2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2002, 2002, 2002), residence = c(1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1), edu = c(1, 2, 1, 2, 3, 1, 2, 3, 2, 1, 3, 2, 1, 2, 1, 3), pro = c(0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0)), class = "data.frame", row.names = c(NA, -16L))
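As far as I know, dplyr 1.1.0 and later warn when summarize() returns anything other than one row per group, so the setdiff() step may produce a deprecation warning there. A possible variant, sketched under the assumption that tidyr is installed, fills the missing years with tidyr::complete() and tidyr::full_seq() instead. The result name res and the explicit re-grouping are choices made for this sketch, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  province  = c(10,10,10,10,10,10,10,10,12,12,12,12,12,12,12,12),
  year      = c(2000,2000,2001,2001,2001,2002,2002,2002,2000,2000,2000,2001,2001,2002,2002,2002),
  residence = c(1,1,1,1,2,1,1,2,1,2,1,1,2,1,2,1),
  edu       = c(1,2,1,2,3,1,2,3,2,1,3,2,1,2,1,3),
  pro       = c(0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0)
)

res <- df %>%
  group_by(province, residence, edu) %>%
  # add a row (pro = NA) for each year missing between the group's
  # first and last observed year
  complete(year = full_seq(year, 1)) %>%
  # re-assert the grouping and ordering explicitly before filling
  group_by(province, residence, edu) %>%
  arrange(province, residence, edu, year) %>%
  # carry the previous pro value into the added rows
  fill(pro) %>%
  # drop every row after the first pro == 1 in each group
  filter(!cumany(lag(pro == 1, default = FALSE))) %>%
  ungroup()

nrow(res)  # 14
```

Like the original pipeline, this only fills gaps between the first and last year observed in each group; it does not extend a group to years outside that range.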