suppose I have following data:
df1<- data.frame(province= c(1,1,2,3,3,3,4,4,4,4,4,5,5,5),year= c(2001,2001,2001,2001,2001,2001,2002,2002,2003,2003,2003,2004,2005,2005),
residence= c(1,1,1,2,2,2,1,1,1,2,2,2,2,2),marriage= c(1,2,2,1,2,1,1,1,2,1,1,1,2,1),count=c(4,1,3,5,3,2,2,3,2,1,2,4,2,5))
in my data marriage = 1 is ever-married and marriage = 2 is never-married. the proportion of ever-married can be estimated by column count: ever-married / ever-married + never-married
what I want is estimating the proportion of ever-married based on columns province, year and residence and two conditions:
1- if there is no ever-married based on three columns, the proportion would be 0
2- if there is no never-married based on three columns, the proportion would be 100.
my expected output would be like this:
province year residence sub
1 2001 1 0.80
2 2001 1 0.00
3 2001 2 0.70
4 2002 1 100.00
4 2003 1 0.00
4 2003 2 100.00
5 2004 2 100.00
5 2005 2 0.71
thank you in advance.
>Solution :
We group by ‘province’, ‘year’, ‘residence’, create a condition based on if/else when ‘marriage’ values 1, 2 are not present, then return 0, 100 respectively or else get the ‘count’ values that correspond to ‘marriage’ value of 1, divide by the sum of ‘count’ and then sum the proportions
library(dplyr)
df1 %>%
group_by(province, year, residence) %>%
summarise(sub = if(!any(marriage == 1)) 0
else if(!any(marriage == 2)) 100 else
sum(count[marriage == 1]/sum(count)), .groups = 'drop')
-output
# A tibble: 8 × 4
province year residence sub
<dbl> <dbl> <dbl> <dbl>
1 1 2001 1 0.8
2 2 2001 1 0
3 3 2001 2 0.7
4 4 2002 1 100
5 4 2003 1 0
6 4 2003 2 100
7 5 2004 2 100
8 5 2005 2 0.714