Select a given percentage of the dataset or the closest value [R]

November 7, 2022

I would like to extract percentiles of the Size distribution in my dataset for different groups (in this example, let's say 10% and 50%). The sizes are sorted in decreasing order. My idea was to filter on the cumulative percentage and select the value at the edge (e.g. if I want the 10th percentile of the biggest values, I filter the sizes by cumulative percentage and then take the minimal value).
Nevertheless, when I filter the cumulative percentage with the code df <- df %>% filter(., Cum <= 10) %>% map(~slice(.,which.min(Size))), I get "NA" for two of the sites. This is because those sites have no cumulative percentage at or below 10.

How should I proceed to select the smallest cumulative value if there is no value under 10%?

df <- list(structure(list(Size = c(42, 40, 40, 37, 36, 36, 35, 35, 
35, 34, 34, 34, 33, 33, 33, 31, 30, 29, 29, 27, 26, 23), SubStation = c("B", 
"B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B"
), Cum = c(4.54545454545455, 9.09090909090909, 13.6363636363636, 
18.1818181818182, 22.7272727272727, 27.2727272727273, 31.8181818181818, 
36.3636363636364, 40.9090909090909, 45.4545454545455, 50, 54.5454545454545, 
59.0909090909091, 63.6363636363636, 68.1818181818182, 72.7272727272727, 
77.2727272727273, 81.8181818181818, 86.3636363636364, 90.9090909090909, 
95.4545454545455, 100)), row.names = c(NA, -22L), class = c("tbl_df", 
"tbl", "data.frame")), structure(list(Size = c(43, 42, 36, 36, 
35, 35, 34, 34, 34, 33, 31, 31, 30, 30, 28, 27, 27, 27, 25, 25, 
25, 25, 24, 23), SubStation = c("M1", "M1", 
"M1", "M1", "M1", "M1", 
"M1", "M1", "M1", "M1", 
"M1", "M1", "M1", "M1", 
"M1", "M1", "M1", "M1", 
"M1", "M1", "M1", "M1", 
"M1", "M1"), Cum = c(4.16666666666667, 8.33333333333333, 
12.5, 16.6666666666667, 20.8333333333333, 25, 29.1666666666667, 
33.3333333333333, 37.5, 41.6666666666667, 45.8333333333333, 50, 
54.1666666666667, 58.3333333333333, 62.5, 66.6666666666667, 70.8333333333333, 
75, 79.1666666666667, 83.3333333333333, 87.5, 91.6666666666667, 
95.8333333333333, 100)), row.names = c(NA, -24L), class = c("tbl_df", 
"tbl", "data.frame")), structure(list(Size = c(36, 34, 34, 32, 
32, 24), SubStation = c("M2", "M2", "M2", 
"M2", "M2", "M2"), Cum = c(16.6666666666667, 
33.3333333333333, 50, 66.6666666666667, 83.3333333333333, 100
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(Size = c(34, 33, 33, 28, 25, 24), SubStation = c("M3", 
"M3", "M3", "M3", "M3", 
"M3"), Cum = c(16.6666666666667, 33.3333333333333, 
50, 66.6666666666667, 83.3333333333333, 100)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame")))

>Solution :

Here is a solution using if and else inside filter(), which does the job. I computed both P10 and P50 and joined them into a single table (I use bind_rows() to go from the list of tibbles back to one data frame).

library(dplyr); library(purrr)
# Keep rows with Cum <= 10; if no Cum is below 10, keep the row with the lowest Cum instead
P10 <- df %>% map(~filter(., if(any(Cum < 10)) Cum <= 10 else row_number(Cum) <= 1)) %>%
  map(~slice(.,which.min(Size))) %>% bind_rows() %>% select(P10 = Size, SubStation)
# Same logic for 50%, then join with P10 to get one row per SubStation
P10P50 <- df %>% map(~filter(., if(any(Cum < 50)) Cum <= 50 else row_number(Cum) <= 1)) %>%
  map(~slice(.,which.min(Size))) %>% bind_rows() %>% select(P50 = Size, SubStation) %>%
  inner_join(P10, by = "SubStation", copy = FALSE) %>% relocate(SubStation, P10, P50)
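
Since the same filter-with-fallback step is repeated for each percentage, it could also be wrapped in a small helper. The following is only a sketch, not part of the answer above: size_at_percent is a hypothetical name, and the toy list uses made-up values with the same three columns as the question (Size, SubStation, Cum) so that it runs on its own. It assumes dplyr >= 1.0.0 for slice_min().

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not from the original answer): smallest Size among
# rows with Cum <= p, falling back to the row with the lowest Cum when no
# row qualifies.
size_at_percent <- function(d, p) {
  kept <- filter(d, Cum <= p)
  if (nrow(kept) == 0) {
    kept <- slice_min(d, Cum, n = 1, with_ties = FALSE)
  }
  slice_min(kept, Size, n = 1, with_ties = FALSE) %>%
    select(SubStation, Size)
}

# Toy data, illustrative values only, same columns as in the question
toy <- list(
  data.frame(Size = c(42, 40, 37), SubStation = "B",  Cum = c(5, 9, 14)),
  data.frame(Size = c(36, 34, 24), SubStation = "M2", Cum = c(16.7, 50, 100))
)

p10 <- map(toy, ~ size_at_percent(.x, 10)) %>% bind_rows() %>% rename(P10 = Size)
p50 <- map(toy, ~ size_at_percent(.x, 50)) %>% bind_rows() %>% rename(P50 = Size)

# One row per SubStation with both percentiles
result <- inner_join(p10, p50, by = "SubStation")
print(result)
```

With the list from the question, replacing toy by df should give the same kind of table as P10P50; adding another percentage is then one more map() call.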