Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

data.table, filter >= median per group and keep two lowest

Situation & Goal

I’m having a large table that looks like (simplified):

|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A       |Y      |    50|
|A       |Z      |    60|
|A       |ZZZZ   |    80|
|A       |XX     |    90|
|A       |X      |   100|
|B       |XYXY   |    15|
|B       |XXX    |    50|
|B       |YY     |    60|
|B       |ZZZ    |   150|
|B       |ZZ     |   400|

Now I want to filter each group (MainCat) and keep only the two lowest values (Value) that are equal/greater than median:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

|MainCat |SubCat | Value|Comment               |
|:-------|:------|-----:|:---------------------|
|A       |Y      |    50|-                     |
|A       |Z      |    60|-                     |
|A       |ZZZZ   |    80|Median, First to keep |
|A       |XX     |    90|Second to keep        |
|A       |X      |   100|-                     |
|B       |XYXY   |    15|-                     |
|B       |XXX    |    50|-                     |
|B       |YY     |    60|Median, First to keep |
|B       |ZZZ    |   150|Second to keep        |
|B       |ZZ     |   400|-                     |

Expected result:

|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A       |ZZZZ   |    80|
|A       |XX     |    90|
|B       |YY     |    60|
|B       |ZZZ    |   150|

My (failed) attempt

I tried df2[Value >= df2[MainCat==MainCat, median(Value, na.rm=TRUE)]] but this calculates a Median for all values, without grouping. Can somebody help? As performance is key, I prefer a data.table solution if possible. Thank you very much.


MWE

Base data:

df2 = structure(list(MainCat = c("A", "A", "A", "A", "A", "B", "B", 
                                 "B", "B", "B"), SubCat = c("Y", "Z", "ZZZZ", "XX", "X", "XYXY", 
                                                "XXX", "YY", "ZZZ", "ZZ"), Value = c(50, 60, 80, 90, 100, 15, 
                                                             50, 60, 150, 400)), row.names = c(NA, -10L), class = c("data.table", 
                                                                "data.frame"))

Result:

data.table(MainCat=c("A","A","B","B"),
                 SubCat=c("ZZZZ", "XX", "YY", "ZZZ"),
                 Value=c(80,90,60,150))

>Solution :

Do a group by ‘MainCat’, get the row index (.I) after creating the logical expression with the median ‘Value’, extract the index ($V1), subset the data, order by the ‘MainCat’, ‘Value’, get the first two rows with head, grouped by ‘MainCat’

library(data.table)
df2[df2[, .I[Value >= median(Value, na.rm = TRUE)],.(MainCat)]$V1
    ][order(MainCat, Value), head(.SD, 2), MainCat]

-output

   MainCat SubCat Value
    <char> <char> <num>
1:       A   ZZZZ    80
2:       A     XX    90
3:       B     YY    60
4:       B    ZZZ   150
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading