Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Apply "any" or "all" function row-wise to arbitrary number of Boolean columns in Julia DataFrames.jl

Suppose I have a dataframe with multiple boolean columns representing certain conditions:

df = DataFrame(
         id = ["A", "B", "C", "D"], 
         cond1 = [true, false, false, false], 
         cond2 = [false, false, false, false], 
         cond3 = [true, false, true, false]
)
id cond1 cond2 cond3
1 A 1 0 1
2 B 0 0 0
3 C 0 0 1
4 D 0 0 0

Now suppose I want to identify rows where any of these conditions are true, ie "A" and "C". It is easy to do this explicitly:

df[:, :all] = df.cond1 .| df.cond2 .| df.cond3

But how can this be done when there are an arbitrary number of conditions, for example something like:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df[:, :all] = any.([ df[:, Symbol("cond$i")] for i in 1:3 ])

The above fails with DimensionMismatch("tried to assign 3 elements to 4 destinations") because the any function is being applied column-wise, rather than row-wise. So the real question is: how to apply any row-wise to multiple Boolean columns in a dataframe?

The ideal output should be:

id cond1 cond2 cond3 all
1 A 1 0 1 1
2 B 0 0 0 0
3 C 0 0 1 1
4 D 0 0 0 0

>Solution :

Here is one way to do it:

julia> df = DataFrame(
                id = ["A", "B", "C", "D", "E"],
                cond1 = [true, false, false, false, true],
                cond2 = [false, false, false, false, true],
                cond3 = [true, false, true, false, true]
       )
5×4 DataFrame
 Row │ id      cond1  cond2  cond3
     │ String  Bool   Bool   Bool
─────┼─────────────────────────────
   1 │ A        true  false   true
   2 │ B       false  false  false
   3 │ C       false  false   true
   4 │ D       false  false  false
   5 │ E        true   true   true

julia> transform(df, AsTable(r"cond") .=> ByRow.([maximum, minimum]) .=> [:any, :all])
5×6 DataFrame
 Row │ id      cond1  cond2  cond3  any    all
     │ String  Bool   Bool   Bool   Bool   Bool
─────┼───────────────────────────────────────────
   1 │ A        true  false   true   true  false
   2 │ B       false  false  false  false  false
   3 │ C       false  false   true   true  false
   4 │ D       false  false  false  false  false
   5 │ E        true   true   true   true   true

Note that it is quite fast even for very wide tables:

julia> df = DataFrame(rand(Bool, 10_000, 10_000), :auto);

julia> @time transform(df, AsTable(r"x") .=> ByRow.([maximum, minimum]) .=> [:any, :all]);
  0.059275 seconds (135.41 k allocations: 103.038 MiB)

In the examples I have used a regex column selector, but of course you can use any row selector you like.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading