How to purge missing values from a DataFrame in Julia?

After reading the context, if you felt the title could be enhanced to fit the question and you had an idea, feel free to update it.
Suppose I have the following DataFrame:

using DataFrames
df = DataFrame(
  g=["a","b","a","c",missing,missing,missing,missing],
  a=[1,2,3,4,missing,missing,missing,missing],
  Column1=[missing,missing,missing,missing,false,false,false,true],
  Column2=[missing,missing,missing,missing,false,true,true,true],
  Column3=[missing,missing,missing,missing,true,true,false,false],
)
# 8×5 DataFrame
#  Row │ g        a        Column1  Column2  Column3
#      │ String?  Int64?   Bool?    Bool?    Bool?
# ─────┼─────────────────────────────────────────────
#    1 │ a              1  missing  missing  missing
#    2 │ b              2  missing  missing  missing
#    3 │ a              3  missing  missing  missing
#    4 │ c              4  missing  missing  missing
#    5 │ missing  missing    false    false     true
#    6 │ missing  missing    false     true     true
#    7 │ missing  missing    false     true    false
#    8 │ missing  missing     true     true    false

I want to convert it to this:

# 4×5 DataFrame
#  Row │ g        a        Column1  Column2  Column3
#      │ String?  Int64?   Bool?    Bool?    Bool?
# ─────┼─────────────────────────────────────────────
#    1 │ a              1    false    false     true
#    2 │ b              2    false     true     true
#    3 │ a              3    false     true    false
#    4 │ c              4     true     true    false

I tried:

DataFrame(collect.(skipmissing.(eachcol(df))), names(df))

But I don't think this is optimal, since it relies on the collect function. Is there a better way to do it?

Solution:

For me a natural way to do it would be:

julia> mapcols(x -> filter(!ismissing, x), df)
4×5 DataFrame
 Row │ g        a       Column1  Column2  Column3
     │ String?  Int64?  Bool?    Bool?    Bool?
─────┼────────────────────────────────────────────
   1 │ a             1    false    false     true
   2 │ b             2    false     true     true
   3 │ a             3    false     true    false
   4 │ c             4     true     true    false

However, this assumes that the number of missing values is the same in every column (which I guess is the case in this exercise, right?). Otherwise the filtered columns would have different lengths and mapcols would throw an error.
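To make that assumption explicit, one option is to count the non-missing values per column before compacting. The sketch below is only an illustration: `compact_columns` is a hypothetical helper name, not something DataFrames.jl provides, and the two small frames are made-up data.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): drop missing values per column,
# but fail with a clear message if the columns would end up with different lengths.
function compact_columns(df::AbstractDataFrame)
    counts = [count(!ismissing, col) for col in eachcol(df)]
    if length(unique(counts)) > 1
        throw(ArgumentError("non-missing counts differ across columns: $counts"))
    end
    return mapcols(x -> filter(!ismissing, x), df)
end

# Every column has two non-missing values, so compaction is well defined.
ok = DataFrame(g = ["a", "b", missing, missing], Column1 = [missing, missing, false, true])
compact_columns(ok)     # 2 rows: ("a", false) and ("b", true)

# Here the counts are 3 and 2, so the helper refuses instead of producing ragged columns.
bad = DataFrame(g = ["a", "b", "c", missing], Column1 = [missing, missing, false, true])
# compact_columns(bad)  # throws ArgumentError
```

The check costs one extra pass over the data, which is usually negligible next to the filtering itself.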

skipmissing is designed for cases where the user wants a non-copying iterable that skips missing values, which is not the case here.
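A small sketch of the difference on a single made-up vector (the variable names are illustrative only): `skipmissing` wraps the data lazily, `collect` materializes it, and `filter` copies while keeping the original element type.

```julia
col = [1, missing, 3, missing]

# Lazy wrapper: no copy is made, and reductions can consume it directly.
lazy = skipmissing(col)
total = sum(lazy)               # 4

# Materializing the wrapper allocates a new vector whose element type excludes missing.
dense = collect(lazy)           # Vector{Int64}: [1, 3]

# filter also allocates, but keeps the original element type, Union{Missing, Int64}.
kept = filter(!ismissing, col)
```

This is why the `mapcols`/`filter` result still shows `String?`, `Int64?` and `Bool?` column types. If the narrower types are wanted, calling `disallowmissing` on the resulting DataFrame should convert them.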
