Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Percentage of non-empty values by rows in a dataframe

Assume I have this dataframe:

import pandas as pd

d =  {"A" : [3,4,7,6,5,8,None,4,7],
      "B" : ['a',None,'d',None,'j','u','k','k','z'],
      "C" : [4,None,8,4,8,None,2,7,None],
      "D" : [None,9,5,None,2,5,3,None,7],
      "E": ['m','n','l',None,'s','g','s','u','a']}
df = pd.DataFrame(d)

################################

     A     B    C    D     E
0  3.0     a  4.0  NaN     m
1  4.0  None  NaN  9.0     n
2  7.0     d  8.0  5.0     l
3  6.0  None  4.0  NaN  None
4  5.0     j  8.0  2.0     s
5  8.0     u  NaN  5.0     g
6  NaN     k  2.0  3.0     s
7  4.0     k  7.0  NaN     u
8  7.0     z  NaN  7.0     a

What I need is a dictionary, which shows me all the possible combinations of columns with the percentage of empty (or filled) rows, something like this e.g.:

result = {2/9:["A","B","C","D","E"],
          2/9:["A","B","C","E"],
          ...,
          1/9:["A","C"]}

I’ve added also a little image to illustrate my problem:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

enter image description here

This would help me to proper slice and dice the dataframe according to the coverage of non-empty values. Thank you for your help!

>Solution :

Mask the data and you can groupby an count the rows with size:

out = (df.notnull()
         .groupby(list(df)).size()
         .div(len(df))
      )
print(out.reset_index(name='pct'))

Output (False indicates a null value, True indicates a non-null)

       A      B      C      D      E       pct
0  False   True   True   True   True  0.111111
1   True  False  False   True   True  0.111111
2   True  False   True  False  False  0.111111
3   True   True  False   True   True  0.222222
4   True   True   True  False   True  0.222222
5   True   True   True   True   True  0.222222

If you want a dictionary, you can try:

{
    tuple(df.columns[list(idx)]):val for idx, val in out.iteritems()
}

Output:

{('B', 'C', 'D', 'E'): 0.1111111111111111,
 ('A', 'D', 'E'): 0.1111111111111111,
 ('A', 'C'): 0.1111111111111111,
 ('A', 'B', 'D', 'E'): 0.2222222222222222,
 ('A', 'B', 'C', 'E'): 0.2222222222222222,
 ('A', 'B', 'C', 'D', 'E'): 0.2222222222222222
}
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading