Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

(Python/Pandas) Subset DataFrame based on non-missing values from a column

I have a pd dataframe:

import pandas as pd
column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2,column3]),columns=['column1', 'column2', 'column3'])

print(df)
   column1 column2 column3
0     None    None    None
1     None    None    None
2     None    None    None
3        4    None       3
4        8       5    None
5        9       1       7
6     None    None    None
7     None    None    None
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1
12    None    None    None

I want to subset the rows between the values in column 3, and get rid of all empty rows. My desired outcomes are:

print (df1)   
   column1 column2 column3
0        4    None       3
1        8       5    None
2        9       1       7

print(df2)
   column1 column2 column3
0     None       6       7
1        2       3    None
2        3       3    None
3        5    None       1

I don’t care about the actual values column3. Column 3 values are used to indicate "start" and "stop".

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

You can find the non-na value, then perform a cumulative sum, then mod 2 to get the "groups" of start and one-less-than stop positions. Shifting this by 1, adding to the original, and clipping to (0, 1) gets clumps of the start and stop points.
To label the groups, you can take a diff of 1, then clip to (0, 1) again, and cum sum, then multiply those two together.

g_small = (~df.column3.isna()).cumsum().mod(2)
g = (g_small  + g_small .shift(1, fill_value=0)).clip(0,1)

groups = g.diff(1).fillna(0).clip(0,1).cumsum().astype(int) * g

You can then do a groupby operation on the data frame:

dfs = {i: g for i, g in df.groupby(groups) if i > 0}

dfs
# returns:
{1:
   column1 column2 column3
 3       4    None       3
 4       8       5    None
 5       9       1       7,

 2:
    column1 column2 column3
 8     None       6       7
 9        2       3    None
 10       3       3    None
 11       5    None       1}
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading