Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

pandas: add a grouping variable to clusters of rows that meet a criteria

I have this dataframe:

df = pd.DataFrame({'forms_a_cluster': [False, False, True, True, True, False, False, False,
True, True, False, True, True, True, False],
'cluster_number':[False, False, 1, 1, 1, False, False, False,
              2, 2, False, 3, 3, 3, False]})

The idea is that I have some criteria which, when certain rows have met it, selects those cases as True, and when consecutive rows meet the criteria, they then form a cluster. I want to be able to label each cluster as cluster_1, cluster_2, cluster_3 etc. I’ve given an example of the hoped for output with the column cluster_number. But I have no idea how to do this, given that in the real data, I have to do it many times on different datasets which have a different number of rows and the cluster sizes will be different every time. Do you have any idea how to go about this? Thanks in advance!

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

You can use a groupby.ngroup on the groups of successive values pre-filtered to the True:

# group by successive values
m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()

# filter groups of True, add group number
# fill values with False
df['cluster_number'] = (m[df['forms_a_cluster']]
                        .groupby(m).ngroup().add(1)
                        .reindex(df.index, fill_value=False)
                        )

Or with arithmetics:

m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()
df['cluster_number'] = (df['forms_a_cluster']
                        .mask(df['forms_a_cluster'],
                              m//2 + df['forms_a_cluster'].iloc[0])
                       )

Output:

    forms_a_cluster cluster_number
0             False          False
1             False          False
2              True              1
3              True              1
4              True              1
5             False          False
6             False          False
7             False          False
8              True              2
9              True              2
10            False          False
11             True              3
12             True              3
13             True              3
14            False          False

Other example:

    forms_a_cluster cluster_number
0              True              1
1              True              1
2             False          False
3              True              2
4              True              2
5             False          False
6             False          False
7             False          False
8              True              3
9              True              3
10            False          False
11             True              4
12             True              4
13             True              4
14            False          False
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading