Pandas: get value from a column based on a max condition to get proper cluster names

I’ve successfully clustered my data and am presented with the following dataframe:

     cluster_group  name value
  0              1     A    20 
  1              1     B    30 
  2              1     C    10 
  3              1     D    50 
  4              2     E    20 
  5              2     F    10 
...

For easier exporting, I want to give each cluster_group a name instead of an integer. The name should be taken from the name column of the row with the highest value in that group. So the result should look like this:

     cluster_name  name value
  0             D     A    20 
  1             D     B    30 
  2             D     C    10 
  3             D     D    50 
  4             E     E    20 
  5             E     F    10 
...

How would I do this in the most efficient way?

Solution:

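If the names are unique, every group always gets a distinct cluster name. Set name as the index, then pass 'idxmax' to GroupBy.transform so that the index label of each group's maximum value is broadcast to all rows of that group: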

df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print (df)
  cluster_group name  value
0             D    A     20
1             D    B     30
2             D    C     10
3             D    D     50
4             E    E     20
5             E    F     10
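
For reference, here is a self-contained sketch of an alternative route to the same names. It is not the method from the answer above: it uses idxmax on the default integer index and then Series.map, and it writes to a new cluster_name column (as the question asks) instead of overwriting cluster_group. The variable names top_rows and top_names are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "B", "C", "D", "E", "F"],
    "value": [20, 30, 10, 50, 20, 10],
})

# Row label of the maximum value within each cluster (one label per cluster).
top_rows = df.groupby("cluster_group")["value"].idxmax()

# Name found on each of those rows, keyed by cluster_group.
top_names = df.loc[top_rows].set_index("cluster_group")["name"]

# Translate every row's cluster id into its cluster name.
df["cluster_name"] = df["cluster_group"].map(top_names)

print(df)
```

Keeping the integer column around can be handy if the names later need to be traced back to the original clustering.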

If the same name can appear in more than one group, different groups can end up with the same cluster name, so those groups are effectively merged:

print (df)
   cluster_group name  value
0              1    A     20
1              1    E    300 <- max per group 1 is E
2              1    C     10
3              1    D     50
4              2    E     20  <- max per group 2 is E
5              2    F     10

df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print (df)
  cluster_group name  value
0             E    A     20
1             E    E    300
2             E    C     10
3             E    D     50
4             E    E     20
5             E    F     10
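
If merging groups like this is not wanted, one possible workaround is to detect colliding names and make them distinct. The sketch below is an assumption, not part of the answer: the suffix convention (name plus "_" plus the cluster id) and the variable names are hypothetical, chosen only to illustrate the idea on the sample data above.

```python
import pandas as pd

# Sample data from the second example, where E is the top name in both groups.
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "E", "C", "D", "E", "F"],
    "value": [20, 300, 10, 50, 20, 10],
})

# Top name per cluster, keyed by cluster_group.
top_rows = df.groupby("cluster_group")["value"].idxmax()
top_names = df.loc[top_rows].set_index("cluster_group")["name"]

# True for every cluster whose top name is shared with another cluster.
collides = top_names.duplicated(keep=False)

# Hypothetical convention: append the cluster id to colliding names only.
ids = top_names.index.to_series().astype(str)
labels = top_names.where(~collides, top_names + "_" + ids)

df["cluster_name"] = df["cluster_group"].map(labels)
print(df)
```

Names that are already unique across clusters are left untouched, so the first example would still produce D and E.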