Pandas: get value from column based on max condition to get proper cluster names

December 3, 2021

I’ve successfully clustered my data and am presented with the following dataframe:

     cluster_group  name value
  0              1     A    20 
  1              1     B    30 
  2              1     C    10 
  3              1     D    50 
  4              2     E    20 
  5              2     F    10 
...

For easier exporting, I want to give each cluster_group a name instead of an integer. The name should be the entry in the name column that has the highest value within that group. So the result should look like this:

     cluster_name  name value
  0             D     A    20 
  1             D     B    30 
  2             D     C    10 
  3             D     D    50 
  4             E     E    20 
  5             E     F    10 
...

How would I do this in the most efficient way?

Solution:

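If each name appears in only one group, the cluster names will be unique as well. Set name as the index and use SeriesGroupBy.idxmax inside GroupBy.transform, so that every row receives the name belonging to the maximum value of its group: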

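# Index by name so that idxmax returns a name rather than a row number,
# then broadcast each group's result to all of its rows.
df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print(df)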
  cluster_group name  value
0             D    A     20
1             D    B     30
2             D    C     10
3             D    D     50
4             E    E     20
5             E    F     10
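
For reference, here is a minimal self-contained sketch of the same approach, using the sample data from the question. It assumes you would rather write the result to a new cluster_name column (as in the question's desired output) than overwrite cluster_group; that column choice is my assumption, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "B", "C", "D", "E", "F"],
    "value": [20, 30, 10, 50, 20, 10],
})

# With 'name' as the index, idxmax returns the name of the row holding
# each group's largest value; transform broadcasts it to every row.
df["cluster_name"] = (
    df.set_index("name")
      .groupby("cluster_group")["value"]
      .transform("idxmax")
      .to_numpy()  # drop the 'name' index so the assignment is positional
)
print(df)
```

The .to_numpy() call matters here: without it, pandas would try to align the result (indexed by name) with the integer index of df.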

If the same name can appear in more than one group, different groups may end up with the same cluster name, which effectively joins those groups together:

print(df)
   cluster_group name  value
0              1    A     20
1              1    E    300 <- max per group 1 is E
2              1    C     10
3              1    D     50
4              2    E     20  <- max per group 2 is E
5              2    F     10

df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print(df)
  cluster_group name  value
0             E    A     20
1             E    E    300
2             E    C     10
3             E    D     50
4             E    E     20
5             E    F     10
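
If you want to know in advance whether any groups would be joined this way, one possible check is sketched below. It is only an illustration using the sample data above; the variable names top_rows, top_names and shared are hypothetical.

```python
import pandas as pd

# Sample data from the second example, where E is the top name in both groups.
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "E", "C", "D", "E", "F"],
    "value": [20, 300, 10, 50, 20, 10],
})

# Row label of the maximum value in each group (default integer index).
top_rows = df.groupby("cluster_group")["value"].idxmax()

# Map each cluster_group to the name found in its top row.
top_names = df.loc[top_rows].set_index("cluster_group")["name"]

# Names chosen by more than one group; a non-empty result means
# those groups would receive the same cluster name.
shared = top_names[top_names.duplicated(keep=False)]
print(shared)
```

An empty shared Series means every group gets a distinct name, so the rename is safe to apply as-is.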