Return indexes of all samples in Python

December 5, 2022

I am a beginner in Python and have this data frame that contains samples, values, and a cluster number for each sample:

import pandas as pd

df = pd.DataFrame({'samples': ['A', 'B', 'C', 'D', 'E'],
                   'values': [0.336663, 0.447101, 0.402529, 0.373014, 0.456226],
                   'cluster': [1, 0, 2, 0, 1]})
df

output:

  samples    values  cluster
0       A  0.336663        1
1       B  0.447101        0
2       C  0.402529        2
3       D  0.373014        0
4       E  0.456226        1

The following code returns the index of the sample with the max value in each cluster. For example, cluster 0 contains B and D, and B has the larger value, so it returns B's index, which is 1. Likewise, cluster 1 contains A and E, and E has the larger value, so E's index, 4, is returned, and so on.

max_value = []  # max value of each cluster
clust_max = []  # index of the max-value sample of each cluster

tmp = df['values']
clust_labels = df['cluster']
clusters = len(set(clust_labels))  # number of clusters (labels are assumed to be 0..n-1)

# loop over the cluster labels
for j in range(clusters):
    elems = [i for i, x in enumerate(clust_labels) if x == j]  # positions of the samples in cluster j
    values = [tmp[elem] for elem in elems]  # values of those samples
    max_value_temp = max(values)  # max value in the cluster
    max_value.append(max_value_temp)  # store the max value
    max_ind = values.index(max_value_temp)  # position of the max within the cluster
    clust_max.append(elems[max_ind])  # store the index of the max-value sample

output:

[1, 4, 2]

I want to update this code to return the indexes of all samples, not only the max-value sample of each cluster.

The expected output:

[0, 1, 2, 3, 4]

>Solution :

I don't really get why you are using Java-style logic to work with Python; probably, as you mentioned, you are still new to it. I didn't quite get what you expect from the output, so I did something according to what I understood.

import pandas as pd

dfc = pd.DataFrame({'samples': ['A', 'B', 'C', 'D', 'E'],
                   'values': [0.336663, 0.447101, 0.402529, 0.373014, 0.456226],
                   'cluster': [1, 0, 2, 0, 1]})

# get the max value of each cluster using groupby
# ('values' is selected explicitly so the string column 'samples' is not aggregated)
dfmax = dfc.groupby('cluster')[['values']].max()

# add the row index of each cluster's max value as a column, using groupby and idxmax
dfmax['idx'] = dfc.groupby('cluster')['values'].idxmax()

# you can also sort by two columns, here values and cluster (or vice versa if you prefer), which is kind of a groupby
# you don't need Java-style loops in Python; there is a more pythonic way to do this with pandas
dfsorted = dfc.sort_values(['values', 'cluster'], ascending=False)
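
If the goal really is the indexes of all samples and not only each cluster's max, here is a minimal sketch of one possible reading. It assumes you want the row indexes collected per cluster and then flattened into one sorted list, which matches the expected output in the question. The names indexes_by_cluster and all_indexes are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'samples': ['A', 'B', 'C', 'D', 'E'],
                   'values': [0.336663, 0.447101, 0.402529, 0.373014, 0.456226],
                   'cluster': [1, 0, 2, 0, 1]})

# row indexes of every sample, keyed by cluster label
# (assumption: this per-cluster grouping is what the question is after)
indexes_by_cluster = {int(cluster): group.index.tolist()
                      for cluster, group in df.groupby('cluster')}

# flatten into a single sorted list of all sample indexes
all_indexes = sorted(i for idx in indexes_by_cluster.values() for i in idx)

print(indexes_by_cluster)  # {0: [1, 3], 1: [0, 4], 2: [2]}
print(all_indexes)         # [0, 1, 2, 3, 4]
```

If only the flat list is needed, df.index.tolist() gives the same result directly; the per-cluster dictionary is useful when you still need to know which samples belong to which cluster.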