Find names of n largest values in each row of dataframe

February 11, 2023

I can find the n largest values in each row of a NumPy array, but doing so loses the column names, which are what I actually want. Say I have some data:

import pandas as pd
import numpy as np

np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data

          a         b         c         d         e
0  0.374540  0.950714  0.731994  0.598658  0.156019
1  0.155995  0.058084  0.866176  0.601115  0.708073
2  0.020584  0.969910  0.832443  0.212339  0.181825
3  0.183405  0.304242  0.524756  0.431945  0.291229
4  0.611853  0.139494  0.292145  0.366362  0.456070

I want the names of the largest contributors in each row. So for n = 2 the output would be:

0  b  c
1  c  e
2  b  c
3  c  d
4  a  e

I can do it by looping over the dataframe but that would be inefficient. Is there a more pythonic way?

Solution:

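One option is to use numpy.argpartition to find the indices of the n largest values in each row, then look up the column names by index. Note that argpartition only partitions each row and does not sort it, so the names within each row come back in no guaranteed order: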

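import numpy as np

n = 2  # number of largest values to keep per row
# After partitioning, the last n positions of each row hold the indices of the n largest values
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]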

#array([['c', 'b'],
#       ['e', 'c'],
#       ['c', 'b'],
#       ['d', 'c'],
#       ['e', 'a']], dtype=object)
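
If the names need to appear largest first, as in the desired output in the question, the n candidates can be sorted after partitioning. The following is a minimal sketch rather than part of the original answer; the variable names (top_idx, top_vals, order, top_names) are chosen here for illustration, and it rebuilds the example data so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question.
np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"))
n = 2

values = data.to_numpy()

# Column indices of the n largest values per row, in no particular order.
top_idx = np.argpartition(values, values.shape[1] - n, axis=1)[:, -n:]

# Sort only those n candidates by value, largest first.
top_vals = np.take_along_axis(values, top_idx, axis=1)
order = np.argsort(-top_vals, axis=1)
top_idx = np.take_along_axis(top_idx, order, axis=1)

# Map indices back to column names and keep the original row index.
top_names = pd.DataFrame(data.columns.to_numpy()[top_idx], index=data.index)
print(top_names)
```

Sorting only the n selected columns keeps the cost low when n is much smaller than the number of columns, which is the main reason to prefer argpartition over a full argsort of every row.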