Find names of n largest values in each row of dataframe

I can find the n largest values in each row of a NumPy array, but doing so loses the column names, which are what I actually want. Say I have some data:

import pandas as pd
import numpy as np

np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data

          a         b         c         d         e
0  0.374540  0.950714  0.731994  0.598658  0.156019
1  0.155995  0.058084  0.866176  0.601115  0.708073
2  0.020584  0.969910  0.832443  0.212339  0.181825
3  0.183405  0.304242  0.524756  0.431945  0.291229
4  0.611853  0.139494  0.292145  0.366362  0.456070

I want the names of the largest contributors in each row. So for n = 2 the output would be:

0  b  c
1  c  e
2  b  c
3  c  d
4  a  e

I can do it by looping over the dataframe but that would be inefficient. Is there a more pythonic way?

Solution:

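One option is to use numpy.argpartition to find the column positions of the top n values in each row, and then look up the column names by position. Note that argpartition only guarantees that the last n positions hold the n largest values; it does not sort them, so the names within each row come back in no particular order:

import numpy as np
n = 2
# column positions of the n largest values per row (unordered within each row)
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]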

#array([['c', 'b'],
#       ['e', 'c'],
#       ['c', 'b'],
#       ['d', 'c'],
#       ['e', 'a']], dtype=object)
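
If the names should come out ordered from largest to smallest, as in the desired output in the question, the partitioned indices can be sorted afterwards. The following is a minimal sketch of that idea, not part of the original answer; the variable names and the `top1`, `top2` column labels are made up for illustration, and it assumes the same seeded data as the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question.
np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"))
n = 2

values = data.to_numpy()

# Positions of the n largest entries per row, in no particular order.
top_idx = np.argpartition(values, -n, axis=1)[:, -n:]

# Sort only those n entries by descending value within each row.
top_vals = np.take_along_axis(values, top_idx, axis=1)
order = np.argsort(-top_vals, axis=1)
top_idx_sorted = np.take_along_axis(top_idx, order, axis=1)

# Map positions back to column names; the labels top1..topn are illustrative.
result = pd.DataFrame(
    data.columns.to_numpy()[top_idx_sorted],
    index=data.index,
    columns=[f"top{i + 1}" for i in range(n)],
)
print(result)
```

Sorting only the n selected entries keeps the partition's advantage over a full per-row argsort when n is much smaller than the number of columns. For a frame as small as this one the difference is negligible, and `np.argsort(-values, axis=1)[:, :n]` would likely be the simpler choice.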