Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Filter rows from Pandas dataframe based on max value of a column

I have a pandas dataframe with following columns

col1   col2   col3
x      12     abc
x       7     abc
x       5     abc
x       3   
y      10     abc
y       9     abc

I would like to find all rows in a pandas DataFrame which have the max value for col2 column, after grouping by ‘col1’ columns after filtering the rows where col3 is null?

The expected output is:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

col1   col2   col3
x      12     abc
y      10     abc

I have tried the below code so far.

df[df[['col3']].notnull().all(1) & df.sort_values('col2').drop_duplicates(['col1'], keep='last')]

However I am getting following error.

TypeError: unsupported operand type(s) for &: 'bool' and 'float'

Any help is highly appreciated

>Solution :

How max method calculate without mentioning the column?

According to the pd.DataFrame.max it returns the maximum values over the selected axis with default being (0, index).

In your example you only have 1 numeric values and all values in col3 are identical. If col3 was also numeric, max method would return the max values of that column as well with the resulting DataFrame may have different rows than the original one.

It is suitable in this case but if you only would like the output DataFrame have the rows as the original one you need to be specific about the column whose maximum you would like to consider.

df.loc[df.notnull().all(axis=1)].groupby('col1').max().reset_index()

  col1  col2 col3
0    x    12  abc
1    y    10  abc

Or you can create a boolean Series first and assign it to a name to improve readability:

m = df.notnull().all(axis=1)
df.loc[m].groupby('col1').max().reset_index()

Let’s say now this is your original DataFrame:

  col1  col2  col3
0    x    12   2.0
1    x     7  20.0
2    x     5   1.0
3    x     3   NaN
4    y    10   4.0
5    y     9  11.0

When you apply max on this without specifying the column name it will return the following:

  col1  col2  col3
0    x    12  20.0
1    y    10  11.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading