How to create 2 new columns in a DataFrame based on the highest values in the rest of the columns, with the appropriate prefix, in Python Pandas?

byMR

January 20, 2023

I have a Pandas DataFrame like the one below (my real DataFrame is much bigger, so I need to apply the aggregation below only to selected columns):

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111  | 10          | 10          | 320       | 120
222  | 15          | 80          | 500       | 500
333  | 0           | 0           | 110       | 350
444  | 20          | 5           | 0         | 0
555  | 0           | 0           | 0         | 0
666  | 10          | 20          | 30        | 50

Requirements:

I need to create a new column "TOP_COUNT" containing the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID,

- if an ID has the same value in both "COUNT_" columns, "TOP_COUNT" should take the name of the COUNT_ column whose counterpart with the SUM_ prefix (SUM_COL_A or SUM_COL_B) has the higher value
I need to create a new column "TOP_SUM" containing the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID,
- if an ID has the same value in both "SUM_" columns, "TOP_SUM" should take the name of the SUM_ column whose counterpart with the COUNT_ prefix (COUNT_COL_A or COUNT_COL_B) has the higher value
If both columns with the COUNT_ prefix are 0, TOP_COUNT should be NaN
If both columns with the SUM_ prefix are 0, TOP_SUM should be NaN

Desired output:

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B  | TOP_COUNT   | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111  | 10          | 10          | 320       | 120        | COUNT_COL_A | SUM_COL_A 
222  | 15          | 80          | 500       | 500        | COUNT_COL_B | SUM_COL_B  
333  | 0           | 0           | 110       | 350        | NaN         | SUM_COL_B  
444  | 20          | 5           | 0         | 0          | COUNT_COL_A | NaN
555  | 0           | 0           | 0         | 0          | NaN         | NaN
666  | 10          | 20          | 60        | 50         | COUNT_COL_B | SUM_COL_A

How can I do that in Python Pandas?

Solution:

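You can use the idxmax function along the column axis, then overwrite the rows where both columns are 0 with a missing value. Note that idxmax returns the first column when values are tied, so this code does not apply the tie-breaking rule from the question: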

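import pandas as pd

# Name of the column holding the row-wise maximum in each group
df['TOP_COUNT'] = df[['COUNT_COL_A', 'COUNT_COL_B']].idxmax(axis="columns")
df['TOP_SUM'] = df[['SUM_COL_A', 'SUM_COL_B']].idxmax(axis="columns")

# Rows where both columns of a group are 0 get a missing value instead
df.loc[(df[['COUNT_COL_A', 'COUNT_COL_B']] == 0).all(axis=1), 'TOP_COUNT'] = pd.NA
df.loc[(df[['SUM_COL_A', 'SUM_COL_B']] == 0).all(axis=1), 'TOP_SUM'] = pd.NA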
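
Because idxmax picks the first matching column on a tie, the snippet above does not by itself cover the tie-breaking requirement. Below is a minimal sketch of one way to handle it, assuming exactly two columns per group with the COL_A / COL_B suffixes. The helper name top_column and its parameters are hypothetical, and falling back to the COL_A column when both the values and their counterparts are tied is an assumption, since the question does not say what should happen then. The data are copied from the input table as posted, where ID 666 has SUM_COL_A = 30 (the desired-output table shows 60), so TOP_SUM for that row comes out as SUM_COL_B here.

```python
import numpy as np
import pandas as pd

# Input table from the question, as posted
df = pd.DataFrame({
    "ID": [111, 222, 333, 444, 555, 666],
    "COUNT_COL_A": [10, 15, 0, 20, 0, 10],
    "COUNT_COL_B": [10, 80, 0, 5, 0, 20],
    "SUM_COL_A": [320, 500, 110, 0, 0, 30],
    "SUM_COL_B": [120, 500, 350, 0, 0, 50],
})


def top_column(frame, prefix, tie_prefix, suffixes=("COL_A", "COL_B")):
    """Hypothetical helper: per row, name of the larger `prefix` column.

    Ties are broken by the counterpart columns with `tie_prefix`;
    rows where both `prefix` columns are 0 become NaN.
    """
    a, b = (f"{prefix}_{s}" for s in suffixes)
    tie_a, tie_b = (f"{tie_prefix}_{s}" for s in suffixes)

    # A wins if strictly larger, or if tied and its counterpart is
    # at least as large (assumption: a full tie falls back to A).
    a_wins = (frame[a] > frame[b]) | (
        (frame[a] == frame[b]) & (frame[tie_a] >= frame[tie_b])
    )
    top = pd.Series(np.where(a_wins, a, b), index=frame.index, dtype="object")

    # Both columns 0 -> missing value, as the question requires
    both_zero = (frame[a] == 0) & (frame[b] == 0)
    return top.mask(both_zero)


df["TOP_COUNT"] = top_column(df, "COUNT", "SUM")
df["TOP_SUM"] = top_column(df, "SUM", "COUNT")
print(df)
```

For a larger DataFrame, the same helper can be called once per prefix pair, and only the columns named by the prefix and suffixes are read.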