How to create 2 new columns in a DataFrame based on the highest values in the other columns with the appropriate prefix in Python Pandas?

I have a Pandas DataFrame like the one below (my real DataFrame is much bigger, so I need to apply the aggregation below only to selected columns):

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111  | 10          | 10          | 320       | 120
222  | 15          | 80          | 500       | 500
333  | 0           | 0           | 110       | 350
444  | 20          | 5           | 0         | 0
555  | 0           | 0           | 0         | 0
666  | 10          | 20          | 60        | 50
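
To reproduce the example, the table can be loaded as posted. This is a minimal sketch, assuming the pipe-delimited text is pasted into a string without the dashed separator row; the variable name `raw` is only illustrative, and SUM_COL_A for ID 666 is taken as 60, the value shown in the desired output table.

```python
import io

import pandas as pd

# Sample data copied from the question, without the dashed separator row.
raw = """
ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
111  | 10          | 10          | 320       | 120
222  | 15          | 80          | 500       | 500
333  | 0           | 0           | 110       | 350
444  | 20          | 5           | 0         | 0
555  | 0           | 0           | 0         | 0
666  | 10          | 20          | 60        | 50
""".strip()

# A regex separator swallows the padding around each pipe;
# regex separators require the Python parsing engine.
df = pd.read_csv(io.StringIO(raw), sep=r"\s*\|\s*", engine="python")
print(df)
```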

Requirements:

  • I need to create a new column "TOP_COUNT" containing the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID,

    • if an ID has the same value in both "COUNT_" columns, "TOP_COUNT" should take the column whose counterpart with the SUM_ prefix (SUM_COL_A or SUM_COL_B) has the higher value
  • I need to create a new column "TOP_SUM" containing the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID,

    • if an ID has the same value in both "SUM_" columns, "TOP_SUM" should take the column whose counterpart with the COUNT_ prefix (COUNT_COL_A or COUNT_COL_B) has the higher value
  • If both columns with the COUNT_ prefix are 0, put NaN in column TOP_COUNT

  • If both columns with the SUM_ prefix are 0, put NaN in column TOP_SUM

Desired output:

ID   | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B  | TOP_COUNT   | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111  | 10          | 10          | 320       | 120        | COUNT_COL_A | SUM_COL_A 
222  | 15          | 80          | 500       | 500        | COUNT_COL_B | SUM_COL_B  
333  | 0           | 0           | 110       | 350        | NaN         | SUM_COL_B  
444  | 20          | 5           | 0         | 0          | COUNT_COL_A | NaN
555  | 0           | 0           | 0         | 0          | NaN         | NaN
666  | 10          | 20          | 60        | 50         | COUNT_COL_B | SUM_COL_A

How can I do that in Python Pandas?

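>Solution:

You can use the idxmax function as follows. Note that on a tie idxmax returns the first of the tied columns, so this snippet handles the maximum and the all-zero cases but does not apply the tie-break rules from the question: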

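import pandas as pd

# Column name of the row-wise maximum; on a tie idxmax returns the first column
df['TOP_COUNT'] = df[['COUNT_COL_A', 'COUNT_COL_B']].idxmax(axis="columns")
df['TOP_SUM'] = df[['SUM_COL_A', 'SUM_COL_B']].idxmax(axis="columns")

# Rows where both columns are 0 get a missing value instead of a column name
df.loc[(df[['COUNT_COL_A', 'COUNT_COL_B']] == 0).all(axis=1), 'TOP_COUNT'] = pd.NA
df.loc[(df[['SUM_COL_A', 'SUM_COL_B']] == 0).all(axis=1), 'TOP_SUM'] = pd.NA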
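
The question also asks for ties to be resolved by the counterpart columns, which idxmax alone does not do. A possible sketch of that rule for the two-column case follows. The helper `top_column` and its parameters `primary` and `counterpart` are hypothetical names, not part of pandas, and it assumes that a row tied in both the primary and the counterpart columns falls back to the first column, a case the question does not specify. The sample data uses 60 for SUM_COL_A of ID 666, as in the desired output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [111, 222, 333, 444, 555, 666],
    "COUNT_COL_A": [10, 15, 0, 20, 0, 10],
    "COUNT_COL_B": [10, 80, 0, 5, 0, 20],
    "SUM_COL_A": [320, 500, 110, 0, 0, 60],
    "SUM_COL_B": [120, 500, 350, 0, 0, 50],
})


def top_column(frame, primary, counterpart):
    """Hypothetical helper: name of the larger of two `primary` columns per row.

    Ties are broken by the larger value among the `counterpart` columns
    (listed in the same order as `primary`). Rows where both primary
    values are 0 get NaN.
    """
    p = frame[primary].to_numpy()
    c = frame[counterpart].to_numpy()
    # Choose the second column when it is strictly larger, or when the
    # primary values tie and its counterpart is strictly larger.
    pick_second = (p[:, 1] > p[:, 0]) | ((p[:, 1] == p[:, 0]) & (c[:, 1] > c[:, 0]))
    names = np.where(pick_second, primary[1], primary[0]).astype(object)
    # Both primary values are 0: no top column.
    names[(p == 0).all(axis=1)] = np.nan
    return pd.Series(names, index=frame.index)


count_cols = ["COUNT_COL_A", "COUNT_COL_B"]
sum_cols = ["SUM_COL_A", "SUM_COL_B"]

df["TOP_COUNT"] = top_column(df, count_cols, sum_cols)
df["TOP_SUM"] = top_column(df, sum_cols, count_cols)
print(df)
```

For more than two columns per prefix the comparison would need to be generalized, for example by ranking rows on the primary value first and the counterpart value second.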