Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

groupby in pandas with custom function over a subset of rows in each group

I have a pandas DataFrame of the following format:

Input:

                                     X    [OTHER_COLUMNS]
version     branch                      
v1          overall  2475.0         -1         .
            A        1712.5          1         .
            B         257.5          2         .
            C         392.5          2
            D         112.5          3
v2          overall  2475.0         -1
            A        2341.5          1
            B          95.0          2
            C          38.5          2
v3          overall  2475.0         -1
            A        2000.0          1
            B         475.0          2
v4          overall  2475.0         -1
            A        2341.5          1
            B         133.5          1

where (version, branch) is a MultiIndex.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

PROBLEM DESCRIPTION:

I want to groupby version and set the values in the column X with branch overall to the sum of the values in the column X for the remaining branches (having the same version), weighted by the values in the column N.

EXAMPLE:

For version v2, the value in the cell with column X and branch overall should be

(2341.5 * 1 + 95.0 * 2 + 38.5 * 2) / 2475.0 = 1.05393939394,

and in pseudo-code:

(A_N * A_X + B_N * B_X) / overall_N.

Note: For a given version, the value in column N and branch overall will always be equal to the sum of the values in column N for the other branch‘es.

IDEA AND QUESTION:

I think I have to do the following:

df.loc[pd.IndexSlice[:, 'overall'], 'X'] = df.groupby('version').apply(...)

where df is the DataFrame and where ... is to be replaced by a custom function.

I am looking for help in constructing such a function.

Expected output:

                          N          X
version     branch                      
v1          overall  2475.0          1.35353535354
            A        1712.5          1
            B         257.5          2
            C         392.5          2
            D         112.5          3
v2          overall  2475.0          1.05393939394
            A        2341.5          1
            B          95.0          2
            C          38.5          2
v3          overall  2475.0          1.19191919192
            A        2000.0          1
            B         475.0          2
v4          overall  2475.0          1
            A        2341.5          1
            B         133.5          1

Explaination of expected output:

(1712.5 * 1 + 257.5 * 2 + 392.5 * 2 + 112.5 * 3) / 2475.0 = 1.35353535354

(2341.5 * 1 + 95.0 * 2 + 38.5 * 2) / 2475.0 = 1.05393939394

(2000.0 * 1 + 475.0 * 2) / 2475.0 = 1.19191919192

(2341.5 * 1 + 133.5 * 1) / 2475.0 = 1

CODE TO CREATE DATAFRAME:

import numpy as np
import pandas as pd
    
df = pd.DataFrame(
    data=np.array(
        [
            [2475.0, 1712.5, 257.5, 392.5, 112.5, 2475.0, 2341.5, 95.0, 38.5, 2475.0, 2000.0, 475.0, 2475.0, 2341.5, 133.5],
            [-1, 1, 2, 2, 3, -1, 1, 2, 2, -1, 1, 2, -1, 1, 1]
        ]
    ).T,
    index=pd.MultiIndex.from_tuples(
        tuples=[
            ('v1', 'overall'),
            ('v1', 'A'),
            ('v1', 'B'),
            ('v1', 'C'),
            ('v1', 'D'),
            ('v2', 'overall'),
            ('v2', 'A'),
            ('v2', 'B'),
            ('v2', 'C'),
            ('v3', 'overall'),
            ('v3', 'A'),
            ('v3', 'B'),
            ('v4', 'overall'),
            ('v4', 'A'),
            ('v4', 'B'),
        ],
        names=['version', 'branch'],
    ),
    columns=['N', 'X'],
)

print (df)
                      N    X
version branch              
v1      overall  2475.0 -1.0
        A        1712.5  1.0
        B         257.5  2.0
        C         392.5  2.0
        D         112.5  3.0
v2      overall  2475.0 -1.0
        A        2341.5  1.0
        B          95.0  2.0
        C          38.5  2.0
v3      overall  2475.0 -1.0
        A        2000.0  1.0
        B         475.0  2.0
v4      overall  2475.0 -1.0
        A        2341.5  1.0
        B         133.5  1.0

>Solution :

Use:

#select overalls only
overall = df['N'].xs('overall', level=1)
#select all rows without overalls
df1 = df.drop('overall', level=1)

#multiple and aggregate sum, divide overalls   
s = df1['N'].mul(df1['X']).groupby(level=0).sum().div(overall)

#create MultiIndex and assign back
df.loc[pd.IndexSlice[:, 'overall'], 'X'] = pd.concat({'overall':s}).swaplevel(0,1)

print (df)
                      N         X
version branch                   
v1      overall  2475.0  1.353535
        A        1712.5  1.000000
        B         257.5  2.000000
        C         392.5  2.000000
        D         112.5  3.000000
v2      overall  2475.0  1.053939
        A        2341.5  1.000000
        B          95.0  2.000000
        C          38.5  2.000000
v3      overall  2475.0  1.191919
        A        2000.0  1.000000
        B         475.0  2.000000
v4      overall  2475.0  1.000000
        A        2341.5  1.000000
        B         133.5  1.000000
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading