Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Compute mean value of rows that has the same column value in Pandas

I’m trying to combine three pandas DataFrames together

One of them (called major) has a column category where each row has a unique label :

major_df = pd.DataFrame(np.random.randint(0, 100, size=(3, 2)), columns=list("AB"))
major_df["category"] = pd.Series(["cat_A", "cat_B", "cat_C"])
    A   B category
0  90  17    cat_A
1  36  81    cat_B
2  90  67    cat_C

Two other dfs (called minor) contains multiple rows and have their own unique column names. Each df has a column category` where each row has a value that is present in the major df category column :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

minor_dfs = {}
for k, cols in zip(("1st", "2nd"), ("CD", "EF")):
    minor_dfs[k] = pd.DataFrame(np.random.randint(0, 100, size=(8, 2)), columns=list(cols))
    minor_dfs[k]["category"] = np.random.choice(["cat_A", "cat_B", "cat_C"], 8)

Here is an example of one of those minor dfs. The only difference between both is that first minor df has the columns C and D, where the second has columns E and F.

    C   D category
0  71  44    cat_C
1   5  88    cat_C
2   8  78    cat_C
3  31  27    cat_C
4  42  48    cat_B
5  18  18    cat_B
6  84  23    cat_A
7  94  23    cat_A

So, my goal is to compute the mean of the values in minor dfs based on the category column, so that at the end, I have the following dfs :

           C      D
cat_A  89.00  23.00
cat_B  30.00  33.00
cat_C  28.75  59.25

where each column contain the mean of the values that are in each category.


For that, I made the following code, where we create empty DataFrames with the column values of the minor dfs and indices from the different values of categories. I then fill this dataframe using a for loop where I iterate over every value of the index.

copy_dfs = {}
for k, min_df in minor_dfs.items():
    # Get columns from minor df
    # Get index from category of major df
    col_names = min_df.columns.values
    ind_values = major_df.category.values

    # Create a df with columns and indices and set values to np.nan
    copy_df = pd.DataFrame(np.nan, index=ind_values, columns=col_names)
    copy_df = copy_df.drop("category", axis=1)

    # For each category in the index of the dataframe
    for maj_category in copy_df.index:
        # Select rows in minor df where category is the same as major df category
        minor_rows = min_df[min_df.category == maj_category]
        minor_rows = minor_rows.drop("category", axis=1)
        # Compute the mean values (by column) of the rows that were selected
        # Add the mean values into copy_df, where the index corresponds to major df category
        copy_df.loc[maj_category] = minor_rows.mean()

    # Store into dict
    copy_dfs[k] = copy_df

Yet, I think that this code could be optimized using vectorized operations, especially in the part where I iterate for each row. So I was wondering if there was a easier and clever way to accomplish what I’m trying to do ?

>Solution :

This?

import pandas as pd

df = pd.read_excel('test.xlsx')
df1 = df.groupby(['category']).mean()
print(df)
print(df1)

output:

    C   D category
0  71  44    cat_C
1   5  88    cat_C
2   8  78    cat_C
3  31  27    cat_C
4  42  48    cat_B
5  18  18    cat_B
6  84  23    cat_A
7  94  23    cat_A


              C      D
category
cat_A     89.00  23.00
cat_B     30.00  33.00
cat_C     28.75  59.25
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading