Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to method chaining in pandas to aggregate a DataFrame?

I want to aggregate a pandas DataFrame using method chaining. I don’t know how to start with the DataFrame and just refer to it when aggregating (using method chaining). Consider the following example that illustrates my intention:

Having this data:

import pandas as pd

my_df = pd.DataFrame({
    'name': ['john', 'diana', 'rachel', 'chris'],
    'favorite_color': ['red', 'blue', 'green', 'red']
})

my_df
#>      name favorite_color
#> 0    john            red
#> 1   diana           blue
#> 2  rachel          green
#> 3   chris            red

and I want to end up with this summary table:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

#>    total_people  total_ppl_who_like_red
#> 0             4                       2

Of course there are so many ways to do it. One way, for instance, would be to build a new DataFrame:

desired_output_via_building_new_df = pd.DataFrame({
    'total_people': [len(my_df)],
    'total_ppl_who_like_red': [my_df.favorite_color.eq('red').sum()]
    })

desired_output_via_building_new_df 
#>    total_people  total_ppl_who_like_red
#> 0             4                       2

However, I’m looking for a way to use "method chaining"; starting with my_df and working my way forward. Something along the lines of

# pseudo-code; not really working
my_df.agg({
    'total_people': lambda x: len(x),
    'total_ppl_who_like_red': lambda x: x.favorite_color.eq('red').sum()
    })

I can only borrow inspiration from R/dplyr code:

library(dplyr, warn.conflicts = FALSE)

my_df <- 
  data.frame(name = c("john", "diana", "rachel", "chris"),
             favorite_color = c("red", "blue", "green", "red")
             )

my_df |> 
  summarise(total_people = n(),                                     ## in the context of `summarise()`,
            total_ppl_who_like_red = sum(favorite_color == "red"))  ## both `n()` and `sum()` refer to `my_df` because we start with `my_df` and pipe it "forward" to `summarise()`
#>   total_people total_ppl_who_like_red
#> 1            4                      2

>Solution :

Solution for processing one Series:

df = my_df.favorite_color.apply({
    'total_people': lambda x: x.count(),
    'total_ppl_who_like_red': lambda x: x.eq('red').sum()
    }).to_frame(name=0).T
print (df)
   total_people  total_ppl_who_like_red
0             4                       2

General solution for processing DataFrame with DataFrame.pipe – then pandas processing input DataFrame, if use apply or agg processing columns separately:

df = (my_df.pipe(lambda x: pd.Series({'total_people': len(x), 
                                     'total_ppl_who_like_red': 
                                        x.favorite_color.eq('red').sum()}))
           .to_frame(name=0).T)
print (df)
   total_people  total_ppl_who_like_red
0             4                       2

df = my_df2.pipe(lambda x: pd.Series({'total_people': len(x), 
                           'total_ppl_who_like_red': 
                               x.favorite_color.eq('red').sum(),
                           'max_age':x.age.max()
                          }).to_frame(name=0).T)
print (df)
   total_people  total_ppl_who_like_red  max_age
0             4                       2       41
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading