Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Properly groupby and filter with Polar

I have df for my work with 3 main columns: cid1, cid2, cid3, and more 7 columns cid4, cid5, etc.

cid1 and cid2 is int, another columns is float.

Each combitations of cid1 and cid2 is a workset with some rows where is values of all other columns is different. I want to filter df and receive my df with only max values in column cid3 for each combination of cid1 and cid2. cid4 and next columns must be leaved without changes.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

This code helps me with one part of my task:

df = (df
    .groupby(["cid1", "cid2"])
    .agg([pl.max("cid3").alias("max_cid3")])
)

It’s receives only 3 columns: cid1, cid2, max_cid3 and filter all rows when cid3 is not maximal.
But I can’t find how to receive all another columns (cid4, etc) for that rows without changes.

df = (df
    .groupby(["cid1", "cid2"])
    .agg([pl.max("cid3").alias("max_cid3"), pl.col("cid4")])
)

I tried to add pl.col("cid4") to list of aggs but in column I see as values different lists of some cid4 values.

How I can make it properly? Maybe Polars haves another way to make it then groupby?

In Pandas I can make it:

import pandas as pd
import numpy as np

df["max_cid3"] = df.groupby(['cid1', 'cid2'])['cid3'].transform(np.max)

And then filter df wherever cid3==max_cid3
But I can’t find a way to make it in Polars.

Thank you!

>Solution :

In polars you can use a Window function

df.with_column(
   pl.col("cid3").max().over(["cid1", "cid2"])
     .alias("max_cid3")
)
shape: (5, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ max_cid3 β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---      β”‚
β”‚ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64      β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ══════β•ͺ══════β•ͺ══════════║
β”‚ 1    ┆ 1    ┆ 1    ┆ 4    ┆ 4    ┆ 1        β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 2    ┆ 2    ┆ 2    ┆ 5    ┆ 5    ┆ 9        β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 2    ┆ 2    ┆ 9    ┆ 6    ┆ 4    ┆ 9        β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 1    ┆ 1    ┆ 1    ┆ 7    ┆ 9    ┆ 1        β”‚
β”œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”Όβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ”€
β”‚ 3    ┆ 3    ┆ 1    ┆ 8    ┆ 3    ┆ 1        β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

You could also put it directly inside .filter()

df.filter(
    pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"])
)

Data used:

df = pl.DataFrame({
   "cid1": [1, 2, 2, 1, 3],
   "cid2": [1, 2, 2, 1, 3],
   "cid3": [1, 2, 9, 1, 1],
   "cid4": [4, 5, 6, 7, 8],
   "cid5": [4, 5, 4, 9, 3],
})
>>> df.to_pandas().groupby(["cid1", "cid2"])["cid3"].transform("max")
0    1
1    9
2    9
3    1
4    1
Name: cid3, dtype: int64
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading