‘Is there a way to replicate the groupby -> transform functionality in Polars?
I use this for normalising over groups (make groups sum to one) i.e.
df['normalised'] = df.groupby[*groupcols*].transform(lambda x: x/x.sum())
which I did a work around in polars by adding a column for the sum then dividing by that column:
df.join((df.groupby(by=[*groupcols*]).agg((pl.col('VOL').sum()).alias('VOLSUM'))),
left_on = [*groupcols*],
right_on = [*groupcols*]).with_column(
(pl.col('VOL') / pl.col('VOLSUM')).alias('VOLNORM')
).drop('VOLSUM')
However, another one I need is cummulative sum within groups:
df['cummulativesum'] = df.groupby[*groupcols*][col].transform('cumsum')
2 questions for Polars people: is my normalisation over groups the best way? is there a way to do a polars equivalent of cumsum within groups in polars??
Thanks in advance!
>Solution :
You can use Window Expressions
DataFrame:
>>> df = pl.DataFrame(dict(GROUP=[1, 2, 2, 1, 3], VOL=[1, 2, 1, 5, 1]))
>>> df
shape: (5, 2)
┌───────┬─────┐
│ GROUP | VOL │
│ --- | --- │
│ i64 | i64 │
╞═══════╪═════╡
│ 1 | 1 │
├───────┼─────┤
│ 2 | 2 │
├───────┼─────┤
│ 2 | 1 │
├───────┼─────┤
│ 1 | 5 │
├───────┼─────┤
│ 3 | 1 │
└─//────┴─//──┘
.with_columns() and .over():
>>> df.with_columns([
... (pl.col("VOL") / pl.col("VOL").sum()).over("GROUP").alias("VOLNORM"),
... pl.col("VOL").cumsum().over("GROUP").alias("VOLCUMSUM")
... ])
shape: (5, 4)
┌───────┬─────┬──────────┬───────────┐
│ GROUP | VOL | VOLNORM | VOLCUMSUM │
│ --- | --- | --- | --- │
│ i64 | i64 | f64 | i64 │
╞═══════╪═════╪══════════╪═══════════╡
│ 1 | 1 | 0.166667 | 1 │
├───────┼─────┼──────────┼───────────┤
│ 2 | 2 | 0.666667 | 2 │
├───────┼─────┼──────────┼───────────┤
│ 2 | 1 | 0.333333 | 3 │
├───────┼─────┼──────────┼───────────┤
│ 1 | 5 | 0.833333 | 6 │
├───────┼─────┼──────────┼───────────┤
│ 3 | 1 | 1.0 | 1 │
└─//────┴─//──┴─//───────┴─//────────┘