Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Polars dataframe: rolling_sum look ahead

I want to calculate rolling_sum, but not over x rows above the current row, but over the x rows below the current row.

My solution is to sort the dataframe with descending=True before applying the rolling_sum and sort back to descending=False.

My solution:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import polars as pl

# Dummy dataset
df = pl.DataFrame({
        "Date": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        "Close": [-1, 1, 2, 3, 4, 4, 3, 2, 1, -1],
        "Company": ["A", "A", "A","A", "A",  "B", "B", "B", "B", "B"]
    })

# Solution using sort twice

(
    df
    .sort(by=["Company", "Date"], descending=[True, True])
    .with_columns(
        pl.col("Close").rolling_sum(3).over("Company").alias("Cumsum_lead")
    )
    .sort(by=["Company", "Date"], descending=[False, False])
)

Is there a better solution?

With better I mean:

  • more computational efficient and/or
  • less code / easier to read

Thanks!

EDIT:

I just thought of one other solution which is avoids sorting / reversing the column altogether: using shift

(
    df
    .with_columns(
        pl.col("Close")
      .rolling_sum(3)
      .shift(-2)
      .over("Company").alias("Cumsum_lead"))
)

>Solution :

You can avoid sorting the rows and instead reverse the specific column twice using pl.Expr.reverse.

(
    df
    .with_columns(
        pl.col("Close")
        .reverse().rolling_sum(3).reverse()
        .over("Company").alias("Cumsum_lead")
    )
)

For readability, this could also be wrapped into a helper function.

def rolling_sum_lead(expr: pl.Expr, window_size: int) -> pl.Expr:
    return expr.reverse().rolling_sum(window_size).reverse()

(
    df
    .with_columns(
        rolling_sum_lead(pl.col("Close"), 3).over("Company").alias("Cumsum_lead")
    )
)

Note. On my machine, this takes 124 µs ± 5.67 µs per loop in contrast to 205 µs ± 6.9 µs per loop for the solution using pl.DataFrame.sort.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading