Here is some Polars code along with test data:
import polars as pl
data = {'test': [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1]}
df = pl.DataFrame(data)
I want to achieve the following result:
[0, 1, 2, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2, 3, 4, 0, 1, 0, 0, 1, 2, 0, 0, 0, 1, 2, 3, 4, 0, 1, 0, 0, 1]
The goal is to keep the original 0 values unchanged, count up through each run of consecutive 1s, and reset the count to 0 whenever a 0 is encountered.
The result should still be a pl.DataFrame.
The dataset is too large for anything like a Python for loop to be practical.
How can I do this using only Polars functions, without NumPy or other libraries?
>Solution :
One way to look at the desired output is as a cumulative count within groups, where a new group starts every time a 0 appears in the input. With that in mind, you can build the following expression:
df.with_columns(
    pl.col("test")
    .cumcount()
    .over(pl.when(pl.col("test") == 0).then(1).cumsum().forward_fill())
)
The when/then inside over yields a literal 1 at every 0 and null everywhere else. Taking the cumsum of that numbers each 0 in turn, and forward_fill carries that number over the 1s that follow, which creates the groups we need.