How can I reduce the amount of data in a polars DataFrame?

I have a CSV file with a size of 28 GB which I want to plot. That is obviously far too many data points to plot directly, so how can I reduce the data? I would like to merge about 1000 data points into one by calculating the mean. This is the structure of my DataFrame:

Time in seconds  Force in N
f64              f64
0.0              2310.18
0.0005           2313.23
0.001            2314.14

I thought about using groupby_dynamic and then calculating the mean of each group, but that only seems to work with datetimes? The time in seconds is given as a float, however.


Solution:

You can also group by an integer column to create groups of size N:

In case of a groupby_dynamic on an integer column, the windows are defined by:

"1i"  # length 1
"10i" # length 10
We can use .int_range() to add an integer row count to group on:

import polars as pl

df = pl.DataFrame({"force": ["A", "B", "C", "D", "E", "F", "G"]})

# Add an integer row count, then group into windows of 2 rows each
(df.with_columns(row_nr = pl.int_range(0, pl.count()))
   .groupby_dynamic(
      index_column = "row_nr",
      every = "2i"
   )
   .agg("force")
)
shape: (4, 2)
┌────────┬────────────┐
│ row_nr ┆ force      │
│ ---    ┆ ---        │
│ i64    ┆ list[str]  │
╞════════╪════════════╡
│ 0      ┆ ["A", "B"] │
│ 2      ┆ ["C", "D"] │
│ 4      ┆ ["E", "F"] │
│ 6      ┆ ["G"]      │
└────────┴────────────┘