I want to get multiple small slices out of a large time-series dataset (~ 25GB, ~800M rows). At the moment, this looks something like this:
import polars as pl
sample = pl.scan_csv(FILENAME, new_columns=["time", "force"]).slice(660_000_000, 3000).collect()
This takes roughly 0-5 minutes per slice, depending on where in the file the slice sits. If I want 5 slices, running everything takes maybe 15 minutes. Since Polars is reading the whole CSV anyway, I was wondering if there is a way to get all the slices I want in one go, so Polars only has to read the CSV once.
Chaining multiple slices (obviously) doesn't work; is there some other way?
> Solution:
You should be able to run them all in parallel with pl.collect_all():
lf = pl.scan_csv(FILENAME, new_columns=["time", "force"])

# build one lazy query per slice, all sharing the same scan
samples = [
    lf.slice(660_000_000, 3000),
    lf.slice(890_000_000, 5000),
    ...
]

# evaluate all queries in one call; returns a list of DataFrames
samples = pl.collect_all(samples)
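If the slice positions are known up front, the list of lazy queries can be built programmatically and the results unpacked in the same order. A minimal sketch, where the (offset, length) pairs are hypothetical placeholders:

import polars as pl

# hypothetical (offset, length) pairs; substitute the real slice positions
windows = [(660_000_000, 3000), (890_000_000, 5000)]

lf = pl.scan_csv(FILENAME, new_columns=["time", "force"])

# one lazy slice per window, all built on the same scan
queries = [lf.slice(offset, length) for offset, length in windows]

# collect_all evaluates the lazy frames together (in parallel) and
# returns a list of DataFrames in the same order as `queries`
samples = pl.collect_all(queries)

for (offset, length), df in zip(windows, samples):
    print(offset, length, df.shape)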