I want to get multiple small slices out of a large time-series dataset (~ 25GB, ~800M rows). At the moment, this looks something like this:
```python
import polars as pl

sample = pl.scan_csv(FILENAME, new_columns=["time", "force"]).slice(660_000_000, 3000).collect()
```
This code takes anywhere from a few seconds to about 5 minutes, depending on the position of the slice. If I want 5 slices, running everything takes maybe 15 minutes. However, since Polars is reading the whole CSV anyhow, I was wondering whether there is a way to get all the slices I want in one go, so Polars only has to read the CSV once.
Chaining multiple slices (obviously) doesn't work; is there some other way?
You should be able to run them all in parallel with:

```python
lf = pl.scan_csv(FILENAME, new_columns=["time", "force"])

samples = [
    lf.slice(660_000_000, 3000),
    lf.slice(890_000_000, 5000),
    ...
]

samples = pl.collect_all(samples)
```