How can I efficiently get multiple slices out of a large dataset?

October 9, 2023

I want to get multiple small slices out of a large time-series dataset (~25 GB, ~800M rows). At the moment, my code looks something like this:

import polars as pl

sample = pl.scan_csv(FILENAME, new_columns=["time", "force"]).slice(660_000_000, 3000).collect()

This code takes up to 5 minutes, depending on the position of the slice I want to get. If I want 5 slices, it takes maybe 15 minutes to run everything. However, since polars is reading the whole CSV anyway, I was wondering if there is a way to get all the slices I want in one go, so polars only has to read the CSV once.

Chaining multiple slices (obviously) doesn’t work. Is there some other way?

>Solution :

You should be able to run them all in parallel with pl.collect_all(), which takes a list of LazyFrames and returns a list of DataFrames in the same order:

lf = pl.scan_csv(FILENAME, new_columns=["time", "force"])

samples = [
    lf.slice(660_000_000, 3000),
    lf.slice(890_000_000, 5000),
    ...
]

samples = pl.collect_all(samples)