How can I efficiently get multiple slices out of a large dataset?

I want to get multiple small slices out of a large time-series dataset (~25 GB, ~800M rows). At the moment, my code looks something like this:

import polars as pl

sample = pl.scan_csv(FILENAME, new_columns=["time", "force"]).slice(660_000_000, 3000).collect()

This code takes anywhere from 0 to 5 minutes, depending on the position of the slice I want. Getting 5 slices takes maybe 15 minutes in total. However, since polars is reading the whole CSV anyway, I was wondering whether there is a way to get all the slices I want in one go, so that polars only has to read the CSV once.

Chaining multiple slices (obviously) doesn’t work, since each slice is applied to the result of the previous one. Is there some other way?

Solution:

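You should be able to run them all in parallel with pl.collect_all(). Build one lazy query per slice from the same LazyFrame, then pass the list of queries to pl.collect_all(), which returns one DataFrame per query: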

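lf = pl.scan_csv(FILENAME, new_columns=["time", "force"])

# One lazy query per slice; nothing is read from disk yet.
queries = [
    lf.slice(660_000_000, 3000),
    lf.slice(890_000_000, 5000),
    # ... further slices
]

# Collect all queries together; the results come back in the same order.
samples = pl.collect_all(queries)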