Generate multiple disjunct samples from a dataframe

I am computing statistics on a very large dataframe by taking the sums of multiple random samples. I would like the samples to be disjoint (no number should appear in two different samples).

Minimal example that might use some numbers multiple times:

import polars as pl
import numpy as np

df = pl.DataFrame({"a": np.random.random(1000)})

N_samples = 50
N_logs = 20
sums = [
    df.sample(N_logs).select(pl.col("a").sum())[0, 0]
    for _ in range(N_samples)
]

How can I avoid using the same number in more than one sample?

Solution:

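You can draw all the rows in a single call to sample, which uses with_replacement=False by default, so no row is picked twice. Then aggregate them into N_samples sums by grouping consecutive blocks of N_logs rows. This requires N_samples * N_logs to be at most the number of rows (here 50 * 20 = 1000, exactly the height of the dataframe):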

(
    df
    .sample(N_samples * N_logs)
    .group_by(pl.int_range(pl.len()) // N_logs)
    .sum()
    .get_column("a")
)
shape: (50,)
Series: 'a' [f64]
[
    9.993712
    10.667377
    9.983055
    7.092786
    10.780031
    …
    9.384218
    8.57084
    10.085927
    12.77378
    10.23612
]
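
To see why this gives disjoint samples, the same idea can be sketched in plain Python without polars: draw all the row indices once without replacement, then cut the draw into consecutive chunks. This is only an illustration of the principle, not the polars implementation; the function name disjoint_sample_sums and its seed parameter are made up for this example.

```python
import random


def disjoint_sample_sums(values, n_samples, n_logs, seed=None):
    """Sum n_samples disjoint random samples of n_logs values each.

    Hypothetical helper for illustration; not part of polars.
    """
    needed = n_samples * n_logs
    if needed > len(values):
        raise ValueError("not enough rows for disjoint samples")
    rng = random.Random(seed)
    # Draw every index in one go, without replacement, so none repeats.
    picked = rng.sample(range(len(values)), needed)
    # Cut the draw into consecutive chunks of n_logs indices, one per sample.
    chunks = [picked[i:i + n_logs] for i in range(0, needed, n_logs)]
    sums = [sum(values[j] for j in chunk) for chunk in chunks]
    return sums, chunks


values = [random.random() for _ in range(1000)]
sums, chunks = disjoint_sample_sums(values, n_samples=50, n_logs=20, seed=0)

# Every index appears in exactly one chunk, so the samples are disjoint.
all_indices = [j for chunk in chunks for j in chunk]
print(len(sums), len(set(all_indices)) == len(all_indices))
```

Because the indices come from a single draw without replacement, disjointness follows from the chunking alone and does not depend on the order in which the rows come back.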