Generate multiple disjoint samples from a dataframe

November 28, 2024

I am computing statistics on a very large dataframe by taking the sums of multiple random samples. I would like the samples to be disjoint (no row should appear in two different samples).

Minimal example that might use some numbers multiple times:

import polars as pl
import numpy as np

df = pl.DataFrame({"a": np.random.random(1000)})

N_samples = 50
N_logs = 20
sums = [
    df.sample(N_logs).select(pl.col("a").sum())[0, 0]
    for _ in range(N_samples)
]

How can I avoid using the same number in more than one sample?

>Solution :

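You can draw all N_samples * N_logs rows in a single call with with_replacement=False (the default), so no row is picked twice, and then group consecutive blocks of N_logs rows and sum each block. This requires N_samples * N_logs to be at most the number of rows in df: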

(
    df
    .sample(N_samples * N_logs)
    .group_by(pl.int_range(pl.len()) // N_logs)
    .sum()
    .get_column("a")
)

shape: (50,)
Series: 'a' [f64]
[
    9.993712
    10.667377
    9.983055
    7.092786
    10.780031
    …
    9.384218
    8.57084
    10.085927
    12.77378
    10.23612
]