I am computing some statistics on a very large dataframe by taking sums over multiple random samples. I would like the samples to be disjoint (no number should be present in two different samples).
Minimal example that might use some numbers multiple times:
import polars as pl
import numpy as np
df = pl.DataFrame(
{"a": np.random.random(1000)}
)
N_samples = 50
N_logs = 20
sums = [
df.sample(N_logs).select(pl.col("a").sum())[0,0]
for _ in range(N_samples)
]
How can I avoid using the same numbers more than once?
>Solution :
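You can sample all the rows at once using with_replacement=False (which is the default) and then aggregate them into N_samples sums of N_logs rows each. A single draw without replacement never picks the same row twice, so the groups are disjoint. Note that N_samples * N_logs must not exceed the number of rows in df: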
(
df
.sample(N_samples * N_logs)
.group_by(pl.int_range(pl.len()) // N_logs)
.sum()
.get_column("a")
)
shape: (50,)
Series: 'a' [f64]
[
9.993712
10.667377
9.983055
7.092786
10.780031
…
9.384218
8.57084
10.085927
12.77378
10.23612
]