Get x number of rows in a dataframe at equally spaced index

August 9, 2023

I have a dataframe that looks something like

            time        value1        value2
    1    1000000    1000009842    1009809435
    2    1000032    2348974923    2343242342
    3    1000342    2342345320    2342342234
    ...
    1000 4324342    2131242353    4234234234

I want to get 20 rows whose indexes are uniformly spaced, for example the indexes

10, 20, 30, 40, 50... 200

or

400, 420, 440, 460... 800

The starting index can be random; the only thing that needs to be constant is the spacing between the indexes of the returned rows.

I’ve used

df.sample(1000)

to get a sample of 1000 rows, but I don’t see a way of distributing the indexes equally.

>Solution :

Use df.iloc[slice_idx] for this, where slice_idx is an array of row positions that starts at a random position and increases by a constant step (width).

E.g.:

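import numpy as np

width = 10  # constant spacing between selected rows
idx0 = np.random.randint(0, len(df))  # random start position in [0, len(df))
slice_idx = np.arange(idx0, len(df), width)  # idx0, idx0 + width, ... up to the last row
df.iloc[slice_idx]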

This returns the rows at positions idx0, idx0+10, idx0+20, idx0+30, idx0+40, idx0+50, ... up to the end of the dataframe.
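As a rough illustration, the same selection can also be written as a plain slice with a step. The sketch below assumes a made-up 1000-row dataframe (the column values and the 1..1000 index are hypothetical stand-ins for the question's data) and a seeded generator, used only so the run is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's dataframe: 1000 rows, index 1..1000.
df = pd.DataFrame(
    {"time": np.arange(1000), "value1": np.arange(1000) * 2},
    index=np.arange(1, 1001),
)

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch reproducible
width = 10
idx0 = int(rng.integers(0, len(df)))  # random start position in [0, len(df))

by_array = df.iloc[np.arange(idx0, len(df), width)]  # explicit position array
by_slice = df.iloc[idx0::width]  # plain slice: start at idx0, step by width
```

Note that iloc works on positions, not index labels, so with an index that starts at 1 the first selected label is idx0 + 1.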

One thing to consider is the minimum number of rows returned: if idx0 falls near the end of the dataframe, only a few rows are selected. A minimum can be guaranteed by drawing idx0 below a certain limit. E.g.:

width = 20
min_elements = 10  # minimal number of selected elements
assert len(df) > min_elements * width  # ensure the parameters are valid

# select idx0 between 0 and len(df) - (min_elements - 1) * width - 1 (inclusive)
idx0 = np.random.randint(0, len(df) - (min_elements - 1) * width)
slice_idx = np.arange(idx0, len(df), width)
df.iloc[slice_idx]  # has at least `min_elements` items
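The snippet above guarantees at least min_elements rows, but the question asks for a fixed number (20). One possible way to get exactly that many is to bound the slice at both ends. The following is only a sketch: the helper name sample_evenly_spaced, its parameters, and the example dataframe are hypothetical and not part of the original answer, and it assumes n_rows and width are both at least 1.

```python
import numpy as np
import pandas as pd


def sample_evenly_spaced(df, n_rows, width, rng=None):
    """Return exactly n_rows rows, width positions apart, from a random start."""
    if n_rows < 1 or width < 1:
        raise ValueError("n_rows and width must both be at least 1")
    rng = np.random.default_rng() if rng is None else rng
    span = (n_rows - 1) * width  # distance from first to last selected position
    if span >= len(df):
        raise ValueError("dataframe too short for n_rows rows at this width")
    start = int(rng.integers(0, len(df) - span))  # upper bound is exclusive
    return df.iloc[start : start + span + 1 : width]


# Hypothetical 1000-row frame standing in for the question's data.
df = pd.DataFrame({"time": np.arange(1000)}, index=np.arange(1, 1001))
picked = sample_evenly_spaced(df, n_rows=20, width=10)
```

Here picked always holds 20 rows spaced 10 positions apart, and the start position is drawn so that the last selected row still lies inside the dataframe.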