I have an array (pd.Series) of two values (A’s and B’s, for example).
import pandas as pd
y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
0 A
1 B
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 B
10 B
I want to get a random sample of indices from the series, where half of the indices correspond to an A and the other half correspond to a B.
For example
get_random_stratified_sample_of_indices(y=y, n=4)
[0, 1, 2, 4]
Indices 0 and 2 correspond to A’s, and indices 1 and 4 correspond to B’s.
Another example
get_random_stratified_sample_of_indices(y=y, n=6)
[1, 4, 5, 0, 2, 3]
The order of the returned list of indices doesn’t matter, but it must be split evenly between indices of A’s and B’s from the y array.
My plan was to first collect the indices of the A’s and take a random sample (size=n/2) of them, then repeat for the B’s.
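That plan could be sketched roughly as follows. This is only a minimal illustration: the helper name `sample_half_per_label` and its `seed` parameter are hypothetical, and it assumes exactly two labels and an even `n`.

```python
import numpy as np
import pandas as pd

def sample_half_per_label(y, n, seed=None):
    """Hypothetical helper: draw n // 2 indices for each distinct label in y."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in y.unique():
        # Index values whose entry in y equals the current label
        label_idx = y.index[y == label].to_numpy()
        # Sample without replacement so no index is returned twice
        picked.extend(rng.choice(label_idx, size=n // 2, replace=False).tolist())
    return picked

y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
idx = sample_half_per_label(y, n=4, seed=0)
```

The result is random, so the exact indices depend on the seed, but two of them should point at A’s and two at B’s.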
>Solution :
You can use groupby.sample:
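N = 4  # total number of indices to draw
idx = (y
    .index.to_series()             # Series whose values are the index labels
    .groupby(y)                    # group the index labels by their A/B value
    .sample(n=N//len(y.unique()))  # same number of draws from each group
    .to_list()
)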
Example output (the sample is random, so it varies between runs): [3, 8, 10, 1]
Check with y.loc[idx]:
3 A
8 A
10 B
1 B
dtype: object
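If it helps, the same idea can be wrapped in a function with the name used in the question. This is a sketch rather than a definitive implementation: the `random_state` argument and the divisibility check are additions made here for reproducibility and safety, not part of the original answer.

```python
import pandas as pd

def get_random_stratified_sample_of_indices(y, n, random_state=None):
    """Return n index labels of y, split evenly across its distinct values."""
    n_groups = y.nunique()
    if n % n_groups:
        # Assumption: an uneven split is treated as an error instead of rounding
        raise ValueError(f"n={n} cannot be split evenly across {n_groups} groups")
    return (
        y.index.to_series()
        .groupby(y)
        .sample(n=n // n_groups, random_state=random_state)
        .to_list()
    )

y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
idx = get_random_stratified_sample_of_indices(y, n=6, random_state=0)
```

Note that `sample` draws without replacement by default, so it raises a `ValueError` if any group has fewer than `n // n_groups` members.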