How to get a stratified random sample of indices?

I have an array (pd.Series) of two values (A’s and B’s, for example).

import pandas as pd

y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])


0 A
1 B
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 B
10 B

I want to get a random sample of indices from the series, where half of the indices correspond to an A and the other half correspond to a B.

For example

get_random_stratified_sample_of_indices(y=y, n=4)

[0, 1, 2, 4]

Indices 0 and 2 correspond to A’s, and indices 1 and 4 correspond to B’s.

Another example

get_random_stratified_sample_of_indices(y=y, n=6)

[1, 4, 5, 0, 2, 3]

The order of the returned list of indices doesn’t matter, but it must be evenly split between indices of A’s and B’s from the y array.

My plan was to first find the indices of the A’s and take a random sample (size=n/2) of them, then repeat for the B’s.
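The plan above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name `sample_half_per_label` and its `seed` parameter are made up for this example, and it assumes every label has at least `n // (number of labels)` entries.

```python
import numpy as np
import pandas as pd

y = pd.Series(['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'B'])

def sample_half_per_label(y, n, seed=None):
    """Hypothetical helper: draw n // (number of labels) indices per label."""
    rng = np.random.default_rng(seed)
    per_label = n // y.nunique()
    picked = []
    for label in y.unique():
        # Indices whose value equals the current label
        label_idx = y.index[y == label].to_numpy()
        # Sample without replacement so no index is picked twice
        chosen = rng.choice(label_idx, size=per_label, replace=False)
        picked.extend(chosen.tolist())
    return picked

idx = sample_half_per_label(y, n=4, seed=0)
print(idx)
print(y.loc[idx])
```

With two labels this takes n/2 indices from the A’s and n/2 from the B’s, which is the split described in the plan.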

Solution:

You can use groupby.sample:

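N = 4  # total number of indices to draw

idx = (y
  .index.to_series()             # Series whose values are the index labels of y
  .groupby(y)                    # group those labels by the value in y ('A' or 'B')
  .sample(n=N//len(y.unique()))  # draw N // (number of groups) labels per group
  .to_list()
 )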

Example output (the sample is random, so it varies between runs): [3, 8, 10, 1]

Check with y.loc[idx]:

3     A
8     A
10    B
1     B
dtype: object
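To match the call signature used in the question, the groupby.sample approach could be wrapped in a function. This is a minimal sketch: the `random_state` argument is an addition for reproducibility and is not part of the original question, and it is simply passed through to pandas' `sample`.

```python
import pandas as pd

def get_random_stratified_sample_of_indices(y, n, random_state=None):
    """Return about n index labels of y, split evenly across its distinct values."""
    # Integer division: if n is not a multiple of the number of groups,
    # fewer than n indices are returned.
    per_group = n // y.nunique()
    return (
        y.index.to_series()
        .groupby(y)
        .sample(n=per_group, random_state=random_state)
        .to_list()
    )

y = pd.Series(['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'B'])

idx = get_random_stratified_sample_of_indices(y, n=6, random_state=42)
print(idx)
print(y.loc[idx].value_counts())
```

Because sampling is done without replacement, each group needs at least n // (number of groups) members; in this series there are only five A’s, so n can be at most 10.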