Select number of values from column based on condition in a different df column

I am creating a dummy dataset for testing a cloud storage and dashboard system for a university. I am currently trying to assign courses to each student ID for a given term; this would be the course enrollment step in real life. Most students take a full load of 4 classes, and some take 3, 2, or 1 class, with decreasing probability.

I have two pandas DataFrames, ‘courses’ and ‘students_master’.

‘courses’ has 1100 rows and looks like this:

  subject_id course_id SECTION_SUBJECT        SECTION_SUBJECT_DESC  \
0        HCH   HCH-101            HPCH  Community Health Promotion   
1        HCH   HCH-102            HPCH  Community Health Promotion   
2        HCH   HCH-103            HPCH  Community Health Promotion   
3        HCH   HCH-104            HPCH  Community Health Promotion   
4        HCH   HCH-105            HPCH  Community Health Promotion 

‘students_master’ has 27054 rows and looks like this:

 ID_year_id  cohort      ids  level num_classes
0       22180  2013FA  1001269      4           4
1       49919  2013FA  1000206      4           4
2       48206  2013FA  1000524      4           2
3       40649  2013FA  1000233      4           3
4       29733  2013FA  1000533      4           2

At this point I am trying to create a new column, students_master[‘selections’], where I use the number, 1-4, in the ‘num_classes’ column to randomly select a number of course_ids from courses[‘course_id’]. The resulting column values would be small lists like [HCH-101, TWI-302,…]

When I use this piece of code:

list(courses['course_id'].sample(4))

it works, and results in:

['EVS-406', 'BFN-201', 'ATS-105', 'BOL-103']

I have tried using .apply as well as basic for loops, with no luck. I think the most promising approach is to ‘vectorize’, so I wrote this np.select statement:

selections=[]
conditions = [
        (students_master['num_classes']==4),
        (students_master['num_classes']==3),
        (students_master['num_classes']==2),
        (students_master['num_classes']==1)
]
choices = [
        ([list(courses['course_id'].sample(4))]),
        ([list(courses['course_id'].sample(3))]),
        ([list(courses['course_id'].sample(2))]),
        ([list(courses['course_id'].sample(1))])
]


selections.append(np.select(conditions, choices))

and it raises the error: "shape mismatch: objects cannot be broadcast to a single shape"

Any advice on how to solve this problem is greatly appreciated.

>Solution :

You can use apply with np.random.choice and replace=False, which ensures that courses are not repeated within each student:

selection = students_master['num_classes'].apply(lambda x: np.random.choice(courses['course_id'], x, replace=False))
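For reference, here is a minimal self-contained sketch of the same idea that stores the result in students_master['selections'] as plain lists. The two small frames are hypothetical stand-ins for the real data (only the 'course_id' and 'num_classes' columns matter), and the seeded generator `rng` is an assumption added so the draw is reproducible; it is not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real 'courses' and 'students_master' frames.
courses = pd.DataFrame(
    {"course_id": ["HCH-101", "HCH-102", "HCH-103", "HCH-104", "HCH-105"]}
)
students_master = pd.DataFrame(
    {
        "ids": [1001269, 1000206, 1000524, 1000233, 1000533],
        "num_classes": [4, 4, 2, 3, 2],
    }
)

# Seeded generator (an assumption, for reproducible test data).
rng = np.random.default_rng(seed=0)

# Pull the course ids out once so each row samples from a plain array.
course_ids = courses["course_id"].to_numpy()

# One independent draw per student; replace=False prevents a student
# from getting the same course twice.
students_master["selections"] = students_master["num_classes"].apply(
    lambda n: rng.choice(course_ids, size=n, replace=False).tolist()
)

print(students_master[["ids", "num_classes", "selections"]])
```

The np.select attempt fails because every entry in choices has to broadcast against the conditions, and lists of length 4, 3, 2 and 1 cannot be broadcast to one shape. Each .sample(...) call there is also evaluated only once, so even with matching shapes every student with the same num_classes would receive the same courses.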
                                            