Select number of values from column based on condition in a different df column

December 20, 2021

I am creating a dummy dataset for testing a cloud storage and dashboard system for a university. I am currently trying to assign courses to each student ID for a given term; this would be the course enrollment step in real life. Most students take a full load of 4 classes, and some take 3, 2, or 1 classes, with decreasing probability.

I have two pandas DataFrames, ‘courses’ and ‘students_master’.

‘courses’ has 1100 rows and looks like this:

  subject_id course_id SECTION_SUBJECT        SECTION_SUBJECT_DESC  \
0        HCH   HCH-101            HPCH  Community Health Promotion   
1        HCH   HCH-102            HPCH  Community Health Promotion   
2        HCH   HCH-103            HPCH  Community Health Promotion   
3        HCH   HCH-104            HPCH  Community Health Promotion   
4        HCH   HCH-105            HPCH  Community Health Promotion

‘students_master’ has 27054 rows and looks like this:

   ID_year_id  cohort      ids  level  num_classes
0       22180  2013FA  1001269      4            4
1       49919  2013FA  1000206      4            4
2       48206  2013FA  1000524      4            2
3       40649  2013FA  1000233      4            3
4       29733  2013FA  1000533      4            2

At this point I am trying to create a new column, students_master['selections'], where I use the number (1-4) in the 'num_classes' column to randomly select that many course_ids from courses['course_id']. The resulting column values would be small lists like [HCH-101, TWI-302,…]

When I use this piece of code:

list(courses['course_id'].sample(4))

it works, and results in:

['EVS-406', 'BFN-201', 'ATS-105', 'BOL-103']

I have tried using .apply as well as basic for loops, with no luck. I think the most promising method is to 'vectorize', so I wrote this np.select statement:

selections=[]
conditions = [
        (students_master['num_classes']==4),
        (students_master['num_classes']==3),
        (students_master['num_classes']==2),
        (students_master['num_classes']==1)
]
choices = [
        ([list(courses['course_id'].sample(4))]),
        ([list(courses['course_id'].sample(3))]),
        ([list(courses['course_id'].sample(2))]),
        ([list(courses['course_id'].sample(1))])
]


selections.append(np.select(conditions, choices))

and it gets the error: "shape mismatch: objects cannot be broadcast to a single shape"

Any advice on how to solve this problem is greatly appreciated.

>Solution :

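You can use apply so that the courses are sampled separately for each student and are not repeated within a student's list. np.select cannot do this: it evaluates each choice once and then tries to broadcast lists of different lengths against the conditions, which is what raises the shape mismatch.

students_master['selections'] = students_master['num_classes'].apply(lambda n: list(np.random.choice(courses['course_id'], n, replace=False)))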
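The apply approach can be tried end to end on toy data. The sketch below is only an illustration under stated assumptions: the two small frames stand in for the real `courses` and `students_master`, the `HCH-106` row and the helper name `pick_courses` are made up, and a seeded `np.random.default_rng` generator replaces `np.random.choice` so that the draw is reproducible.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames; HCH-106 is a made-up extra row.
courses = pd.DataFrame(
    {"course_id": ["HCH-101", "HCH-102", "HCH-103", "HCH-104", "HCH-105", "HCH-106"]}
)
students_master = pd.DataFrame(
    {
        "ids": [1001269, 1000206, 1000524, 1000233, 1000533],
        "num_classes": [4, 4, 2, 3, 2],
    }
)

# Seeded generator so repeated runs give the same selections.
rng = np.random.default_rng(seed=0)
course_ids = courses["course_id"].to_numpy()


def pick_courses(n):
    """Return n distinct course ids as a plain Python list (hypothetical helper)."""
    return rng.choice(course_ids, size=n, replace=False).tolist()


# One independent draw per student, sized by that student's num_classes.
students_master["selections"] = students_master["num_classes"].apply(pick_courses)
print(students_master)
```

Because the sampling happens inside the function that apply calls for each row, every student gets their own draw, and `replace=False` keeps a course from appearing twice in one student's list. Different students can still share courses, which is what real enrollment would look like.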

byMR
