Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Filtering long format Pandas DF based on conditions from the dictionary

Imagine I have an order for specialists in some coding languages with multiple criterion in JSON format:

request = {'languages_required': {'Python': 4,
                                  'Java': 2},
           'other_requests': []
          }

languages_required means that the candidate must have a skill in the language and the number is the minimum level of this language.

The format of candidates dataframe is long:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df = pd.DataFrame({'candidate': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'd'],
                  'language': ['Python', 'Java', 'Scala', 'Python', 'R', 'Python', 'Java', 'Python', 'Scala', 'Java'],
                  'skill': [5, 4, 4, 6, 8, 1, 3, 5, 2, 2]})

That gives:


    candidate   language    skill
0       a       Python      5
1       a       Java        4
2       a       Scala       4
3       b       Python      6
4       b       R           8
5       c       Python      1
6       c       Java        3
7       d       Python      5
8       d       Scala       2
9       d       Java        2

What I need to do is to keep the candidates and their skills in required languages that meet the requirements from the request, that is:

  1. Have skills in both mentioned languages
  2. Skills in these languages are equal or higher than values in the dictionary

So the desired output would be:


    candidate   language    skill
0       a       Python      5
1       a       Java        4
7       d       Python      5
9       d       Java        2

I am able to filter the candidates with the languages based on keys() of the dictionary:

lang_mask = df[df['language'].isin(request['languages_required'].keys())]\
                                                                         .groupby('candidate')['language']\
                                                                         .apply(lambda x: set(request['languages_required']).issubset(x))

…but struggle with adding the ‘is higher than’ per language condition. I would really appreciate some help.

>Solution :

You need call first condition in one step and then second in another step:

df = df[df['language'].map(request['languages_required']).le(df['skill'])]
df = df[df.groupby('candidate')['language'].transform(lambda x: set(request['languages_required']).issubset(x))]
print (df)
  candidate language  skill
0         a   Python      5
1         a     Java      4
7         d   Python      5
9         d     Java      2

Or one row solution:

df = (df[df['language'].map(request['languages_required']).le(df['skill'])]
      .pipe(lambda x: x[x.groupby('candidate')['language'].transform(lambda x: set(request['languages_required']).issubset(x))]))

print (df)
  candidate language  skill
0         a   Python      5
1         a     Java      4
7         d   Python      5
9         d     Java      2
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading