Filtering long format Pandas DF based on conditions from the dictionary

July 22, 2022

Imagine I have an order for specialists in some coding languages with multiple criterion in JSON format:

request = {'languages_required': {'Python': 4,
                                  'Java': 2},
           'other_requests': []
          }

languages_required means that the candidate must have a skill in the language and the number is the minimum level of this language.

The format of candidates dataframe is long:

df = pd.DataFrame({'candidate': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'd'],
                  'language': ['Python', 'Java', 'Scala', 'Python', 'R', 'Python', 'Java', 'Python', 'Scala', 'Java'],
                  'skill': [5, 4, 4, 6, 8, 1, 3, 5, 2, 2]})

That gives:


    candidate   language    skill
0       a       Python      5
1       a       Java        4
2       a       Scala       4
3       b       Python      6
4       b       R           8
5       c       Python      1
6       c       Java        3
7       d       Python      5
8       d       Scala       2
9       d       Java        2

What I need to do is to keep the candidates and their skills in required languages that meet the requirements from the request, that is:

Have skills in both mentioned languages
Skills in these languages are equal or higher than values in the dictionary

So the desired output would be:


    candidate   language    skill
0       a       Python      5
1       a       Java        4
7       d       Python      5
9       d       Java        2

I am able to filter the candidates with the languages based on keys() of the dictionary:

lang_mask = df[df['language'].isin(request['languages_required'].keys())]\
                                                                         .groupby('candidate')['language']\
                                                                         .apply(lambda x: set(request['languages_required']).issubset(x))

…but struggle with adding the ‘is higher than’ per language condition. I would really appreciate some help.

>Solution :

You need call first condition in one step and then second in another step:

df = df[df['language'].map(request['languages_required']).le(df['skill'])]
df = df[df.groupby('candidate')['language'].transform(lambda x: set(request['languages_required']).issubset(x))]
print (df)
  candidate language  skill
0         a   Python      5
1         a     Java      4
7         d   Python      5
9         d     Java      2

Or one row solution:

df = (df[df['language'].map(request['languages_required']).le(df['skill'])]
      .pipe(lambda x: x[x.groupby('candidate')['language'].transform(lambda x: set(request['languages_required']).issubset(x))]))

print (df)
  candidate language  skill
0         a   Python      5
1         a     Java      4
7         d   Python      5
9         d     Java      2