Question
A minimal reproducible example of my data frame looks like this:
import pandas as pd
df = pd.DataFrame({'patient': ['patient1', 'patient1', 'patient1','patient2', 'patient2', 'patient3','patient3','patient4','patient4','patient4','patient4'],
'gene':['TYR','TYR','TYR','TYR','TYR','TYR','TYR','TYR','TYR', 'TYR','TYR'],
'variant': ['buu', 'luu', 'stm','lol', 'bla', 'buu', 'lol','buu', 'luu', 'IDK','ploy'],
'genotype': ['hom', 'het', 'hom','het', 'hom', 'het', 'het','het', 'hom', 'hom','hom']})
df
patient gene variant genotype
0 patient1 TYR buu hom
1 patient1 TYR luu het
2 patient1 TYR stm hom
3 patient2 TYR lol het
4 patient2 TYR bla hom
5 patient3 TYR buu het
6 patient3 TYR lol het
7 patient4 TYR buu het
8 patient4 TYR luu hom
9 patient4 TYR IDK hom
10 patient4 TYR ploy hom
I need to identify the patients who carry both variants "buu" and "luu", and keep only the rows for those two variants.
Expected result:
patient1 TYR buu hom
patient1 TYR luu het
patient4 TYR buu het
patient4 TYR luu hom
Solution
Keep only the rows whose variant is one of the required variants, then group by patient and keep the groups whose set of variants contains both required variants (buu, luu):
var_set = {'buu', 'luu'} # set of variants
df[df['variant'].isin(var_set)].groupby('patient')\
.filter(lambda x: set(x['variant']) >= var_set)
set(A) >= set(B) gives True when set A is a superset of set B.
patient gene variant genotype
0 patient1 TYR buu hom
1 patient1 TYR luu het
7 patient4 TYR buu het
8 patient4 TYR luu hom
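For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The helper name rows_with_all_variants and its parameters are my own and only for illustration; they are not part of pandas. It first shows the superset comparison on plain sets, then applies the filter to the example data frame.

```python
import pandas as pd

# The superset test on plain Python sets
print({'buu', 'luu', 'stm'} >= {'buu', 'luu'})  # True: both required variants are present
print({'buu', 'lol'} >= {'buu', 'luu'})         # False: 'luu' is missing

df = pd.DataFrame({
    'patient': ['patient1', 'patient1', 'patient1', 'patient2', 'patient2',
                'patient3', 'patient3', 'patient4', 'patient4', 'patient4', 'patient4'],
    'gene': ['TYR'] * 11,
    'variant': ['buu', 'luu', 'stm', 'lol', 'bla', 'buu', 'lol',
                'buu', 'luu', 'IDK', 'ploy'],
    'genotype': ['hom', 'het', 'hom', 'het', 'hom', 'het', 'het',
                 'het', 'hom', 'hom', 'hom'],
})


def rows_with_all_variants(frame, required):
    """Hypothetical helper: rows of the required variants, restricted to
    patients that carry every variant in `required`."""
    required = set(required)
    # Drop rows for variants we are not interested in
    candidates = frame[frame['variant'].isin(required)]
    # Keep only patients whose remaining variants cover the whole required set
    return candidates.groupby('patient').filter(
        lambda group: set(group['variant']) >= required
    )


result = rows_with_all_variants(df, {'buu', 'luu'})
print(result)
```

Filtering with isin before grouping is what removes the other variants (stm, IDK, ploy) from the output; without it, filter would return every row of the matching patients. patient3 is dropped because it has buu but not luu.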