I have a DataFrame that looks like this:
id sentences ind tar
0 In samples of depression injected intraneously... depression albumin
0 Monomethylmethacrylate in whole blood was asso... depression albumin
1 In samples of depression injected intraneously... depression hip
1 Monomethylmethacrylate in whole blood was asso... depression hip
2 The GVH kinetics and cellular characteristics ... GVH,GVH,GVH,GVH... PFC
2 Effects on PFCgeneword responses to thymus-dep... GVH,GVH,GVH,GVH... PFC
2 The unresponsive state which developed in GVHg... GVH,GVH,GVH,GVH... PFC
2 Furthermore, GVHgeneword spleen cells suppress... GVH,GVH,GVH,GVH... PFC
2 This active suppressor effect was found to be ... GVH,GVH,GVH,GVH... PFC
2 The delayed transfer of GVHgeneword cells to i... GVH,GVH,GVH,GVH... PFC
I want to keep only the rows that have either an ind or a tar value in the corresponding sentence.
The problem is that when ind or tar contains more than one element, the row is not matched even if one of those elements appears in the sentence, because the whole string is used as the search term. For example, in the 5th row the word GVH appears in the sentence, but the check uses the whole value GVH,GVH,GVH,GVH as ind instead of each GVH term separately. How can I fix this? Here's my code so far:
df['check_ind'] = df.apply(lambda x: x.ind in x.sentences, axis=1)
df['check_tar'] = df.apply(lambda x: x.tar in x.sentences, axis=1)
df = df.loc[(df['check_ind'] == True) | (df['check_tar'] == True)]
print(df.sentences.iloc[4], '\n')
print(df.ind.iloc[4], '\n')
print(df.tar.iloc[4], '\n')
print(df.check_ind.iloc[4], '\n')
print(df.check_tar.iloc[4], '\n')
>>>> The GVH kinetics and cellular characteristics indicated that suppressor T cells exert an anti-mitotic influence on antigen-stimulated B-cell proliferation. .
>>>> GVH,GVH,GVH,GVH,GVH,GVH
>>>> PFC
>>>> False (This should return TRUE since GVH is in the sentence)
>>>> False
Data:
{'id': [0, 0, 1, 1, 2, 2, 2, 2, 2, 2],
'sentences': ['In samples of depression injected intraneously...',
'Monomethylmethacrylate in whole blood was asso...',
'In samples of depression injected intraneously...',
'Monomethylmethacrylate in whole blood was asso...',
'The GVH kinetics and cellular characteristics ...',
'Effects on PFCgeneword responses to thymus-dep...',
'The unresponsive state which developed in GVHg...',
'Furthermore, GVHgeneword spleen cells suppress...',
'This active suppressor effect was found to be ...',
'The delayed transfer of GVHgeneword cells to i...'],
'ind': ['depression', 'depression', 'depression',
'depression', 'GVH,GVH,GVH,GVH...',
'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
'GVH,GVH,GVH,GVH...'],
'tar': ['albumin', 'albumin', 'hip', 'hip', 'PFC', 'PFC',
'PFC', 'PFC', 'PFC', 'PFC']}
Solution:
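Your code currently treats x.ind as a single value, so x.ind in x.sentences looks for the entire string "GVH,GVH,GVH,GVH" inside the sentence.
Conceptually, x.ind is not a single value but a comma-separated list of values, and each value has to be checked on its own.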
In Python, you can turn a comma-separated string into an actual list with str.split(','), here x.ind.split(','). In addition, str.strip() is useful to remove stray spaces around each term (for instance, if you have "GVH ,GVH ", the spaces should probably be ignored).
Finally, the built-in functions any and all are convenient for testing a condition against every element of a list: any returns True if at least one element satisfies it, all only if every element does.
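df['check_ind'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.ind.split(',')), axis=1)
df['check_tar'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.tar.split(',')), axis=1)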
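For reference, here is a minimal end-to-end sketch of the same idea. The helper name contains_any and the three-row frame are made up for illustration (they are not the asker's real data), and the sketch assumes ind and tar are always strings. It also skips empty pieces, since a trailing comma would otherwise produce an empty string, and '' in some_string is always True in Python.

```python
import pandas as pd

def contains_any(terms: str, sentence: str) -> bool:
    """Return True if any comma-separated term occurs in the sentence."""
    # Skip empty pieces: '' in sentence is always True, which would
    # make every row match when the value has a trailing comma.
    return any(t.strip() in sentence for t in terms.split(',') if t.strip())

# Small made-up frame in the same shape as the question's data
df = pd.DataFrame({
    'sentences': ['The GVH kinetics and cellular characteristics',
                  'This active suppressor effect was found',
                  'Effects on PFCgeneword responses'],
    'ind': ['GVH,GVH,GVH', 'GVH, GVH', 'GVH,'],
    'tar': ['PFC', 'PFC', 'PFC'],
})

df['check_ind'] = df.apply(lambda x: contains_any(x.ind, x.sentences), axis=1)
df['check_tar'] = df.apply(lambda x: contains_any(x.tar, x.sentences), axis=1)

# Keep rows where at least one of the two checks succeeded
filtered = df.loc[df['check_ind'] | df['check_tar']]
print(filtered)
```

Note that this is a plain substring test, so PFC also matches inside PFCgeneword. If you need whole-word matches only, a regular expression with word boundaries would be one option, but that goes beyond what the question asks for.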