Find string matching among columns

I have a DataFrame that looks like this:

id      sentences                                           ind                 tar
0       In samples of depression injected intraneously...   depression        albumin
0       Monomethylmethacrylate in whole blood was asso...   depression        albumin
1       In samples of depression injected intraneously...   depression          hip
1       Monomethylmethacrylate in whole blood was asso...   depression          hip
2       The GVH kinetics and cellular characteristics ...   GVH,GVH,GVH,GVH...  PFC
2       Effects on PFCgeneword responses to thymus-dep...   GVH,GVH,GVH,GVH...  PFC
2       The unresponsive state which developed in GVHg...   GVH,GVH,GVH,GVH...  PFC
2       Furthermore, GVHgeneword spleen cells suppress...   GVH,GVH,GVH,GVH...  PFC
2       This active suppressor effect was found to be ...   GVH,GVH,GVH,GVH...  PFC
2       The delayed transfer of GVHgeneword cells to i...   GVH,GVH,GVH,GVH...  PFC

I want to keep only the rows that have either an ind or a tar value in the corresponding sentence.

The problem is that when ind or tar contains more than one element, the row is not matched even if one of those elements appears in the sentence, because the whole string is used as a single search term. For example, in the 5th row, even though the word GVH appears in the sentence, the check uses the whole value GVH,GVH,GVH,GVH as ind instead of each GVH term separately. How can I fix this? Here's my code so far:

df['check_ind'] = df.apply(lambda x: x.ind in x.sentences, axis=1)
df['check_tar'] = df.apply(lambda x: x.tar in x.sentences, axis=1)
df = df.loc[(df['check_ind'] == True) | (df['check_tar'] == True)]

print(df.sentences.iloc[4], '\n')

print(df.ind.iloc[4], '\n')

print(df.tar.iloc[4], '\n')

print(df.check_ind.iloc[4], '\n')

print(df.check_tar.iloc[4], '\n')


>>>> The GVH kinetics and cellular characteristics indicated that suppressor T cells exert an anti-mitotic influence on antigen-stimulated B-cell proliferation. . 

>>>> GVH,GVH,GVH,GVH,GVH,GVH 

>>>> PFC 

>>>> False (This should return TRUE since GVH is in the sentence)

>>>> False 

Data:

{'id': [0, 0, 1, 1, 2, 2, 2, 2, 2, 2],
 'sentences': ['In samples of depression injected intraneously...',
  'Monomethylmethacrylate in whole blood was asso...',
  'In samples of depression injected intraneously...',
  'Monomethylmethacrylate in whole blood was asso...',
  'The GVH kinetics and cellular characteristics ...',
  'Effects on PFCgeneword responses to thymus-dep...',
  'The unresponsive state which developed in GVHg...',
  'Furthermore, GVHgeneword spleen cells suppress...',
  'This active suppressor effect was found to be ...',
  'The delayed transfer of GVHgeneword cells to i...'],
 'ind': ['depression', 'depression', 'depression',
         'depression', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...'],
 'tar': ['albumin', 'albumin', 'hip', 'hip', 'PFC', 'PFC',
         'PFC', 'PFC', 'PFC', 'PFC']}

>Solution :

Your code is currently treating x.ind as if it were a simple value.

Conceptually x.ind is not a single value, but rather a comma-separated list of values.

In Python, you can turn a comma-separated string into an actual Python list with x.ind.split(','). In addition, str.strip() is useful for removing stray spaces (for instance, if you have "GVH ,GVH ", the spaces should probably be ignored).
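As a small illustration of splitting and stripping, here is what happens to the "GVH ,GVH " string mentioned above. The variable names are only for illustration.

```python
raw = "GVH ,GVH "  # the example string with stray spaces

# split(',') cuts the string at each comma but keeps the surrounding spaces
parts = raw.split(",")

# strip() removes leading and trailing whitespace from each piece
terms = [p.strip() for p in parts]

print(parts)
print(terms)
```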

Finally, the built-in functions any and all are convenient for testing a condition across every element of a list.

df['check_ind'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.ind.split(',')), axis=1)
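For completeness, here is a minimal sketch that applies the same idea to both columns and then filters. It uses a few rows copied from the question's data, including the truncated "..." values exactly as posted. The helper name contains_any is hypothetical, and skipping empty terms is an added assumption: an empty string is a substring of every sentence, so a trailing comma would otherwise make every row match.

```python
import pandas as pd

def contains_any(terms, sentence):
    """Return True if any comma-separated term occurs in the sentence.

    Hypothetical helper. Empty terms are skipped, because '' is a
    substring of every string and would always match.
    """
    return any(t.strip() in sentence for t in terms.split(",") if t.strip())

# A few rows taken from the question's data (truncated values kept as posted)
df = pd.DataFrame({
    "id": [0, 0, 2, 2, 2],
    "sentences": [
        "In samples of depression injected intraneously...",
        "Monomethylmethacrylate in whole blood was asso...",
        "The GVH kinetics and cellular characteristics ...",
        "Effects on PFCgeneword responses to thymus-dep...",
        "This active suppressor effect was found to be ...",
    ],
    "ind": ["depression", "depression", "GVH,GVH,GVH,GVH...",
            "GVH,GVH,GVH,GVH...", "GVH,GVH,GVH,GVH..."],
    "tar": ["albumin", "albumin", "PFC", "PFC", "PFC"],
})

df["check_ind"] = df.apply(lambda row: contains_any(row["ind"], row["sentences"]), axis=1)
df["check_tar"] = df.apply(lambda row: contains_any(row["tar"], row["sentences"]), axis=1)

# Keep rows where at least one ind term or tar term occurs in the sentence
filtered = df[df["check_ind"] | df["check_tar"]]
print(filtered[["sentences", "check_ind", "check_tar"]])
```

Note that this is plain substring matching, so GVH also matches inside GVHgeneword, which is what the question's data appears to need.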