Find string matching among columns

February 28, 2022

I have a DataFrame that looks like this:

id      sentences                                           ind                 tar
0       In samples of depression injected intraneously...   depression        albumin
0       Monomethylmethacrylate in whole blood was asso...   depression        albumin
1       In samples of depression injected intraneously...   depression          hip
1       Monomethylmethacrylate in whole blood was asso...   depression          hip
2       The GVH kinetics and cellular characteristics ...   GVH,GVH,GVH,GVH...  PFC
2       Effects on PFCgeneword responses to thymus-dep...   GVH,GVH,GVH,GVH...  PFC
2       The unresponsive state which developed in GVHg...   GVH,GVH,GVH,GVH...  PFC
2       Furthermore, GVHgeneword spleen cells suppress...   GVH,GVH,GVH,GVH...  PFC
2       This active suppressor effect was found to be ...   GVH,GVH,GVH,GVH...  PFC
2       The delayed transfer of GVHgeneword cells to i...   GVH,GVH,GVH,GVH...  PFC

I want to keep only the rows that have either an ind or a tar value in the corresponding sentence.

The problem is that when ind or tar contains more than one element, the row is not matched even if one of those elements appears in the sentence, because the whole string is used as a single term. For example, in the 5th row, even though the word GVH appears in the sentence, the whole value GVH,GVH,GVH,GVH is used as ind rather than each GVH term separately. Can someone help me fix this? Here's my code so far:

df['check_ind'] = df.apply(lambda x: x.ind in x.sentences, axis=1)
df['check_tar'] = df.apply(lambda x: x.tar in x.sentences, axis=1)
df = df.loc[(df['check_ind'] == True) | (df['check_tar'] == True)]

print(df.sentences.iloc[4], '\n')
print(df.ind.iloc[4], '\n')
print(df.tar.iloc[4], '\n')
print(df.check_ind.iloc[4], '\n')
print(df.check_tar.iloc[4], '\n')


>>>> The GVH kinetics and cellular characteristics indicated that suppressor T cells exert an anti-mitotic influence on antigen-stimulated B-cell proliferation. . 

>>>> GVH,GVH,GVH,GVH,GVH,GVH 

>>>> PFC 

>>>> False (This should return TRUE since GVH is in the sentence)

>>>> False

Data:

{'id': [0, 0, 1, 1, 2, 2, 2, 2, 2, 2],
 'sentences': ['In samples of depression injected intraneously...',
  'Monomethylmethacrylate in whole blood was asso...',
  'In samples of depression injected intraneously...',
  'Monomethylmethacrylate in whole blood was asso...',
  'The GVH kinetics and cellular characteristics ...',
  'Effects on PFCgeneword responses to thymus-dep...',
  'The unresponsive state which developed in GVHg...',
  'Furthermore, GVHgeneword spleen cells suppress...',
  'This active suppressor effect was found to be ...',
  'The delayed transfer of GVHgeneword cells to i...'],
 'ind': ['depression', 'depression', 'depression',
         'depression', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...'],
 'tar': ['albumin', 'albumin', 'hip', 'hip', 'PFC', 'PFC',
         'PFC', 'PFC', 'PFC', 'PFC']}

Solution:

Your code currently treats x.ind as a single value. Conceptually, though, x.ind is a comma-separated list of values.

In Python, you can turn a comma-separated string into an actual Python list using x.ind.split(','). In addition, str.strip() is useful for removing stray spaces (for instance, if you have "GVH ,GVH ", the spaces should probably be ignored).
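As a minimal sketch of the splitting step on its own: the helper name split_terms is hypothetical (it is not part of the original answer), and the choice to drop empty pieces is an assumption. Dropping them matters because an empty string is a substring of every string, so a trailing comma would otherwise make every row match.

```python
def split_terms(raw):
    """Split a comma-separated string into clean terms.

    Hypothetical helper: strips surrounding whitespace from each piece
    and drops empty pieces (e.g. from a trailing comma).
    """
    return [piece.strip() for piece in raw.split(",") if piece.strip()]


terms = split_terms("GVH ,GVH , PFC,")
print(terms)
```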

Finally, the built-in functions any and all are convenient for checking a condition across every element of a list.

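df['check_ind'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.ind.split(',')), axis=1)
df['check_tar'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.tar.split(',')), axis=1)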
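Putting the pieces together, here is a hedged end-to-end sketch that runs on the truncated sample data from the question, exactly as posted (including the "..." elisions). The helper names split_terms and contains_any are hypothetical, and skipping empty terms is an assumption rather than something the original answer specifies.

```python
import pandas as pd

# Sample data copied from the question; the strings are truncated there too.
df = pd.DataFrame({
    'id': [0, 0, 1, 1, 2, 2, 2, 2, 2, 2],
    'sentences': ['In samples of depression injected intraneously...',
                  'Monomethylmethacrylate in whole blood was asso...',
                  'In samples of depression injected intraneously...',
                  'Monomethylmethacrylate in whole blood was asso...',
                  'The GVH kinetics and cellular characteristics ...',
                  'Effects on PFCgeneword responses to thymus-dep...',
                  'The unresponsive state which developed in GVHg...',
                  'Furthermore, GVHgeneword spleen cells suppress...',
                  'This active suppressor effect was found to be ...',
                  'The delayed transfer of GVHgeneword cells to i...'],
    'ind': ['depression'] * 4 + ['GVH,GVH,GVH,GVH...'] * 6,
    'tar': ['albumin', 'albumin', 'hip', 'hip'] + ['PFC'] * 6,
})


def split_terms(raw):
    # Hypothetical helper: split on commas, strip spaces, skip empty pieces.
    return [piece.strip() for piece in raw.split(',') if piece.strip()]


def contains_any(sentence, raw_terms):
    # True if at least one individual term occurs in the sentence (substring match).
    return any(term in sentence for term in split_terms(raw_terms))


df['check_ind'] = [contains_any(s, t) for s, t in zip(df['sentences'], df['ind'])]
df['check_tar'] = [contains_any(s, t) for s, t in zip(df['sentences'], df['tar'])]

# Keep rows where either column has a term present in the sentence.
filtered = df[df['check_ind'] | df['check_tar']]
print(filtered[['id', 'sentences', 'check_ind', 'check_tar']])
```

Note that the in operator does substring matching, so GVH also matches inside GVHgeneword. If only whole-word matches are wanted, a regular expression with word boundaries would be one alternative.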

by MR

Published February 28, 2022
