Find all sentences containing specific words

I have a string consisting of sentences and want to find all sentences that contain at least one specific keyword, i.e. keyword1 or keyword2:

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^\.!?].*(keyword1)|(keyword2).*[\.!?])\s")
for match in pattern.findall(s):
    print(match)

Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('keyword2 is inside this sentence. ', '', 'keyword2')

Expected Output:

('This is a sentence which contains keyword1', 'keyword1', '')
('And keyword2 is inside this sentence. ', '', 'keyword2')

As you can see, the second match doesn’t contain the whole sentence in the first group. What am I missing here?

Solution:

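The alternation | in your pattern is not limited to the two keywords: it splits all of group 1 into two alternatives, [A-Z][^\.!?].*(keyword1) or (keyword2).*[\.!?], so the second match starts at keyword2 instead of at the beginning of the sentence. Wrap the keywords in a non-capturing group (?:...) so that | only chooses between them, and use a negated character class [^.!?]* instead of .* so that a match cannot run past the end of a sentence.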

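Because the pattern contains capture groups, re.findall returns a tuple of the group values for each match: group 1 is the whole sentence, and groups 2 and 3 hold keyword1 or keyword2. The keyword group that did not take part in the match is returned as an empty string.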

([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s
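
The tuple shape that re.findall produces can be seen in isolation with a smaller pattern. This is a minimal sketch; the sample text and variable names are made up for illustration. It also shows that a single capture group around the whole alternation gives plain strings without empty placeholders, if you do not need to know which group matched.

```python
import re

# Illustrative sample text, not taken from the question.
text = "Has keyword1 here. Has keyword2 there. "

# Two keyword groups: the group that did not participate comes back as ''.
two_groups = re.findall(r"(keyword1)|(keyword2)", text)
print(two_groups)  # [('keyword1', ''), ('', 'keyword2')]

# One group around the alternation: findall returns plain strings.
one_group = re.findall(r"(keyword1|keyword2)", text)
print(one_group)  # ['keyword1', 'keyword2']
```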

Explanation

  • ( Capture group 1
    • [A-Z][^.!?]* Match an uppercase char A-Z, then zero or more chars other than . ! ?
    • (?:(keyword1)|(keyword2)) Non-capturing group that scopes the alternation; each keyword is captured in its own group (2 or 3)
    • [^.!?]*[.!?] Match zero or more chars other than . ! ? and then one of . ! ?
  • ) Close group 1
  • \s Match a single whitespace char after the sentence
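
To see the effect of scoping the alternation, the two patterns can be run side by side on the string from the question. This is a small sketch; the names unscoped and scoped are only labels chosen here. It uses finditer so that only group 1 (the sentence) is collected.

```python
import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

# Without grouping, | splits all of group 1 into two alternatives:
#   [A-Z][^.!?].*(keyword1)   OR   (keyword2).*[.!?]
unscoped = re.compile(r"([A-Z][^.!?].*(keyword1)|(keyword2).*[.!?])\s")

# With (?:...), | only chooses between the two keywords.
scoped = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")

unscoped_sentences = [m.group(1) for m in unscoped.finditer(s)]
scoped_sentences = [m.group(1) for m in scoped.finditer(s)]

# The unscoped pattern starts the second match at the keyword itself.
print(unscoped_sentences[-1])  # keyword2 is inside this sentence.
print(scoped_sentences[-1])    # And keyword2 is inside this sentence.
```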

Example

import re

s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence. "

pattern = re.compile(r"([A-Z][^.!?]*(?:(keyword1)|(keyword2))[^.!?]*[.!?])\s")
for match in pattern.findall(s):
    print(match)

Output

('This is a sentence which contains keyword1.', 'keyword1', '')
('And keyword2 is inside this sentence.', '', 'keyword2')
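
If the keywords come from a list, the same idea can be wrapped in a helper. The following is a sketch under a few assumptions rather than part of the original answer: the function name sentences_with_keywords is hypothetical, the keywords are escaped with re.escape in case they contain regex metacharacters, and the trailing \s is swapped for the lookahead (?=\s|$) so that a final sentence with no whitespace after it is still found.

```python
import re

def sentences_with_keywords(text, keywords):
    """Return (sentence, keyword) pairs for sentences containing any keyword."""
    # Escape each keyword so characters like '.' or '+' are matched literally.
    alternation = "|".join(re.escape(k) for k in keywords)
    # One capture group for the sentence, one for whichever keyword matched.
    pattern = re.compile(
        rf"([A-Z][^.!?]*({alternation})[^.!?]*[.!?])(?=\s|$)"
    )
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

# Same sentences as above, but without the trailing space.
s = "This is a sentence which contains keyword1. And keyword2 is inside this sentence."
for sentence, keyword in sentences_with_keywords(s, ["keyword1", "keyword2"]):
    print(sentence, "->", keyword)
```

Because [^.!?]* is greedy, a sentence that contains several keywords reports the last one in the second group.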