Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Filter by longest sequence and delete shorter sequences by duplicates

I’ve got a problem and I don’t really know how to solve it with Python.

I want to filter my output of words and sequences and I want to remove ‘duplicates’. I put duplicates in quotation marks because they are not real duplicates.
The filter should prefer the longest possible sequence without any duplicate words. See the following example for clarity.

Example output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

out: ['quick', 'brown', 'fox', 'quick fox', 'brown fox', 'grey wolf', 'brown hare', 'quick brown fox']

After it’s going through the filter, only ['quick brown fox', 'grey wolf', 'brown hare'] should be in the result.
I don’t know if it’s possible to filter ['brown'] but keeping both [‘quick brown fox’] and [‘brown hare’]…

I hope that I described my problem clearly enough.
Any help and ideas are very welcome! Thanks in advance.

>Solution :

In other words, you need to find all sequences that are not subsets of any other sequence.

Let’s proceed like this:

  • convert each item into a pair text, set of words
  • sort the list by the set length descending
  • for every pair, check if the set component is a subset of any preceding set
  • if not, add to the result

inp = ['quick', 'brown', 'fox', 'quick fox', 'brown fox', 'grey wolf', 'brown hare', 'quick brown fox']

pairs = [
    [s, set(s.split())]
    for s in inp
]

pairs.sort(key=lambda p: -len(p[1]))

res = [
    pair[0]
    for n, pair in enumerate(pairs)
    if all(not pair[1].issubset(prev[1]) for prev in pairs[:n])
]

print(res) # ['quick brown fox', 'grey wolf', 'brown hare']
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading