Find common patterns in a sequence of words

I have a large list of strings, in which a sequence of sounds is stored. For example:

strings = ['A','B','C','G','F','F','F','A',...,'F']

What I would like to do is perform a statistical analysis in which I would define the subsequence length and return a list (or a dictionary, which is probably more practical) that goes like this:

subsequence_length = 5
output = {['A','B','A','A','F']: 129, ['B','G','G','F','F']: 112, ...}

subsequence_length = 3
output = {['A','A','F']: 209, ['G','F','F']: 198, ...}

What I have tried so far was creating a sort of kernel that follows a loop, such as:

for i in range(0, len(strings) - subsequence_length, subsequence_length):
    # count operation

However, I have struggled to find a fast solution: when the initial list is very large (thousands of elements), this method is not efficient. Is there a regex command (or something similar) that can achieve this? Thanks!

Solution:

You could use the natural language processing toolkit nltk (install with pip install nltk) to achieve your task:

import nltk

output = nltk.FreqDist(nltk.ngrams(strings, subsequence_length))

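nltk.ngrams yields every overlapping sub-sequence of length subsequence_length as a tuple, and nltk.FreqDist counts those tuples in a dictionary-like counter object. Calling output.most_common(10), for example, returns the ten most frequent sub-sequences with their counts.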
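If installing nltk is not an option, the same sliding-window count can probably be done with the standard library alone. The following is a minimal sketch, not part of the original answer: the helper name count_subsequences and the sample list are made up for illustration. Note that the keys are tuples rather than lists, because lists are not hashable and cannot be used as dictionary keys.

```python
from collections import Counter


def count_subsequences(items, n):
    """Count every overlapping window of length n in items.

    Hypothetical helper for illustration; keys of the result are tuples.
    """
    # Zipping n shifted views of the list yields each window as a tuple.
    # zip stops at the shortest view, so only full-length windows appear.
    windows = zip(*(items[i:] for i in range(n)))
    return Counter(windows)


# Made-up sample data standing in for the real list of sounds.
strings = ['A', 'B', 'A', 'B', 'A', 'F']
subsequence_length = 2

counts = count_subsequences(strings, subsequence_length)

# Print the sub-sequences from most to least frequent.
for pattern, frequency in counts.most_common():
    print(pattern, frequency)
```

This makes a single pass over the data, so a list of a few thousand elements should be handled quickly. The slices create n shallow copies of the list, which is fine at this scale but worth keeping in mind for much larger inputs.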
