Find common patterns in a sequence of words

December 11, 2021

I have a large list of strings, in which a sequence of sounds is stored. For example:

strings = ['A','B','C','G','F','F','F','A',...,'F']

What I would like to do is perform a statistical analysis in which I would define the subsequence length and return a list (or a dictionary, which is probably more practical) that goes like this:

subsequence_length = 5
output = {('A','B','A','A','F'): 129, ('B','G','G','F','F'): 112, ...}

subsequence_length = 3
output = {('A','A','F'): 209, ('G','F','F'): 198, ...}

What I have tried so far was creating a sort of kernel that follows a loop, such as:

for i in range(0, len(strings) - subsequence_length, subsequence_length):
    # count operation

I have struggled, however, with finding a fast solution (when the initial list is very large, like thousands of elements, this method is really not efficient). Is there any regex command (or similar) that can achieve this? Thanks!

Solution:

You could use the Natural Language Toolkit, nltk (install with pip install nltk), to achieve your task:

import nltk

# Count every contiguous sub-sequence of the given length
output = nltk.FreqDist(nltk.ngrams(strings, subsequence_length))

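nltk.ngrams yields every overlapping sub-sequence of size subsequence_length as a tuple, and nltk.FreqDist collects them into a dictionary-like counter object (it subclasses collections.Counter), so output.most_common(10) returns the ten most frequent sub-sequences with their counts.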
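
If installing nltk is not an option, roughly the same counting can be sketched with the standard library alone. This is a minimal sketch, not the answer's method: count_subsequences is a hypothetical helper name, the sample list is made up for illustration, and it assumes the input is a list (or other sliceable sequence) and that overlapping windows are wanted, which is how nltk.ngrams behaves.

```python
from collections import Counter

def count_subsequences(items, n):
    """Count every overlapping window of length n in items (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Build n shifted views of the list and zip them together:
    # the k-th tuple produced is (items[k], items[k+1], ..., items[k+n-1]).
    # zip stops at the shortest view, so no partial windows are produced.
    windows = zip(*(items[i:] for i in range(n)))
    return Counter(windows)

# Made-up sample data, for illustration only
strings = ['A', 'B', 'C', 'A', 'B', 'C', 'A']
counts = count_subsequences(strings, 3)

# Most frequent sub-sequences first, as (tuple, count) pairs
for subsequence, count in counts.most_common():
    print(subsequence, count)
```

The keys are tuples rather than lists because lists are not hashable and cannot be used as dictionary keys. Each slice copies part of the list, so memory use grows with n; for a list of a few thousand elements that should be negligible.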