Trie is Using Too Much Memory

June 23, 2022

I’m trying to get all the words made from the letters, ‘crbtfopkgevyqdzsh’ from a file called web2.txt. The posted cell below follows a block of code which improperly returned the whole run up to a full word e.g. for the word shocked it would return s, sh, sho, shoc, shock, shocke, shocked

So I tried a trie (know pun intended).

web2.txt is 2.5 MB in size, and contains 2,493,838 words of varying length. The trie in the cell below is breaking my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+ to try and accommodate the block of code, but it’s still too much. Any more efficient ideas besides trie to get the same result?

# Find the words3 word list here:  svnweb.freebsd.org/base/head/share/dict/web2?view=co

trie = {}

with open('/content/web2.txt') as words3:


    for word in words3:
        cur = trie
        for l in word:
            cur  = cur.setdefault(l, {})
            cur['word'] = True # defined if this node indicates a complete word
        
def findWords(word, trie = trie, cur = '', words3 = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur+letter, words3 )    
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur+letter, word_list )

    return [word for word in words3 if word in words3]

words3 = findWords("crbtfopkgevyqdzsh")

I’m using Pyhton3

>Solution :

A trie is overkill. There’s about 200 thousand words, so you can just make one pass through all of them to see if you can form the word using the letters in the base string.

This is a good use case for collections.Counter, which gives us a clean way to get the frequencies (i.e. "counters") of the letters of an arbitrary string:

from collections import Counter

base_counter = Counter("crbtfopkgevyqdzsh")
with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # Can use <= instead if on Python 3.10
        if line_counter & base_counter == line_counter:
            print(line)