Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Trie is Using Too Much Memory

I’m trying to get all the words made from the letters, ‘crbtfopkgevyqdzsh’ from a file called web2.txt. The posted cell below follows a block of code which improperly returned the whole run up to a full word e.g. for the word shocked it would return s, sh, sho, shoc, shock, shocke, shocked

So I tried a trie (know pun intended).

web2.txt is 2.5 MB in size, and contains 2,493,838 words of varying length. The trie in the cell below is breaking my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+ to try and accommodate the block of code, but it’s still too much. Any more efficient ideas besides trie to get the same result?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

# Find the words3 word list here:  svnweb.freebsd.org/base/head/share/dict/web2?view=co

trie = {}

with open('/content/web2.txt') as words3:


    for word in words3:
        cur = trie
        for l in word:
            cur  = cur.setdefault(l, {})
            cur['word'] = True # defined if this node indicates a complete word
        
def findWords(word, trie = trie, cur = '', words3 = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur+letter, words3 )    
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur+letter, word_list )

    return [word for word in words3 if word in words3]

words3 = findWords("crbtfopkgevyqdzsh")

I’m using Pyhton3

>Solution :

A trie is overkill. There’s about 200 thousand words, so you can just make one pass through all of them to see if you can form the word using the letters in the base string.

This is a good use case for collections.Counter, which gives us a clean way to get the frequencies (i.e. "counters") of the letters of an arbitrary string:

from collections import Counter

base_counter = Counter("crbtfopkgevyqdzsh")
with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # Can use <= instead if on Python 3.10
        if line_counter & base_counter == line_counter:
            print(line)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading