Gensim Word2Vec exhausting iterable

I’m getting the following prompt when calling model.train() from gensim word2vec

INFO : EPOCH 0: training on 0 raw words (0 effective words) took 0.0s, 0 effective words/s

The only solutions I found in my search for an answer point to the iterable-vs-iterator distinction. At this point I've tried everything I could to solve this on my own; currently, my code looks like this:

import re
from gensim import utils
from gensim.models import Word2Vec

class MyCorpus:
    def __init__(self, corpus):
        self.corpus = corpus.copy()

    def __iter__(self):
        for line in self.corpus:
            x = re.sub("(<br ?/?>)|([,.'])|([^ A-Za-z']+)", '', line.lower())
            yield utils.simple_preprocess(x)

sentences = MyCorpus(corpus)
w2v_model = Word2Vec(
    sentences = sentences,
    vector_size = w2v_size, 
    window = w2v_window, 
    min_count = w2v_min_freq, 
    workers = -1
    )

The corpus variable is a list containing sentences, and each sentence is a string.

I tried numerous "tests" to check that my class is indeed iterable (and not a one-shot iterator), like repeating:

    print(sum(1 for _ in sentences))
    print(sum(1 for _ in sentences))
    print(sum(1 for _ in sentences))

All of them suggest that my class is iterable, so at this point I think the problem must be something else.

Solution:

workers=-1 is not a supported value for Gensim's Word2Vec model; it essentially means you're using no worker threads, which is why the log reports training on 0 words.

Instead, you must specify the actual number of worker threads you’d like to use.

When using an iterable corpus, the optimal number of workers is usually some number up to your number of CPU cores, but not higher than 8-12 if you've got 16+ cores, because of some hard-to-remove inefficiencies in both Python's Global Interpreter Lock ("GIL") and Gensim's master-reader-thread approach.

Generally, also, you’ll get better throughput if your iterable isn’t doing anything expensive or repetitive in its preprocessing – like any regex-based tokenization, or a tokenization that’s repeated on every epoch. So best to do such preprocessing once, writing the resulting simple space-delimited tokens to a new file. Then, read that file with a very-simple, no-regex, space-splitting only tokenization.
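That one-time preprocessing pass could look something like this (the file name and regex mirror the question; both are just for illustration):

```python
import re

corpus = ["First line, with <br/> markup.", "Second line here."]

# One-time pass: apply the expensive regex cleanup once, write plain
# space-delimited tokens to a file.
with open("tokens.txt", "w", encoding="utf-8") as f:
    for line in corpus:
        cleaned = re.sub(r"(<br ?/?>)|([,.'])|([^ A-Za-z']+)", "", line.lower())
        f.write(" ".join(cleaned.split()) + "\n")

# Later: a cheap, restartable iterable that only splits on whitespace --
# no regex work repeated on every epoch.
class TokenFileCorpus:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()
```

Because `__iter__` reopens the file each time, the corpus can be iterated once per epoch without exhausting.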

(If performance becomes a major concern on a large dataset, you can also look into the alternate corpus_file method of specifying your corpus. It expects a single file, where each text is on its own line, and tokens are already just space-delimited. But it then lets every worker thread read its own range of the file, with far less GIL/reader-thread bottlenecking, so using workers equal to the CPU core count is then roughly optimal for throughput.)
