276550 (2)
On page 169, the author has code that creates a TaggedDocument corpus by appending to a list. This can be slow (although not as slow as the overall processing) and incurs extra memory overhead.

More efficient code would be:
import numpy as np
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

nrow = len(corpus)  # requires a sized corpus
training_corpus = np.empty(nrow, dtype=object)  # preallocate NumPy array of TaggedDocuments
for i, text in enumerate(corpus):
    training_corpus[i] = TaggedDocument(simple_preprocess(text), [i])
hobs (58)
You're right! That would be faster and more compact. That's especially valuable for any readers building production pipelines. I'm sure you also appreciate that the array preallocation wouldn't be possible if your corpus is an unsized iterator, generator, or stream. And the NumPy syntax would reduce the readability of the code example for some new Python developers. Nonetheless, I'll mention your optimization in a "pro tip," and if you send me your name by private message I can credit you by name instead of MEAP reader number.
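For the unsized-corpus case mentioned above, a generator function sidesteps preallocation entirely by yielding one tagged document at a time. Here is a minimal sketch; to keep it self-contained and runnable it uses a namedtuple stand-in for gensim's TaggedDocument and a trivial whitespace tokenizer in place of simple_preprocess:

```python
from collections import namedtuple

# Stand-ins so the sketch runs without gensim installed; in the
# book's code these would be gensim's TaggedDocument and
# gensim.utils.simple_preprocess.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

def simple_preprocess(text):
    return text.lower().split()

def tagged_corpus(texts):
    """Lazily yield one TaggedDocument per input text."""
    for i, text in enumerate(texts):
        yield TaggedDocument(simple_preprocess(text), [i])

# Works with any iterable, sized or not (here, a generator expression):
docs = list(tagged_corpus(t for t in ["Hello world", "Streaming corpora work too"]))
```

This trades random access for constant memory: nothing is materialized until the consumer iterates, which suits corpora read from disk or a network stream.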