hettlage (133)
#1
Shouldn't the line

wordvecs = np.zeros((len(vocab), len(vocab)), int)


read

wordvecs = np.zeros((len(sentence.split()), len(vocab)), int)
?
hobs (58)
#2
Great catch! That fix will make the explanation a lot clearer. We got (un)lucky with our example document, which happens to have 10 unique tokens (the vocabulary size and the sentence word count are the same), so I didn't notice the missing rows.
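
For anyone following along, here's a minimal sketch of the corrected shape, using a toy sentence (mine, not the book's listing) whose token count differs from its vocabulary size:

import numpy as np

sentence = "the cat sat on the mat"
tokens = sentence.split()    # 6 tokens
vocab = sorted(set(tokens))  # 5 unique words, because 'the' repeats

# one row per token in the sentence, one column per vocabulary word
wordvecs = np.zeros((len(tokens), len(vocab)), int)
for i, token in enumerate(tokens):
    wordvecs[i, vocab.index(token)] = 1

print(wordvecs.shape)  # (6, 5) -- the original (len(vocab), len(vocab)) would give (5, 5)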

Did you find the "player piano" example helpful or confusing?

Are you typing out all of the examples by hand, or copying and pasting into your python/ipython console? Would you prefer we eliminated the ">>>" prompts (the doctest convention) from the examples? The prompt characters confuse a standard `python` console because they are not part of `python` syntax, but an `ipython` console will parse and ignore them.
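
For reference, the prompt convention in question looks like this; the ">>>" marks typed input and the bare line below it is the expected output:

>>> import numpy as np
>>> np.zeros(3, int)
array([0, 0, 0])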
hettlage (133)
#3
I have to confess I didn't pay too much attention to the player piano example. I'm not convinced it adds terribly much clarity, but your mileage may vary, of course.

However, there is a question that does spring to mind when you discuss the sheer size of the matrix of word vectors: Why don't you just store a number ("the word is the n-th entry in your vocabulary") instead of a one-hot vector?

I do indeed copy-and-paste (or type) at least some of the examples. I'd say that if you include the Python output, you should keep the ">>>" prompts for the sake of clarity.
hobs (58)
#4
That's helpful feedback. You're right that a sequence of word indexes would be much more space-efficient. But I don't see it used much in NLP, because it's not that different from a sequence of words (strings): a word string is itself just a sequence of bytes, not much bigger than a 64-bit integer on average. So in python it's easiest to just list the strings themselves. The value of one-hot vectors will become apparent when we introduce neural nets in Chapter 6 (Word2vec) and Chapter 9 (CNNs), where these one-hot vectors are used directly as the input layer. For a neural net they are "replayed" into the logic (mechanism), like the holes in a paper roll toggling the levers of a player piano (Wikipedia), or the bumps on a metal disk plucking the tines of a music box (Wikipedia).
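
Here's a rough side-by-side sketch of the two representations (a toy example, with a made-up 3-neuron weight matrix, just to show the idea):

import numpy as np

sentence = "what a piano plays"
words = sentence.split()
vocab = sorted(set(words))

# compact representation: one integer index per word
indexes = [vocab.index(w) for w in words]

# one-hot representation: each row can be fed straight into a
# neural net's input layer, like one row of holes on a piano roll
onehots = np.zeros((len(words), len(vocab)), int)
onehots[np.arange(len(words)), indexes] = 1

# multiplying a one-hot row by a weight matrix just selects one row
# of the weights -- that's what makes one-hot inputs useful for nets
weights = np.random.rand(len(vocab), 3)  # hypothetical 3-neuron layer
assert np.allclose(onehots[0] @ weights, weights[indexes[0]])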

Perhaps we should wait to introduce one-hot vectors until they are used in an example pipeline.