hettlage (89)
Section seems to be a bit ambiguous regarding the definition of "term frequency". On the one hand, the main text suggests that it is the ratio of the number of occurrences of a word to the total number of words. On the other hand, the Python code suggests it is the ratio of the number of occurrences to the number of distinct words.

For example, if you take the sentence "the faster Harry is faster", the main text suggests that TF("faster") == 2 / 5, and the Python code suggests TF("faster") == 2 / 4.
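
To make the two readings concrete, here is a minimal sketch. It assumes naive lowercasing and whitespace tokenization, and the function names are made up for illustration; they are not the book's code.

```python
from collections import Counter


def term_counts(text):
    """Raw integer counts per lowercased, whitespace-split token."""
    return Counter(text.lower().split())


def tf_by_total_tokens(term, text):
    """Occurrences of `term` divided by the total number of tokens."""
    counts = term_counts(text)
    return counts[term] / sum(counts.values())


def tf_by_distinct_tokens(term, text):
    """Occurrences of `term` divided by the number of distinct tokens."""
    counts = term_counts(text)
    return counts[term] / len(counts)


sentence = "the faster Harry is faster"
tf_total = tf_by_total_tokens("faster", sentence)        # 2 / 5, the main text's reading
tf_distinct = tf_by_distinct_tokens("faster", sentence)  # 2 / 4, the code's reading
print(tf_total, tf_distinct)
```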
428125 (18)
Yes, that is an error in the code. It has been corrected. Your summary definition is technically correct, and that's what we meant in the text. However, statisticians mean something very specific when they use the word "frequency." They mean the raw integer count of the occurrences of something. So in statistics (which NLP is most closely associated with these days), "term frequency" in a TF-IDF calculation is just the count of occurrences. It is not yet normalized (or weighted) for the length of the document (number of terms or tokens or n-grams in a document), or its prevalence among other documents (IDF).

However, many NLP practitioners use "term frequency" as shorthand for "normalized term frequency." Unfortunately, "normalized" can mean a lot of different things, so it's not good practice to use "term frequency" as casually as we did. We'll fix that in Chapter 3 to make it more precise and clear. We'll use the word "frequency" to mean raw integer count, the same way statisticians use that word.

There are many possible normalizations (or weightings) of term frequency that are used in NLP. Term frequency is sometimes divided by the total number of terms or n-grams in that document, so that it is normalized by the document length. But most Python packages save normalization for last, and normalize by the 2-norm length of the TF-IDF vector instead. For plain counts this gives the same final vector as normalizing for document length up front, because dividing every entry by the same per-document constant is cancelled out by the vector normalization.

For TF-IDF, term frequency is also weighted by the prevalence of that token (term or n-gram) in the rest of the corpus, that is, by the number of different documents it appears in (the IDF part). The resulting list of weighted term frequencies for a document (the TF-IDF vector for that document) is then usually normalized by the "length" (2-norm) of that vector. This is what "vector normalization" means in most machine learning worlds. In the end, for each document you get a normalized TF-IDF vector of "weighted term frequency" values between zero and one, and all your documents have vectors of the same magnitude (2-norm) of 1. Most TF-IDF implementations (including gensim and scikit-learn) do it this way by default.
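
Here is a rough sketch of that pipeline in plain Python, just to show the order of operations. The toy corpus, the helper names, and the unsmoothed `log(N / df)` IDF are all assumptions for illustration; gensim and scikit-learn use their own (often smoothed) IDF formulas, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "the faster harry is faster",
    "harry is hairy and faster than jill",
    "jill is not as hairy as harry",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Assumed IDF variant: log(N / df), with no smoothing.
idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}


def l2_normalize(vec):
    """Scale a vector so that its 2-norm is 1 (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def tfidf_vector(tokens, length_normalize_first=False):
    """Raw count (optionally divided by document length) times IDF, then L2-normalized."""
    counts = Counter(tokens)
    scale = len(tokens) if length_normalize_first else 1
    raw = [counts[t] / scale * idf[t] for t in vocab]
    return l2_normalize(raw)


vec_counts = tfidf_vector(tokenized[0])
vec_ratios = tfidf_vector(tokenized[0], length_normalize_first=True)
for term, a, b in zip(vocab, vec_counts, vec_ratios):
    print(f"{term:>7}  {a:.3f}  {b:.3f}")
```

The two printed columns match: once the vector is scaled to unit 2-norm, it no longer matters whether the counts were divided by the document length first.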

The editors are reviewing our updates to Chapters 1-3, which include this code correction. We'll also add more details about "term frequency." And we've already given the editors 4 new chapters, which they are reviewing "early this week." They'll hopefully release them to you soon.

BONUS: Attached is a diagram we are working on for the chapter on semantic analysis. It relies on "compressing" the TF-IDF vectors down to a more reasonable size and producing a much more meaningful vector using a linear algebra algorithm called SVD and then truncating that vector (TruncatedSVD in scikit-learn). I just thought I'd give you a head start on picking up this really cool technique. Obviously, it'll make more sense with the chapter text to go along with it.

When looking at the matrices in the attached SVD equation, keep in mind that we are trying to transform our TF-IDF vectors (the W matrix on the far left) into the topic vectors in the last matrix (the U matrix on the far right). To do that we use the SVD algorithm to compute the word-topic matrix (V just to the right of the equal sign). We truncate it by ignoring the topics with low magnitude in the singular values matrix (S in the middle) to keep the dimensions low. And then we just multiply this smaller V matrix (using the inner product or dot product) by the TF-IDF vector for any document, and that gives us a topic vector!
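
If you want to poke at this before the chapter is out, here is a small NumPy sketch of the idea. It is not the book's code: the matrix values are made up, and the terms-as-rows, documents-as-columns layout is an assumption chosen to match the description of the diagram. The variable names describe each factor's role rather than reusing the letters in the diagram.

```python
import numpy as np

# Toy TF-IDF matrix: rows are terms, columns are documents (values are made up).
W = np.array([
    [0.9, 0.0, 0.1, 0.0],
    [0.4, 0.1, 0.0, 0.8],
    [0.0, 0.7, 0.6, 0.0],
    [0.1, 0.7, 0.5, 0.2],
    [0.0, 0.0, 0.6, 0.6],
])

# SVD factors W into (word-topic) x (singular values) x (topic-document).
word_topic, singular_values, topic_doc = np.linalg.svd(W, full_matrices=False)

# Truncate: keep only the topics with the largest singular values.
# np.linalg.svd returns singular values in descending order.
n_topics = 2
word_topic_k = word_topic[:, :n_topics]

# Project one document's TF-IDF vector into the smaller topic space.
doc_tfidf = W[:, 0]
topic_vector = word_topic_k.T @ doc_tfidf
print(topic_vector)
```

Note that scikit-learn's TruncatedSVD expects documents as rows, so the matrix would need to be transposed relative to this sketch.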