research-hack: from nlpia.data.loaders import get_data

Is "loaders" for the beta version? ]]>

My email is huasheng0822@gmail.com. Could you email me? I have a few questions to ask you, please.]]>

One suggestion I would make is defining "overfitting" when you first mention the concept. I came to the book with a linguistics background (rather than data science/ML), and had no idea what that meant.

Thanks, and keep up the great work, I'm loving the book so far!]]>

[quote]The last matrix factor, V, ... is the "answer", the topic vector...but we also want to be able to compute it on a new bag of words or TF-IDF vector...we can compute it anew whenever we need it. To create a row in this matrix we multiply the inverse of the V matrix by any new TF-IDF vector...[/quote]

That is, it appears to say that we do not need to retain the V matrix but can recreate it on demand by...multiplying the inverse of the V matrix by any new vector? If we are trying to calculate the V matrix, I'm not sure how exactly we can take its inverse before we calculate it. Undoubtedly this is described in many other resources, but this seemed to be a section where you wanted to make the discussion self-contained.]]>

Another small thing: in section 4.5.5, where you compare the topic vectors obtained from TruncatedSVD and PCA, you proceed to compute the cosine similarity for the SVD vectors. Is the cosine similarity for the PCA vectors missing, or did you intend it as an exercise for the reader?]]>

[code]In [1]: s="Find textbooks with titles containing 'NLP', or 'natural' and 'language',or 'computational' and 'linguistics'."

In [2]: s

Out[2]: "Find textbooks with titles containing 'NLP', or 'natural' and 'language',or 'computational' and 'linguistics'."

In [3]: len(set(s.split()))

Out[3]: 12

In [4]: import numpy as np

In [5]: np.arange(1, 12 + 1).prod()

Out[5]: 479001600[/code]

]]>

However, many NLP practitioners use the term "term frequency" as shorthand for "normalized term frequency." Unfortunately, "normalized" can mean a lot of different things. So it's not a good practice to be so casual in the use of the term "term frequency" the way we did. We'll fix that in Chapter 3 to make it more precise and clear. We'll use the word "frequency" to mean raw integer count, the same way statisticians use that word.

There are many possible normalizations (or weightings) of term frequency that are used in NLP. Term frequency is sometimes divided by the total number of terms (tokens or n-grams) in that document, so that it is normalized by the document length. But most Python packages save this for last and normalize by the 2-norm length of the TF-IDF vector, which produces a weighted frequency value similar to what you would get if you normalized for document length up front. For TF-IDF, term frequency is normalized by the prevalence of that token (term or n-gram) across the corpus, that is, the number of different documents it appears in. And the list of weighted (or normalized) term frequencies for a document (the TF-IDF vector for a document) is usually normalized again by the "length" (2-norm) of that list of values (vector). This is what "vector normalization" means in most machine learning contexts. In the end, for each document you get a normalized TF-IDF vector of "weighted term frequency" values between zero and one. And all your documents will have vectors of the same magnitude (2-norm) of 1. Most TF-IDF calculations (including [tt][color=darkred]gensim[/color][/tt] and [tt][color=darkred]SciKit Learn[/color][/tt]) do it this way.
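
To make the order of those steps concrete, here is a minimal pure-Python sketch, not the code from the book or from any particular package. The toy corpus, the helper name [tt]tfidf_vector[/tt], and the specific IDF formula (log(N / df) + 1) are assumptions chosen for illustration; real libraries use their own IDF variants and smoothing.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration only.
docs = [
    "natural language processing with python",
    "computational linguistics and natural language",
    "python for data science",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted(set(tok for doc in tokenized for tok in doc))

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
n_docs = len(tokenized)


def tfidf_vector(tokens, divide_by_doc_length=False):
    """Hypothetical helper: return a 2-norm-normalized TF-IDF vector."""
    counts = Counter(tokens)  # raw integer term frequencies
    scale = len(tokens) if divide_by_doc_length else 1
    # Assumed IDF variant: log(N / df) + 1. Libraries differ on this detail.
    weighted = [
        (counts[term] / scale) * (math.log(n_docs / doc_freq[term]) + 1.0)
        for term in vocab
    ]
    length = math.sqrt(sum(w * w for w in weighted))  # 2-norm of the vector
    return [w / length for w in weighted]


vectors = [tfidf_vector(tokens) for tokens in tokenized]
norms = [math.sqrt(sum(w * w for w in vec)) for vec in vectors]

# Same documents, but with counts divided by document length up front.
vectors_len_first = [tfidf_vector(tokens, True) for tokens in tokenized]
```

In this sketch every vector ends up with a 2-norm of 1 and every value falls between zero and one. Dividing the counts by document length first only rescales each vector by a constant, so after the final 2-norm step the two variants agree.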

The editors are reviewing our updates to Chapters 1-3, which include this code correction. We'll also add more details about "term frequency." And we've already given the editors 4 new chapters that they are reviewing "early this week." They'll hopefully release them to you soon.

[b]BONUS[/b]: Attached is a diagram we are working on for the chapter on semantic analysis. It relies on "compressing" the TF-IDF vectors down to a more reasonable size and producing a much more meaningful vector using a linear algebra algorithm called SVD and then truncating that vector ([tt][color=darkred]TruncatedSVD[/color][/tt] in SciKit Learn). I just thought I'd give you a head start on picking up this really cool technique. Obviously, it'll make more sense with the chapter text to go along with it. When looking at the matrices in the attached SVD equation, keep in mind that we are trying to transform our TF-IDF vectors (the [b]W[/b] matrix on the far left) into the topic vectors in the last matrix (the [b]U[/b] matrix on the far right). To do that we use the SVD algorithm to compute the word-topic matrix ([b]V[/b] just to the right of the equal sign). We truncate it by ignoring the topics with low magnitude in the singular values matrix ([b]S[/b] in the middle) to keep the dimensions low. And then we just multiply this smaller V matrix (using the inner product or dot product) by the TF-IDF vector for any document, and that gives us a topic vector! ]]>
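
The truncate-then-multiply step described in the post above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the book's code: the matrix values are random stand-ins rather than real TF-IDF weights, the shapes and [tt]n_topics = 2[/tt] are arbitrary, and W is assumed to have one row per term and one column per document so that the factor names line up with the diagram (W = V S U).

```python
import numpy as np

# Stand-in term-document matrix W: 6 terms (rows) x 4 documents (columns).
# Random values are used only to make the sketch runnable.
rng = np.random.default_rng(0)
W = rng.random((6, 4))

# np.linalg.svd returns three factors; they are named here to match the
# diagram: V (word-topic), s (singular values), U (topic-document).
V, s, U = np.linalg.svd(W, full_matrices=False)

# Keep only the topics with the largest singular values.
n_topics = 2
V_trunc = V[:, :n_topics]  # shape: (6 terms, 2 topics)

# Project a new TF-IDF vector (hypothetical values) into topic space.
# The columns of V are orthonormal, so the transpose does the job that the
# quoted passage calls multiplying by "the inverse of the V matrix".
new_doc = rng.random(6)
topic_vector = V_trunc.T @ new_doc  # shape: (2,)

# Applying the same projection to the original documents gives the first
# n_topics rows of S times U.
doc_topics = V_trunc.T @ W  # shape: (2 topics, 4 documents)
```

Under these assumptions, V only has to be computed once from the corpus. After that, any new document's topic vector is a single matrix-vector product, so the full topic matrix does not have to be stored.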

Perhaps we should wait until they are used in an example pipeline to introduce one-hot vectors.]]>

]]>