kb0000
#1
At the end of section 4.5.5, the Tip box recommends discarding the eigenvalues, but there is no Sigma matrix in the section's code examples, and it is not obvious to me how to discard them properly so as to avoid the situation the Tip describes. The scikit-learn docs for TruncatedSVD.fit_transform() do not offer much of a clue either.

Might we get a code snippet that shows how to do this for each of the named implementations (LSA, PCA, and SVD)? If this is done elsewhere, a note mentioning that would work as well.
hobs
#2
Even though TruncatedSVD does not ignore the eigenvalues, if the resulting topic vectors are normalized to unit length, that normalization accomplishes the same thing as ignoring the scaling by the Sigma matrix in the first place. I will add some detail to make this clearer.
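
For example, something like this rough sketch (my own variable names and toy corpus, not the book's code) shows both options with scikit-learn: dividing TruncatedSVD's output by its singular_values_ recovers U, i.e. drops the Sigma scaling, while an L2 Normalizer gives unit-length topic vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs do not get along"]

tfidf_docs = TfidfVectorizer().fit_transform(corpus)   # sparse (n_docs, n_terms)

svd = TruncatedSVD(n_components=2)
topic_vectors = svd.fit_transform(tfidf_docs)          # rows are U * Sigma

# Option 1: divide out the singular values, leaving U ("discarding" Sigma)
topic_vectors_U = topic_vectors / svd.singular_values_

# Option 2: L2-normalize each topic vector so that downstream cosine
# similarities no longer depend on the overall scale of each document's vector
topic_vectors_unit = Normalizer(norm='l2').fit_transform(topic_vectors)
```

The Normalizer step is just row-wise division by each topic vector's L2 norm.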
578126
#3
Ah, so I think you're trying to clarify the order of SVD and normalization. The text on p. 141 of the current PDF (section 4.4.5) indicates

Step 1. Normalizing our TF-IDF vectors...

but I think you mean

...our TF-IDF-based topic vectors.

Please confirm if this is what you meant.

Secondly, when should the data be centered? I interpreted the sequence of code to mean that the input to SVD was pre-centered*.

Overall, I was pretty convinced that the data should be standardized (centered and normalized) before the SVD, as is often done prior to other model fitting. Put differently, it is unclear how normalizing (and/or centering) after the fact eliminates scaling bias, 'squares up' the vectors, and so on. It's not your job to remediate our poor linear algebra knowledge, but clarification on the order of operations would be appreciated.

*The reuse of the same variable name makes it harder to follow the code examples through the text. For example, you have tfidf_docs = tfidf_docs - tfidf_docs.mean() on p. 134, but it's unclear whether the tfidf_docs on p. 139 is the centered or the original form.

What prompted all this was that I was trying to center and normalize the TF-IDF vectors of 44,000 documents (with a vocabulary of about 14k) before TruncatedSVD with scikit-learn, but the process was killed (SIGKILL) at the numpy.linalg.norm stage, presumably due to excessive memory usage. Even a 1% sample failed to run.

I trust that the topic vectors are more manageable with 8 GB of RAM?!

Once again, this is an incredibly helpful book.

hobs
#4
I'm afraid I just don't know the answer to your question. It may be that there is no one "right" way to do normalization and centering; each of the TF-IDF-based LSA implementations I've looked at does it differently. If you want to implement the underlying math consistently with other reputable implementations, you could use the order of operations that duplicates their results, e.g. `sklearn.decomposition.PCA` and `sklearn.feature_extraction.text.TfidfVectorizer`. Alternatively, you could just use the order of operations that works best for your application on your test set data.
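
For what it's worth, one concrete difference in order of operations: sklearn's PCA mean-centers the columns before its SVD, while TruncatedSVD does not, so if you want PCA-style behavior you can center by hand before TruncatedSVD. A rough sketch (a random matrix stands in for real TF-IDF data, and the variable names are mine):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
tfidf_docs = rng.random((100, 20))   # stand-in for a small, dense TF-IDF matrix

# PCA centers the columns (removes each term's mean) before its SVD...
pca_vectors = PCA(n_components=5).fit_transform(tfidf_docs)

# ...TruncatedSVD does not, so centering by hand first should reproduce PCA
centered = tfidf_docs - tfidf_docs.mean(axis=0)
svd_vectors = TruncatedSVD(n_components=5, algorithm='arpack').fit_transform(centered)

# The two agree up to per-column sign flips, which any SVD leaves arbitrary
print(np.allclose(np.abs(pca_vectors), np.abs(svd_vectors), atol=1e-6))
```

That at least pins down what "centered before the SVD" means in code; whether and when you also normalize the rows is the part that varies from one implementation to the next.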