
kb0000 (9)
#1
At the end of section 4.5.5, the Tip box recommends discarding the eigenvalues, but there is no Sigma matrix in the code examples in that section, and it is not obvious to me how to discard them properly to avoid the situation the tip describes. The scikit-learn docs for `TruncatedSVD.fit_transform()` also do not offer much of a clue.

Might we get a code snippet that shows how to do this for each of the named implementations (LSA, PCA, and SVD)? If this is done elsewhere, a note mentioning that would work as well.
hobs (58)
#2
Even though TruncatedSVD does not ignore the eigenvalues, if the resulting topic vectors are normalized to unit length, that normalization accomplishes the same thing as ignoring the scaling by the Sigma matrix in the first place. I will add some detail to make this clearer.
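In the meantime, here is a rough, untested sketch of what I mean (the random matrix and the variable names are just stand-ins for your real TF-IDF matrix): you can either divide the singular values back out of the TruncatedSVD output, or simply unit-normalize the topic vectors.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

rng = np.random.RandomState(0)
tfidf_docs = rng.rand(100, 50)        # stand-in for a real TF-IDF matrix

svd = TruncatedSVD(n_components=16)
topic_vectors = svd.fit_transform(tfidf_docs)      # rows are roughly U * Sigma

# Option 1: divide out the singular values, leaving only the U part
u_only = topic_vectors / svd.singular_values_

# Option 2: rescale every topic vector to unit length, so downstream
# cosine similarities no longer depend on the overall vector magnitudes
unit_topic_vectors = Normalizer(norm='l2').fit_transform(topic_vectors)
```
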
578126 (5)
#3
Ah, so I think you're trying to clarify the order of SVD and normalization. The text on p. 141 of the current PDF (section 4.4.5) indicates

Step 1. Normalizing our TF-IDF vectors...

but I think you mean

...our TF-IDF-based topic vectors.

Please confirm if this is what you meant.

Secondly, when should the data be centered? I interpreted the sequence of code to mean that the input to SVD was pre-centered*.

Overall, I was pretty convinced that the data should be standardized (centered and normalized) before SVD, as is often done prior to other model fitting. Put differently, it is unclear how post-normalization (and/or post-centering) eliminates scaling bias...'squaring up', etc. It's not your job to remediate our poor linear algebra knowledge, but clarification on the order of operations would be appreciated.

*Reusing the same name for different objects makes it harder to track through the code examples in the text. For example, you have `tfidf_docs = tfidf_docs - tfidf_docs.mean()` on p. 134, but it's unclear whether the instance of `tfidf_docs` on p. 139 is the centered or the original form.

What prompted all this was that I was trying to center and normalize the TF-IDF vectors of 44,000 documents with a vocabulary of about 14k before running `TruncatedSVD` with scikit-learn, but the process was killed (SIGKILL) at the `numpy.linalg.norm` stage, presumably due to excessive memory usage. Even a 1% sample failed to run.

I trust that the topic vectors are more manageable with 8 GB of RAM?!
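In case it helps anyone else who hits this wall, here is roughly the shape of the pipeline I was attempting, scaled down to a toy corpus. Keeping the TF-IDF matrix sparse and using sklearn's Normalizer instead of a dense `numpy.linalg.norm` call is just my guess at a more memory-friendly route, and I skipped centering here because it would densify the sparse matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD

# toy stand-in for my 44,000-document corpus
docs = ["the cat sat on the mat",
        "dogs chase cats around the yard",
        "I love natural language processing",
        "SVD finds topics in TF-IDF matrices"] * 250

tfidf_docs = TfidfVectorizer().fit_transform(docs)   # scipy sparse matrix

# Normalizer works row by row on the sparse matrix, so nothing is densified.
# (TfidfVectorizer already L2-normalizes rows by default, so this is a no-op
# here, but it makes the step explicit.)
tfidf_docs = Normalizer(norm='l2').fit_transform(tfidf_docs)

topic_vectors = TruncatedSVD(n_components=8).fit_transform(tfidf_docs)
```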

Once again, this is an incredibly helpful book.

hobs (58)
#4
I'm afraid I just don't know the answer to your question. It may be that there is no one "right" way to do normalization and centering. Each of the TF-IDF-based LSA implementations I've looked at does it differently. If you want to implement the underlying math consistently, you could use the order of operations that duplicates the results of other reputable implementations, like `sklearn.decomposition.PCA` and `sklearn.feature_extraction.text.TfidfVectorizer`. Alternatively, you could just use the order of operations that works best for your application on your test set data.
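For example, here is a quick (toy, untested on real data) way to check that you are duplicating PCA's order of operations: PCA centers the columns and then does an SVD, so centering the matrix yourself and running TruncatedSVD on it should agree with PCA up to per-component sign flips. The random matrix below is just a stand-in for a TF-IDF matrix that fits in memory.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
tfidf_docs = rng.rand(200, 30)          # stand-in for a (dense) TF-IDF matrix

pca_vecs = PCA(n_components=5).fit_transform(tfidf_docs)

# PCA centers the columns before its SVD, so do the same by hand
centered = tfidf_docs - tfidf_docs.mean(axis=0)
svd_vecs = TruncatedSVD(n_components=5, algorithm='arpack').fit_transform(centered)

# identical up to the sign of each component
print(np.allclose(np.abs(pca_vecs), np.abs(svd_vecs), atol=1e-6))
```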