In Chapter 4 (Listing 4.6 and Figure 4.3), you provide a down-to-earth evaluation of whether the number of putative topics (derived by SVD) is reasonable or not. Let's call this number of topics k.

I would like to follow a similar approach to determine a reasonable k within a Gensim-based LSA pipeline on a 44k-document, 30k-term corpus. I have sometimes run out of memory on my laptop (16 GB RAM, Python 3.5, Ubuntu 18.04).

I am currently trying to work around the memory limits by using sparse arrays; Gensim provides utilities in gensim.matutils to convert its model objects to and from SciPy sparse matrices.
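For reference, this is the conversion I mean, as a minimal sketch: it assumes corpus is my streamed TF-IDF corpus and uses my real vocabulary size.

from gensim.matutils import corpus2csc

# corpus2csc builds a scipy.sparse term-document matrix column by
# column, so a dense array is never materialized
tdm = corpus2csc(corpus, num_terms=30000)
print(type(tdm), tdm.shape)  # csc_matrix, (30000, 44000)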

It would greatly enhance this section if you could provide guidance on how to minimize the memory footprint. Some industrial-strength code in Listing 4.6, despite the toy size of the example data, would be even better. I read Chapter 13 and did not find such guidance.

Some questions I am researching include:
1. Matrix operations on sparse arrays, and the ambiguity of which dot-product method to use: SciPy's, NumPy's, or is the call routed automatically based on the input types? (See the first sketch after this list.)

2. How does scikit-learn's TruncatedSVD manage to provide explained_variance_ratio_ whereas Gensim does not? Is explained_variance_ratio_ the actual retained variance, i.e., variance normalized by the total variance in X, rather than by the variance of just the k components specified? If so, it would seem to imply that TruncatedSVD calculates the full S matrix, where S is the diagonal matrix of the singular values s_1 through s_n with n = len(terms_in_corpus), not just the first k, which isn't possible with NumPy's SVD method or with Gensim (short of setting k to n). (See the second sketch after this list.)

3. How to calculate the total variance of a TF-IDF corpus some other way than via the sum over S, which cannot be computed directly due to memory limits. Expressions such as variance = E[X^2] - E[X]^2 are feasible, where X is the TF-IDF corpus, but this doesn't yield results comparable to the sum of S over k; the two differ a lot in magnitude. To be clearer:

from scipy.sparse.linalg import norm as sparse_norm

# tdm is the scipy.sparse CSR term-document matrix
a = tdm.mean()           # E[X] over all cells of the TF-IDF matrix
b = tdm.multiply(tdm)    # element-wise X^2, stays sparse
corpus_variance = b.mean() - a ** 2

corpus_variance2 = sparse_norm(tdm) / tdm.shape[1]
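To make question 1 concrete, here is a minimal sketch on random data (not my corpus). As far as I can tell, SciPy sparse matrices accept both .dot and the @ operator, and the multiplication is routed to SciPy's sparse implementation whenever the left operand is sparse:

import numpy as np
from scipy.sparse import random as sparse_random

A = sparse_random(1000, 50, density=0.01, format='csr')
B = sparse_random(50, 1000, density=0.01, format='csr')

C1 = A.dot(B)  # instance method: SciPy's sparse matmul
C2 = A @ B     # @ dispatches to the same sparse matmul
assert (C1 != C2).nnz == 0  # identical results, both still sparse

# np.dot(A, B) seems to be the call to avoid: NumPy does not
# reliably recognize the sparse type, so the result may not be
# a proper matrix product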
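And for question 2, this check on random data suggests that explained_variance_ratio_ is normalized by the total variance of X, computed column-wise from X itself, which would not require the full S at all:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(100, 30)

svd = TruncatedSVD(n_components=5, random_state=0)
Xt = svd.fit_transform(X)

# the ratio appears to be var(components) / total variance of X
total_var = np.var(X, axis=0).sum()
print(np.allclose(svd.explained_variance_ratio_,
                  np.var(Xt, axis=0) / total_var))  # True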


After rewatching Andrew Ng's ML lecture on PCA, it's clear that while the ratio of matrix norms should be equivalent to (one minus) the ratio of sums over S, the respective terms of these two ratios are in different units.
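Here is a small numeric check of that units issue, on random dense data rather than my corpus. The squared Frobenius norm equals the sum of the squared singular values, so a variance needs s**2, not s, and centering matters:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 50)

# squared Frobenius norm == sum of squared singular values
s = np.linalg.svd(X, compute_uv=False)
print(np.allclose(np.linalg.norm(X, 'fro') ** 2, (s ** 2).sum()))  # True

# with centered data, the total variance is that same sum divided
# by the number of observations
Xc = X - X.mean(axis=0)
s_c = np.linalg.svd(Xc, compute_uv=False)
print(np.allclose(np.var(Xc, axis=0).sum(),
                  (s_c ** 2).sum() / X.shape[0]))  # True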

I'll close with the code I'm currently working on, with apologies for not providing a fully reproducible example (the data, after all, are quite large).

# For explanation, not reproduction

import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import norm as sparse_norm
from gensim.matutils import corpus2dense

# 1. Obtain singular vectors / values from the Gensim LSI model's SVD
U = lsi.projection.u    # lsi is the trained LsiModel
s_k = lsi.projection.s  # 1-D vector of the k largest singular values
# V derived per Gensim's FAQ Q4:
# https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi
V = corpus2dense(lsi[corpus], len(s_k)).T / s_k

# 2. Use SciPy to convert the dense arrays to sparse ones;
#    S is built directly as a sparse diagonal from s_k
U_s = csr_matrix(U)
S_s = diags(s_k)
V_s = csr_matrix(V)

# 3. Compute the product of the three sparse arrays
#    (.dot and .T work directly on SciPy sparse matrices)
reconstructed_tdm = U_s.dot(S_s).dot(V_s.T)

# 4. Root-mean-square reconstruction error; the difference of two
#    sparse matrices is itself sparse, so square it with .power
#    rather than the dense .values / .flatten idiom
diff = reconstructed_tdm - tdm
reconstruction_error = np.sqrt(diff.power(2).sum() / np.prod(tdm.shape))