578126 (5) [Avatar] Offline
#1
In Chapter 4 (Listing 4.6 and Figure 4.3), you provide a down-to-earth evaluation of whether the number of putative topics (derived by SVD) is reasonable or not. Let's call this number of topics k.

I would like to follow a similar approach to determine a reasonable k within a Gensim-based LSA pipeline on a 44k-document, 30k-term corpus. Memory has sometimes been insufficient with my 16GB RAM laptop running Python 3.5 on Ubuntu 18.04.

I am currently trying to figure out a way around memory limits by working with sparse arrays, for which Gensim provides a utility to translate its model objects.

It would greatly enhance this section if you could provide guidance on how to minimize memory footprint. Some industrial strength code in Listing 4.6, despite the toy size of the example data, would be even better. I read Chapter 13 and did not find such guidance.

Some questions I am researching include:
1. Matrix operations on sparse arrays and ambiguity of which dot-product method to use (Scipy, Numpy, or is it auto-routed based on the type of input?)

2. How does scikit-learn's TruncSVD manage to provide explained_variance_ratio_ whereas Gensim does not? Is explained_variance_ratio_ the actual retained variance (variance normalized by the total variance in X, or just up to the number of components (k) specified)? If so, it implies that TruncSVD calculates the full S matrix, where S is the diagonal matrix of the singular values s, through n (len(terms_in_corpus)), not just k, while it isn't possible to do so with Numpy's SVD method or Gensim (by setting k to n.

3. How to calculate the total variance of a TF-IDF corpus another way besides the sum over S, which cannot be computed directly due to memory limits. Expressions such as variance = E[X^2] - E[X]^2 are feasible, where X is the TF-IDF corpus, but it doesn't yield comparable results to the sum of S over k; they differ a lot in magnitude. To be clearer:

a = csr_matrix.mean(tdm)  # csr_matrix is a Scipy sparse matrix utility
b = csr_matrix.multiply(tdm, tdm)
corpus_variance = csr_matrix.mean(b) - a ** 2

corpus_variance2 = sparse_norm(tdm) / tdm.shape[1]


After rewatching Andrew Ng's ML lecture on PCA, it's clear that while the ratio of matrix norms should be equivalent to (1 minus...) the ratio of sum of S, the respective terms of these ratios are in different units.

I'll close with the code I'm currently working on, with apologies for not providing a fully reproducible example (the data, after all, are quite large).

# For explanation, not reproduction

# 1. Obtain singular vectors / values from Gensim LSI model's SVD
U = lsi.projection.u  # lsi is a model object
s_k = lsi.projection.s  # 1-D vector
V = corpus2dense(lsi[corpus], len(s_k)).T / s_k  # derived per Gensim's FAQ #3 https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi

# 2. use Scipy csr_matrix to convert dense to sparse arrays
U_s = csr_matrix(U)
S_s = csr_matrix(S)
V_s = csr_matrix(V)

# 3. Compute dot product of three sparse arrays
# Unclear if operators .dot and .T can be used or if special Scipy methods are required
reconstructed_tdm = U_s.dot(S_s).dot(V_s.T)

# 4. Compute reconstruction error; again, unclear if these operators can be used with sparse arrays
reconstruction_error = np.sqrt(((reconstructed_tdm - tdm).values.flatten() **2).sum() / np.product(tdm.shape))
hobs (58) [Avatar] Offline
#2
Wow! That's a lot of useful insight and some tricky questions! ;)

Regarding "industrial strength" and memory-efficient SVD, we'll definitely attempt to address that in a second edition of the book. In the mean time you may have to do some more research yourself. The `gensim` source code can be a great source of ideas and patterns for "out of core" processing, incremental optimizers/solvers, and sparse matrix multiplication. However they implemented their algorithms is probably a good way to go, because it minimizes the memory footprint on my machine whenever I'm careful about not instantiating my entire corpus or bags of words in memory, but rather generating them as-needed.

Regarding your three "researching" questions, please post your discoveries here whenever you have an update on your research. Here are my thoughts:

1. `numpy` can handle sparse matrix multiplication just fine. Scipy just routes it's linear algebra operators there:

>>> import numpy as np
>>> import scipy
>>> id(scipy.dot) == id(np.dot)
True
>>> A = scipy.sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
>>> v = scipy.sparse.csr_matrix([[1], [0], [-1]])
>>> A.dot(v)
<3x1 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> scipy.dot(A, v)
<3x1 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> np.dot(A, v)
<3x1 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>


2. TruncatedSVD and it's explained_variance method are implemented in `sklearn.decomposition.TruncatedSVD`. Scipy's equivalent is in `scipy.sparse.linalg.svds`. Gensim's is in `gensim.models.lsimodel.stochastic_svd`. You can calculate explained variance yourself from the gensim SVD results (or any SVD) with:

>>> import numpy as np
>>> import sklearn.decomposition
>>> svd = sklearn.decomposition.TruncatedSVD(2)
>>> A = np.random.randn(1000,100)
>>> svd.fit(A)
TruncatedSVD(algorithm='randomized', n_components=2, n_iter=5,
       random_state=None, tol=0.0)
>>> A_2D = svd.transform(A)
>>> np.var(A_2D, axis=0)
array([1.7039018 , 1.65362273])
>>> var = A.shape[1] * np.var(A_2D, axis=0) / np.var(A, axis=0).sum()
>>> var
array([1.6955373 , 1.64550506])
>>> np.abs(var - svd.explained_variance_).round(2)
array([0., 0.])


3. I think you just need to correct the math you're using to compute the total variance. You should be squaring, then summing then dividing by `shape[1] - 1` not `shape[1]`. But `np.var` will do all that efficiently on a sparse matrix for you. However, it won't do it incrementally (also called "out-of-core processing" ). So if you have a severe RAM restriction you should research techniques for incremental computation of things like sum() and var(), or just copy the implementations in the gensim source code.

In general I'd recommend using a computational graph framework like Spark, TensorFlow, Keras, Hadoop, whenever you can't get things done with server that has large RAM. This is the "industrial strength" out-of-core processing that you are searching for.