503679
#1
Hi All,

I'm analyzing the SMS spam dataset below to gain a better understanding of how PCA can be used with a classifier such as LDA (Linear Discriminant Analysis). Following the code up to page 99, I have:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
from nlpia.data.loaders import get_data

sms = get_data('sms-spam')

# Label each message "smsN", appending a "!" if it is flagged as spam
index = ['sms{}{}'.format(i, '!'*j) for (i, j) in zip(range(len(sms)), sms.spam)]
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms.spam = sms.spam.astype(int)
sms.head(6)

from sklearn.decomposition import PCA

# TF-IDF vector for each message, mean-centered before PCA
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
tfidf_docs = pd.DataFrame(tfidf_docs, index=index)
tfidf_docs = tfidf_docs - tfidf_docs.mean()
# Project the TF-IDF vectors onto 16 principal components ("topics")
pca = PCA(n_components=16)
pca16_topic_vectors = pca.fit_transform(tfidf_docs)

# Additional code that I added:
pca_copy = pca.fit(tfidf_docs)  # note: fit() returns the fitted PCA object itself, not a copy
pca_copy.explained_variance_ratio_.cumsum()

However, the values returned by explained_variance_ratio_.cumsum() in the last line never reach 1 - can someone please explain why this is happening? (Typically, the proportions of variance explained by the PCA components should sum to 1.)

Thanks.
hobs
#2
Interestingly, `cumsum()` isn't reaching the `1.0` you expected, and `sum()` confirms that the total really is less than `1.0`, which is what you're concerned about:

>>> pca_copy.explained_variance_ratio_.cumsum()
array([0.01209829, 0.02122578, 0.03023511, 0.03861407, 0.04582388,
       0.05276413, 0.05922511, 0.06471775, 0.07010423, 0.07530074,
       0.08020538, 0.08460077, 0.08884538, 0.0930544 , 0.09717685,
       0.10115221])
>>> pca_copy.explained_variance_ratio_.cumsum()[-1]
0.10115221
>>> pca_copy.explained_variance_ratio_.sum()
0.10115221


That `0.1` total "explained variance" from the 16 best features (components) created by PCA is indeed much lower than the `1.0` you were hoping for. The reason is that you've summed the explained variance contributed by only 16 of the 9232 possible components (the number of words in your vocabulary). Some information is lost when you create linear combinations of the 9232 features to produce only 16 components/features. Unless your 9232 features could be divided into 16 groups of perfectly correlated features (exactly equal to each other, so that every time one word is used, its partner words are used in the same document), there's no way your sum would ever equal 1.0.
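
If you want to see this for yourself, you can refit PCA with progressively more components and watch the total explained variance creep toward 1.0. This is just a rough sketch (not from the book); the component counts are arbitrary, and it assumes `tfidf_docs` is the centered TF-IDF DataFrame from your snippet:

from sklearn.decomposition import PCA

# Rough sketch: total explained variance as a function of n_components.
# Assumes `tfidf_docs` is the mean-centered TF-IDF DataFrame built above.
for n in (16, 64, 256, 1024):
    pca_n = PCA(n_components=n).fit(tfidf_docs)
    print(n, round(float(pca_n.explained_variance_ratio_.sum()), 4))
# The total only approaches 1.0 as n_components approaches the rank of
# the TF-IDF matrix (at most min(number of documents, vocabulary size)).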

Before I realized what was going on, I compared the explained variance to the class imbalance in the training set and to the accuracy achievable with a single-feature model. The variance explained by your PCA components is the variance in the original 9232 TF-IDF dimensions, not the variance in the target variable, so none of this is really relevant to your question:

>>> sms.spam.mean()
0.1319
>>> pca_copy.explained_variance_ratio_.sum() / sms.spam.mean()
0.7669
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
>>> lda = LDA()
>>> lda = lda.fit(tfidf_docs, sms.spam)
>>> lda.score(tfidf_docs, sms.spam)
1.0
>>> lda_pca = LDA()
>>> lda_pca = lda_pca.fit(pca16_topic_vectors, sms.spam)
>>> lda_pca.score(pca16_topic_vectors, sms.spam)
0.9568
>>> lda_pca1 = lda_pca.fit(pca16_topic_vectors[:,0].reshape((-1,1)), sms.spam)
>>> lda_pca1.score(pca16_topic_vectors[:,0].reshape((-1,1)), sms.spam)
0.8681  # this is the best you can do if you choose only the single best component to make your predictions


86.8% accuracy is the best you could do if you had only one PCA component as your feature and you scaled/thresholded that feature optimally to predict your target variable.
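
For the curious, here is a rough sketch (not from the book, and not exactly what LDA does internally) of what "thresholding a single component" means: brute-force a cutoff on the first PCA component and measure training-set accuracy. The 200-point threshold grid is arbitrary:

import numpy as np

# Rough sketch: spam/ham prediction from a threshold on the first PCA component.
# Assumes `pca16_topic_vectors` and `sms` exist as in the snippets above.
first_component = pca16_topic_vectors[:, 0]
best_accuracy = 0.0
for threshold in np.linspace(first_component.min(), first_component.max(), 200):
    for sign in (1, -1):  # the sign of a PCA component is arbitrary
        predictions = (sign * first_component > sign * threshold).astype(int)
        best_accuracy = max(best_accuracy, (predictions == sms.spam.values).mean())
print(best_accuracy)  # should be in the same ballpark as the single-feature LDA score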