
staran (8) [Avatar] Offline
#1
Principal Component Analysis
I have been working through the PCA chapter to understand how to do dimensionality reduction using the mtcars dataset. How many components should I retain in a PCA?


1) Do we omit the response variable from the dataset we input to principal?
2) princomp allows entry of a formula; how do we achieve something similar using principal?
3) I am unable to interpret the PC1 and PC2 values, as well as the RC1 and RC2 values. I understand PC1 and PC2 contribute 84% of the variance, so which variables are significant?

Here is the output I obtained for mtcars. Any interpretation of the results will be appreciated.


> carss=principal(mtcars[,1:11],nfactors=2,scores=TRUE,rotate="none")
> carss
Principal Components Analysis
Call: principal(r = mtcars[, 1:11], nfactors = 2, rotate = "none",
scores = TRUE)
Standardized loadings based upon correlation matrix
PC1 PC2 h2 u2
mpg -0.93 0.03 0.87 0.131
cyl 0.96 0.07 0.93 0.071
disp 0.95 -0.08 0.90 0.098
hp 0.85 0.41 0.88 0.116
drat -0.76 0.45 0.77 0.228
wt 0.89 -0.23 0.85 0.154
qsec -0.52 -0.75 0.83 0.165
vs -0.79 -0.38 0.76 0.237
am -0.60 0.70 0.85 0.146
gear -0.53 0.75 0.85 0.150
carb 0.55 0.67 0.76 0.244

PC1 PC2
SS loadings 6.61 2.65
Proportion Var 0.60 0.24
Cumulative Var 0.60 0.84

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 55 and the objective function was 15.4
The degrees of freedom for the model are 34 and the objective function was 2.95
The number of observations was 32 with Chi Square = 74.21 with prob < 8.1e-05

Fit based upon off diagonal values = 0.99
> carssrc=principal(mtcars[,1:11],nfactors=2,scores=TRUE,rotate="varimax")
> carssrc
Principal Components Analysis
Call: principal(r = mtcars[, 1:11], nfactors = 2, rotate = "varimax",
scores = TRUE)
Standardized loadings based upon correlation matrix
RC1 RC2 h2 u2
mpg 0.68 -0.63 0.87 0.131
cyl -0.64 0.72 0.93 0.071
disp -0.73 0.60 0.90 0.098
hp -0.32 0.88 0.88 0.116
drat 0.85 -0.21 0.77 0.228
wt -0.80 0.46 0.85 0.154
qsec -0.16 -0.90 0.83 0.165
vs 0.30 -0.82 0.76 0.237
am 0.92 0.08 0.85 0.146
gear 0.91 0.17 0.85 0.150
carb 0.08 0.87 0.76 0.244

RC1 RC2
SS loadings 4.67 4.59
Proportion Var 0.42 0.42
Cumulative Var 0.42 0.84

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 55 and the objective function was 15.4
The degrees of freedom for the model are 34 and the objective function was 2.95
The number of observations was 32 with Chi Square = 74.21 with prob < 8.1e-05

Fit based upon off diagonal values = 0.99

> carss$scores
PC1 PC2
Mazda RX4 -0.2516308982 1.04919357
Mazda RX4 Wag -0.2409801807 0.93709938
Datsun 710 -1.0641633049 -0.08854287
Hornet 4 Drive -0.1193693965 -1.42860381
Hornet Sportabout 0.7559836345 -0.45608682
Valiant -0.0214936918 -1.68432399
Duster 360 1.1496507138 0.20246195
Merc 240D -0.7869352163 -0.88580026
Merc 230 -0.8757928531 -1.19917506
Merc 280 -0.2015385197 -0.09794749
Merc 280C -0.1949623577 -0.19581597
Merc 450SE 0.8606317650 -0.41320595
Merc 450SL 0.7840613648 -0.41305282
Merc 450SLC 0.8226243650 -0.48470540
Cadillac Fleetwood 1.4931345709 -0.50055025
Lincoln Continental 1.5139372504 -0.44337838
Chrysler Imperial 1.3756612988 -0.25460433
Fiat 128 -1.4764769496 -0.17940646
Honda Civic -1.6287652388 0.41619252
Toyota Corolla -1.6211797995 -0.16884804
Toyota Corona -0.7290594069 -1.28158472
Dodge Challenger 0.8365260349 -0.61316242
AMC Javelin 0.7134440470 -0.54801873
Camaro Z28 1.1061255250 0.41160510
Pontiac Firebird 0.8599075525 -0.52827815
Fiat X1-9 -1.3683852489 -0.07327586
Porsche 914-2 -1.0151008644 1.23716872
Lotus Europa -1.2963042050 0.83345590
Ford Pantera L 0.5256718947 2.11598496
Ferrari Dino -0.0003790165 1.94614547
Maserati Bora 1.0219431622 2.64780919
Volvo 142E -0.9267860308 0.14125103
>


Thanks in advance for your attention.

Sanjeev Taran
robert.kabacoff (170) [Avatar] Offline
#2
Re: Principal Component Analysis
Hi Sanjeev,

Good questions. Please see my responses below.

> 1) Do we omit the response variable from the dataset
> we input to principal?

In princomp, you would just leave the variable out of the formula. In principal, you enter a data frame (or matrix) of variables, or a correlation matrix; simply don't include the variable in the data you pass in.
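
For instance, a minimal sketch (treating mpg as a hypothetical response to hold out):

library(psych)

# Drop the response (here, hypothetically mpg) before calling principal()
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
fit <- principal(predictors, nfactors = 2, rotate = "varimax", scores = TRUE)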

> 2) princomp allows entry of a formula; how do we
> achieve something similar using principal?

I'm not sure you can. The principal function expects a data frame (or matrix) of variables, or a correlation matrix.
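
That said, one workaround (a sketch, not something from the book) is to let model.frame() evaluate a one-sided formula and pass the resulting data frame to principal():

library(psych)

# model.frame() returns just the columns named in a one-sided formula,
# which principal() will accept like any other data frame
vars <- model.frame(~ cyl + disp + hp + drat + wt, data = mtcars)
fit <- principal(vars, nfactors = 2, rotate = "varimax", scores = TRUE)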

> 3) I am unable to interpret the PC1 and PC2 values, as
> well as the RC1 and RC2 values. I understand PC1 and
> PC2 contribute 84% of the variance, so which variables
> are significant?
>
> Here is the output I obtained for mtcars. Any
> interpretation of the results will be appreciated.
>
> [unrotated principal() output snipped; see post #1]

It is very rare to be able to interpret an unrotated solution that has more than one component. That is why we use rotation.
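
If you want to see what rotation buys you, one illustrative way is to put the two sets of loadings side by side, or to plot the rotated solution:

library(psych)

pc_none <- principal(mtcars[, 1:11], nfactors = 2, rotate = "none")
pc_var  <- principal(mtcars[, 1:11], nfactors = 2, rotate = "varimax")

# Side-by-side loadings: rotation pushes each variable toward loading
# mainly on a single component, which is what makes it interpretable
round(cbind(unclass(pc_none$loadings), unclass(pc_var$loadings)), 2)

# Graphical view of which variables load on which rotated component
fa.diagram(pc_var)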

> [varimax-rotated output and component scores snipped; see post #1]

The interpretability of an exploratory principal components or factor analysis is only as good as the variables entered. If there is no reason to think that the variables will cluster to form meaningful composites, you probably will not find any.

In the rotated mtcars example:

1. The higher a car's score on the RC1 component, the higher its rear axle ratio (drat), the more forward gears it has, the higher its mpg, the lower its weight, the fewer cylinders it has, the lower its displacement, and the more likely it is to have a manual transmission. The lower the score, the more the opposite holds.

2. The same approach is used to interpret RC2. Here, higher scores on the component indicate greater horsepower, more carburetors, more cylinders, a lower quarter-mile time (qsec), a lower V/S value, and lower mpg.

I don't know if these are useful components. It would depend on your knowledge of cars.
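
One quick sanity check (a sketch using the carssrc object from your post) is to list the cars at the extremes of each rotated component and see whether they match this reading:

# Cars ranked by each rotated component score
rc <- carssrc$scores
head(rownames(rc)[order(rc[, "RC1"], decreasing = TRUE)])  # light, manual, high-mpg cars
head(rownames(rc)[order(rc[, "RC2"], decreasing = TRUE)])  # high-hp, many-carburetor cars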

Hope this helps.
staran (8) [Avatar] Offline
#3
Re: Principal Component Analysis
Robert,

Thank you for replying to my post. Frankly, I was not sure that this forum was being monitored, and I really appreciate the quick response. I apologize if my questions are too naive. BTW, I really enjoyed your book and got a good grasp of the language working through your examples.

I understand your interpretation of the positive and negative loadings for the variables after rotation (RC1 and RC2). So we started with the following 11 variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, and got 2 principal components.

1) Which components can we skip (to reduce dimensions), based upon our analysis?
2) What are the two components we are left with? Assuming these are called PC1 and PC2: a) do we work with the scores if we have to use these in further analysis of the dataset, and b) you mentioned in your book that for the Harman23 dataset you would take the first composite variable as the mean of the standardized scores for the first four variables, and the second as the mean of the standardized scores for the second four variables. What values should I be using here for the components?

Also, say I am working with 100 dimensions that I am trying to reduce. I do not want to end up with two principal components, as in this example. So, back to my original question: which dimensions have exhibited multicollinearity or are not significant and can be ignored? Is the intent of this exercise not to get a list of the 20-25 variables that are significant?

Thanks for your time.
Kind Regards.
robert.kabacoff (170) [Avatar] Offline
#4
Re: Principal Component Analysis

> Thank you for replying to my post. Frankly, I was
> not sure that this forum was being monitored, and I
> really appreciate the quick response.

Now that the book is finally done, I will be monitoring this forum much more closely. Thanks for your patience.

>
> I understand your interpretation of the positive and
> negative loadings for the variables after rotation
> (RC1 and RC2). So we started with the following 11
> variables: mpg, cyl, disp, hp, drat, wt, qsec, vs,
> am, gear, carb, and got 2 principal components.
>
> 1) Which components can we skip (to reduce
> dimensions), based upon our analysis?

The scree test suggests that both should be retained. We started with 11 dimensions (one for each variable) and have reduced the data to 2 dimensions. If these two dimensions are interpretable and meaningful, we could replace the 11 variables with the scores on the two components in further analyses. In other words, we have reduced our problem from dealing with 11 correlated variables to dealing with 2 uncorrelated variables.
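
If you want to run that check yourself, the approach used in the chapter is fa.parallel() from the psych package, which overlays a scree plot with parallel analysis:

library(psych)

# Scree plot plus parallel analysis; fa = "pc" shows components only.
# Components whose eigenvalues exceed the simulated (random-data) values
# are the ones worth retaining.
fa.parallel(mtcars[, 1:11], fa = "pc", n.iter = 100,
            main = "Scree plot with parallel analysis")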

> 2) What are the two components we are left with?
> Assuming these are called PC1 and PC2: a) do we work
> with the scores if we have to use these in further
> analysis of the dataset, and b) you mentioned in your
> book that for the Harman23 dataset you would take the
> first composite variable as the mean of the
> standardized scores for the first four variables, and
> the second as the mean of the standardized scores for
> the second four variables. What values should I be
> using here for the components?

In this case, I would probably work with the component scores produced by the program. You will find them in the "scores" component of the list returned by the principal() function. Here, each original variable is well described by the factors (looking at the h2 communalities column). In other problems, I may drop variables that are not well explained by the factor solution and re-run the analysis.
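
Concretely, a sketch of both options, using the carssrc object from above (the follow-up model and the RC1 variable selection are only illustrative):

# Option 1: carry the program's component scores forward
scores <- as.data.frame(carssrc$scores)
dat <- cbind(mtcars, scores)
# lm(some_outcome ~ RC1 + RC2, data = dat)   # hypothetical follow-up model

# Check which variables the solution explains poorly (low h2 values
# are candidates to drop before re-running the analysis)
round(carssrc$communality, 2)

# Option 2, the Harman23-style composite: the mean of the standardized
# variables that load mainly on a component, e.g. for RC1
rc1_vars  <- c("drat", "am", "gear")
composite <- rowMeans(scale(mtcars[, rc1_vars]))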

>
> Also, say I am working with 100 dimensions that I am
> trying to reduce. I do not want to end up with two
> principal components, as in this example. So, back to
> my original question: which dimensions have exhibited
> multicollinearity or are not significant and can be
> ignored? Is the intent of this exercise not to get a
> list of the 20-25 variables that are significant?

The goal is not to identify significant variables; this is not really a feature selection process like what you will see in data mining. It is to replace a large number of correlated variables with a smaller number of (usually uncorrelated) interpretable underlying dimensions (factors) or summary scores (components). The researcher may be satisfied with simply identifying the factors, to answer such questions as "How many types of intelligence underlie human problem solving, and what are they?" In other cases, the researcher may want to simplify an analysis by replacing 100 variables with 7 or 8 summary variables determined by principal components or factor analysis. How to decide how many such components are needed is articulated in the chapter.
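
As an illustrative workflow (bigdat here is a hypothetical data frame with ~100 numeric columns):

library(psych)

# 1. Let parallel analysis suggest how many components to retain
pa <- fa.parallel(bigdat, fa = "pc", n.iter = 50)

# 2. Fit that many rotated components
fit <- principal(bigdat, nfactors = pa$ncomp, rotate = "varimax", scores = TRUE)

# 3. Replace the original variables with the component scores downstream
reduced <- as.data.frame(fit$scores)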

I hope this helps.



staran (8) [Avatar] Offline
#5
Re: Principal Component Analysis
You summarized it well: The goal is not to identify the subset of significant variables from a dataset having a large number of variables but to simplify an analysis by replacing 100 variables with 7 or 8 summary variables determined by principal components or factor analysis.

Thanks for your responses.
Regards,
Sanjeev