Excel: Principal Component Analysis (PCA)

magnifying-glass-1001506_640

Learning PCA with Excel

A student in Learning Tree’s Course 195, Power Excel: Analyzing Data to Make Business Decisions, asked “can you do Principal Component Analysis using Excel?” The answer that popped into my head was, “sure, but why would you want to?” If all you want are the results, you’re much better off using R or Matlab.  PCA is easy and you can get a host of important related values and explanatory plots. It occurred to me, however, that doing PCA in Excel would be a terrific way to understand what’s really happening when you do a principal component analysis, and, as an added plus, it would be a great way to review matrix manipulation in Excel.

Normalizing the Iris Data

We’ll start with the world’s most boring dataset, the iris data. These data, which only a hardcore botanist (or Ronald Fisher) could love, are well explored and available everywhere, so we can easily confirm that we have followed the correct path. Our first step will be to subtract the mean values for each of the four data vectors (i.e. columns) so that the mean of each new vector is zero, this is Step 1 in the accompanying workbook which is available here. Later steps in the workbook require the matrix macro library, available here.

To Scale or Not to Scale?

Generally, it is a good idea to divide the values in each data vector by the standard deviation for that vector so that the variance is zero. If we do not do that, then the analysis would change with the size of the numbers. For example, the iris data are in units of centimeters. We should not expect the contribution of the principal components to change if the flowers remained the same but we merely change the units to millimeters (or perhaps the Americans convert them to inches). Yet this anomalous behavior is what will occur if we do not scale.

Unfortunately, R has muddied the waters a bit by making not scaling the default behavior. In the R function prcomp, scale is FALSE by default for compatibility with S, despite the fact that scaling is in general advisable.

In Step 1 in the workbook we will therefore scale the data, as well as, adjust the mean to zero. Note that it is not necessary to do this as a separate step. All the math could be done in a single formula, but that would obscure our goal of actually seeing each individual step.

The Covariance Matrix

We now calculate the covariance matrix to illustrate that it will ultimately yield the same values as the default R analysis. We will also calculate the correlation matrix using the Excel matrix multiplication function MMULT and the TRANSPOSE function, and observe that it is nothing more than a scaled covariance matrix.

{=MMULT(TRANSPOSE(zIrisData),zIrisData)/150}

As I attempt to illustrate in Step 2a  of the PCA-IrisData workbook, the covariance matrix is the transposed data vectors (i.e. row vectors, times the column vectors, divided by the dimension). The dimension, of course, is just the number of data points, but “dimension” sounds more intellectual.

Quick Sidebar – Array Functions

You will recall that since the above formula returns an array, it has to be entered a bit differently. You highlight a four-by-four array of cells, then enter the formula without the curly braces. Then, instead of hitting <enter> you hit <ctrl><shift><enter>. Excel puts in the curly braces as a reminder that this function outputs an array; you can’t type them yourself. Bill Jelen, “Mr. Excel”, calls these array functions “CSE” functions, and I think that’s a great idea because it serves as a constant reminder that you must hit <ctrl><shift><enter>.

The Correlation Matrix

The formula for the correlation matrix is virtually the same, except we adjust the variance.

{=MMULT(TRANSPOSE(zIrisData/StandardDeviations),zIrisData/StandardDeviations)/(150)}

In this formula, ‘StandardDeviations’ is a named range containing the standard deviation for each of the four data vectors.

CorrelationMatrix

A Major Obstacle

Now we hit a serious roadblock. We need to calculate the eigenvalues and eigenvectors for the matrices we have just constructed. Excel does not provide such functions. Fortunately, there is an open source Excel macro library that provides this and much more. You can download a copy here. You can install this file as an add-in, or you can open it as a second workbook in addition to the PCA-IrisData workbook.

This professional quality library implements several different algorithms for the calculations we want, but we shall not review the differences here. We will use the QR algorithm, which gets its name from the technique of factoring a matrix into the product of an orthogonal matrix conventionally called Q and an upper triangular matrix usually called R. The QR algorithm is quite satisfactory for many commonly encountered eigenvalue problems.

In Step 3 in the workbook we get a feel for the iterative nature of the algorithm. If you click “Reset” you will see the original matrix in graphical form, and each clicking of the Iterate button takes you a bit closer to the eigenvalues, which will lie on the diagonal. Only in textbooks do the other values actually equal zero; in the real world, we settle for pretty close. Even after the bar chart no longer seems to change, you can still see the off-diagonal elements get closer to zero

Eigenvalues

We can calculate the eigenvalues of the correlation matrix as follows:

{=TRANSPOSE(MEigenvalQR(E6:H9))}

The TRANSPOSE is not necessary for the calculation, it’s simply conventional to show the eigenvalues as column headings.

Once we have the eigenvalues, a second array function will calculate an eigenvector for each of the eigenvalues.

=Meigenvec(CorrelationMatrixCopy,K5)

Our results look like this:

Scaled Results
The scaled version of the analysis is generally preferable.

It’s conventional to show the largest eigenvalue in the first column, but we made no provision for that. The largest eigenvalue, 2.91849, is in the fourth column. The eigenvector in that column therefore serves as the greatest single portion of the variation among iris species in the four measurements of interest.

This is what we would have obtained had we not scaled the data. Note that the contribution of the largest eigenvalue is exaggerated when compared to the other eigenvalues in the unscaled version.

Unscaled results
The less desirable unscaled version of the results is the default for the R function prcomp( ).

Conclusion

Principal Component Analysis (PCA), is easier to perform in applications such as R, but there are also some pitfalls, as the R function prcomp does not scales the data values by default. With a little extra effort, PCA can be performed in Excel, but the greatest benefit in doing so is not the PCA, but the greater insight that hands-on experience provides.

Type to search blog.learningtree.com

Do you mean "" ?

Sorry, no results were found for your query.

Please check your spelling and try your search again.