In the first installment, we indicated that the primary reason to do a principal component analysis (PCA) in Excel was to increase our own understanding. If your goal is the PCA itself, a better choice would be R, MATLAB, or a similar tool. Now that we have the PCA results, what exactly are those results telling us? To understand this, we must appreciate eigenvectors.
What we now call “eigenvectors” were first studied in the early 19th century. Among other things, eigenvectors proved valuable in the study of quadratic forms. Although the mathematics is very interesting, we will consider only the ellipse, and only briefly.
Here we see an ellipse whose axes do not line up with the “conventional” X and Y axes. The equation for an ellipse can be written in matrix form, and when we do so, something interesting emerges: the eigenvectors of the matrix are aligned with the major and minor axes of the ellipse, and the eigenvalues determine the lengths of those axes. (In PCA, where the matrix in question is a covariance matrix, the eigenvector associated with the largest eigenvalue is aligned with the major axis of the data cloud.) When dealing with an ellipse mathematically, the eigenvectors form the basis for a new coordinate system that is easier, one can even say better, than the original X and Y coordinates.
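The alignment is easy to verify numerically. Here is a short Python sketch, using an illustrative symmetric matrix of our own choosing (not a matrix from the workbook), for an ellipse written as the quadratic form x^T A x = 1:

```python
import numpy as np

# Quadratic form x^T A x = 1 for a tilted ellipse. The matrix below is an
# arbitrary symmetric example chosen for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is the decomposition for symmetric matrices; it returns the
# eigenvalues in ascending order, eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# The eigenvectors are orthogonal unit vectors aligned with the ellipse's
# axes. For this form, each semi-axis has length 1/sqrt(eigenvalue), so the
# smaller eigenvalue of A belongs to the longer (major) axis.
semi_axes = 1.0 / np.sqrt(eigenvalues)

print(eigenvalues)   # [1. 3.]
print(semi_axes)     # [1.0, 0.577...]
```

Rotating the coordinate system so its axes run along those two eigenvectors turns the tilted ellipse into an axis-aligned one, which is the sense in which the new coordinates are “easier.”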
Eigenvectors come up in statistics, pure mathematics, engineering, quantum mechanics: virtually all fields of science and technology. Just as the eigenvectors of the ellipse show the way to a better coordinate system, so do the eigenvectors of virtually any system point us to a different perspective from which to consider our problem, one that is likely to be more natural and direct.
We will consider only the first row of iris data. We can think of the values for each column as the “distance” in each of the four “directions” sepal.length, sepal.width, petal.length, and petal.width. The row then can be represented as a vector:
-0.90 sepal.length + 1.01 sepal.width + -1.34 petal.length + -1.32 petal.width
(Remember that we scaled the data for our analysis in Part I. The scaling is on the Step1 sheet of the accompanying workbook.)
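For readers who want to reproduce the scaling outside Excel, here is a sketch in Python. The means and standard deviations below are the well-known statistics of Fisher's 150-row iris data set, rounded; small differences from the article's figures come from that rounding:

```python
import numpy as np

# First row of the iris data in its original units:
# sepal.length, sepal.width, petal.length, petal.width
row1 = np.array([5.1, 3.5, 1.4, 0.2])

# Column means and standard deviations of the full 150-row data set
# (rounded; the workbook computes these on the Step1 sheet).
means = np.array([5.843, 3.057, 3.758, 1.199])
stds  = np.array([0.828, 0.436, 1.765, 0.762])

# Standardize: subtract the column mean, divide by the column std.
scaled = (row1 - means) / stds
print(np.round(scaled, 2))   # close to the article's -0.90, 1.01, -1.34, -1.32
```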
What we want to do is express the row 1 data in a new “iris space” that spans four dimensions like the original data, but whose coordinate system might be preferable. Each of these new dimensions, of course, is aligned with an eigenvector. In the following equation, the vertical vectors contain the coordinates of each eigenvector.
For ease of readability, the coordinates of each eigenvector have been rounded to two decimal places. Remember from Part I that we have not sorted the eigenvectors into descending sequence, as is customary for many statistical packages; the last vector, with the as-yet unknown coefficient x4, is associated with the greatest eigenvalue and is therefore of primary interest to us.
Solving for x1, x2, x3, and x4 is simply a matter of solving a system of linear equations.
The accompanying workbook illustrates how the values of x1, x2, x3, and x4 can be calculated using the inverse of the eigenvector matrix. In general, calculating a matrix inverse is not a particularly good way to solve a single system of linear equations. But in this case we are solving 150 systems, one for each of the 150 iris data rows, so it is worth calculating the inverse once and reusing it 150 times.
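The compute-once, reuse-150-times idea can be sketched in Python. The symmetric matrix and data rows below are stand-ins of our own invention, since the article's actual numbers live in the workbook, not in the text:

```python
import numpy as np

# Stand-in for the analysis: eigenvectors of an arbitrary 4x4 symmetric
# matrix, stored as the columns of V.
S = np.array([[4.0, 2.0, 0.6, 0.4],
              [2.0, 3.0, 0.5, 0.3],
              [0.6, 0.5, 2.0, 0.2],
              [0.4, 0.3, 0.2, 1.0]])
_, V = np.linalg.eigh(S)

# Compute the inverse once. (For a symmetric matrix the eigenvectors are
# orthonormal, so V_inv equals V.T; the workbook inverts explicitly.)
V_inv = np.linalg.inv(V)

# Stand-in for the 150 scaled iris rows.
rows = np.random.default_rng(0).normal(size=(150, 4))

# ...and reuse it for every row: each x solves V x = r.
coords = rows @ V_inv.T

# Check the first row: multiplying the solution by V recovers the data.
assert np.allclose(V @ coords[0], rows[0])
```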
On the “Interpretation” sheet of the accompanying workbook, we calculate the coordinates for the first iris data row by a little high-school algebra: multiplying both sides of the equation by the matrix inverse (in green).
In the results, we see that the left-hand side cancels out, leaving the identity matrix with ones along the diagonal. In your high-school textbook, the off-diagonal elements were zero. In the real world, it is unrealistic to expect them to be zero; they are simply very, very small.
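The same effect appears in any floating-point environment, not just Excel. A small Python illustration, with an arbitrary matrix of our own choosing:

```python
import numpy as np

# Any well-conditioned matrix will do for the demonstration.
M = np.array([[2.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.5]])

# In exact arithmetic this is the identity matrix.
I_approx = np.linalg.inv(M) @ M
print(I_approx)

# The diagonal entries come out as 1 to machine precision; the off-diagonal
# entries are not exactly 0, just tiny (on the order of 1e-16).
off_diag = I_approx - np.diag(np.diag(I_approx))
print(np.abs(off_diag).max())
```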
The column in orange represents the coefficients for the first iris row expressed in the coordinates of our new system. Note that the coefficients for x3 and x4 are greater in absolute value than the coefficients for x1 and x2. Recall that x3 and x4 are the eigenvectors with the second-greatest and greatest eigenvalues, respectively.
Purists will point out that the illustrative calculation of the principal components does not use the preferred algorithm. They are right, of course. The workbook was created to clarify exactly what is going on in a principal component analysis and what the numbers actually mean, not to concern ourselves with performance and precision. Had this been a real PCA, the eigenvalues and their eigenvectors would be calculated using a method known as singular value decomposition, or SVD, which suffers less roundoff error than the matrix analysis used here. For comparing a bunch of flowers, this makes no difference at all.
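The two routes really do agree, up to roundoff. A Python sketch on synthetic data (random stand-in rows, not the iris measurements) compares the covariance-matrix eigenvalues with those recovered from an SVD of the centered data itself:

```python
import numpy as np

# Synthetic stand-in for a 150-row, 4-column data set.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
X = X - X.mean(axis=0)               # center each column

# Route 1: eigen-decomposition of the covariance matrix (what the
# workbook mimics).
cov = (X.T @ X) / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)

# Route 2: SVD of the centered data matrix (the numerically preferred
# route). Squared singular values, divided by n-1, are the eigenvalues;
# SVD avoids ever forming X.T @ X, which squares the condition number.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_eigvals = s**2 / (len(X) - 1)

print(np.sort(eigvals))
print(np.sort(svd_eigvals))          # same values, up to roundoff
```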
We started out with four numbers for each observation, and now we have four different numbers for each observation. All four numbers are necessary to represent perfectly the original data. What we hope for is that in the new coordinate system fewer numbers may suffice, representing the data not perfectly, but well enough.
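That hope can be made concrete in a few lines of Python. Below, synthetic four-column data is built with one deliberately dominant direction (our own construction, not the iris data); keeping only the single strongest principal component then reproduces the data well, though not perfectly:

```python
import numpy as np

# Synthetic data with one dominant direction: column standard deviations
# of roughly 3, 1, 0.3, and 0.1.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4)) @ np.diag([3.0, 1.0, 0.3, 0.1])
X = X - X.mean(axis=0)

# Principal directions from an SVD of the centered data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

k = 1
scores = X @ Vt[:k].T      # one number per observation instead of four
X_approx = scores @ Vt[:k] # reconstruction from that single number

# Relative reconstruction error: well below 1, because most of the
# variance lives along the first principal component.
err = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(round(err, 2))
```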
In the world of data, PCA is often applied to problems where we have data of high dimensionality and we would like to categorize it effectively, but using fewer dimensions. For example, if we were automating the sorting of fish from today’s catch, a video camera over the conveyor belt could gather many features for each passing fish, but we would like to condense that data into a few numbers that tell us whether it’s a cod or a tuna.
In our iris example, we would like to identify criteria to distinguish between the three iris species represented in the data, Setosa, Versicolor, and Virginica.
Here is a box plot showing the coefficients of the single most important principal component for each of the three species of iris:
We see that along this new “dimension” defined by the principal component there is no overlap between setosa and either of the other species. The overlap between versicolor and virginica is minimal, but not negligible.
What we have accomplished is to define a single number that can be used to distinguish iris species. Of course, in the real world we are generally not trying to reduce a set of 4 numbers to a single number, but rather, for example, to reduce 50 dimensions to 3 or 4. The resulting set of numbers might permit a quick identification of a good investment or perhaps identify a potentially defective part in an automated video inspection process.
Whether the problem is in economic theory, mathematics, or quantum physics, eigenvectors represent a new coordinate system that may make a problem more approachable. In statistics, as illustrated with principal component analysis or PCA, eigenvectors, selected by the size of their eigenvalues, can reduce the number of variables that must be considered as our analysis proceeds. As the number of potential variables grows with “big data,” this reduction in dimensionality will become a requirement for many analyses.