Most problems of interest to organizations are multivariate. They involve multiple issues that must be looked at simultaneously. For example, when evaluating sites for a new store, we need to consider factors like cost of land, proximity to transport and local competition.
The more issues (i.e. dimensions) we have to consider, the thornier the problems become. Many statistical techniques, including machine learning algorithms, are sensitive to the number of dimensions in a problem. In the big data era, high dimensionality can render a problem computationally intractable.
One of the advantages we, as humans, still have over computers is our ability to visually process information and identify patterns. However, once we go above two dimensions, it is difficult to display that information in a way that allows us to exploit our primeval talents.
Fortunately, there are ways of addressing the curse of dimensionality. Dimensionality reduction techniques, such as principal component analysis, allow us to simplify our problems considerably with limited loss of information.
Principal Component Analysis (PCA) is a statistical procedure that transforms a data set into a new data set of linearly uncorrelated variables, known as principal components. The basic idea is that the components are ordered so that each one captures as much of the remaining variance (information) in the data as possible.
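As a minimal sketch of this idea (using made-up synthetic data, not the wine example introduced later), we can verify in R that the component scores produced by prcomp are uncorrelated:

```r
# Synthetic two-variable data set with strongly correlated variables
set.seed(42)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.3)

pca <- prcomp(data.frame(x, y), center = TRUE, scale. = TRUE)

# The original variables are correlated; the principal components are not
cor(x, y)             # noticeably far from zero
round(cor(pca$x), 8)  # off-diagonal entries are (numerically) zero
```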
Consider the following data set.
This can be transformed, retaining the relationships between the observations, into the following data set, effectively turning a two-dimensional data set into a one-dimensional one.
We halved the dimensionality of the data set without sacrificing any of the important information.
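The same effect can be sketched with hypothetical data: two variables that lie close to a straight line collapse onto a single component with almost no loss of variance.

```r
# Hypothetical two-dimensional data lying near a line
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 2 * x + rnorm(50, sd = 0.2)

pca <- prcomp(data.frame(x, y), center = TRUE, scale. = TRUE)

# Share of total variance captured by the first component (close to 1)
summary(pca)$importance["Proportion of Variance", "PC1"]

# The one-dimensional replacement for the original two columns
reduced <- pca$x[, "PC1"]
```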
Let’s look at how we can conduct PCA using R. We’ll use the Wine Data Set from the UCI Machine Learning Repository. This data set contains the results of chemical analysis of 178 wines from three cultivars. The observations record the quantities of 13 constituents found in each wine.
The wine dataset is included in the HDclassif package, so let’s install that and examine the dataset.
install.packages("HDclassif")
library(HDclassif)
data(wine)
str(wine)
Unfortunately the chemical constituents are named V1-V13. Let’s fix that.
names(wine) <- c("Type", "Alcohol", "Malic acid", "Ash",
                 "Alcalinity of ash", "Magnesium", "Total phenols",
                 "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
                 "Color intensity", "Hue",
                 "OD280/OD315 of diluted wines", "Proline")
We can use the prcomp function from the stats package to do the PCA.
# stats is a base package that R loads by default, so no install.packages() call is needed
wine_pca <- prcomp(wine[, -1], center = TRUE, scale. = TRUE)  # drop Type: it is the cultivar label, not a measurement
summary(wine_pca)
This will display a table containing the following statistical data:
                          PC1     PC2     PC3  ...
Cumulative Proportion   0.362  0.5541  0.6653  ...
What this tells us is that the first two components account for over 55% of the variance in the entire data set. While it would be difficult to justify basing an entire analysis on 55% of the available information, it is interesting that we can capture that 55% using just two of the thirteen dimensions, around 15% of the original data.
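For reference, the cumulative proportions in that table can be recomputed directly from the component standard deviations (this assumes the wine_pca object created above):

```r
# Each component's share of the variance is sdev^2 / sum(sdev^2)
prop <- wine_pca$sdev^2 / sum(wine_pca$sdev^2)
round(cumsum(prop)[1:3], 4)  # should match the Cumulative Proportion row

# A scree plot shows the same information graphically
screeplot(wine_pca, type = "lines")
```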
Let’s visualize the data set using the first two principal components.

biplot(wine_pca)
This produces a scatter plot of all the wines. Wines that are close to each other should have similar chemical compositions. We can see wines 4 and 19 at the top left. Let’s compare them.
A quick review will confirm that they are indeed quite similar.
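One way to carry out that review (the row indices come from the labels on the biplot):

```r
# Side-by-side comparison of the two wines' original measurements
wine[c(4, 19), ]

# Standardized values make the comparison easier across constituents
round(scale(wine[, -1])[c(4, 19), ], 2)
```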
What about the red arrows (vectors) in the diagram? We can hide the wines to make it easier to view these vectors.
biplot(wine_pca, xlabs = rep("", nrow(wine)))
The vectors show the relationship between the original variables and the principal components. For example, “Alcalinity of ash” points in much the same direction as PC1. The length of a vector represents the strength of the correlation between the original variable and the principal components.
Variables that are represented by vectors pointing in similar directions have similar meanings. So, “Proanthocyanins” and “Total phenols” seem to represent similar concepts in the context of our significantly reduced data set.
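The directions of those vectors come from the loadings matrix, which we can inspect directly (assuming the wine_pca object from earlier):

```r
# Loadings of the original variables on the first two components
round(wine_pca$rotation[, 1:2], 3)
# Variables with similar PC1/PC2 loadings point in similar directions on the biplot
```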
Obviously, this data would require more detailed analysis, but PCA and visualization using biplots are useful tools for getting to grips with high-dimensional data.
If you work with data, Learning Tree has a number of courses that may interest you.