Whereas k-means clustering sought to partition the data into homogeneous subgroups, principal component analysis (PCA) will seek to find, if it exists, low-dimensional structure in the data set (as before, the data are $x_1,\ldots,x_n \in \mathbb{R}^d$). This problem can be recast in several equivalent ways and we will see a few perspectives in these notes. Accordingly, PCA has many uses including data compression (analogous to building concise summaries of data points), item classification, data visualization, and more.
First, we will consider a simple mathematical model for data that directly motivates the PCA problem. Assume there exists some unit vector $v \in \mathbb{R}^d$ such that $x_j \approx \theta_j v$ for some scalar $\theta_j$.^12 While each $x_j$ is high dimensional (assuming $d$ is large), there is a sense in which it could be well approximated by a much smaller number of “features.” In fact, given $v$ (which is the same for all $j$) we could well approximate our data using only $n$ numbers: the coefficients $\theta_1,\ldots,\theta_n$. More concisely, we say that the $x_j$ approximately lie in a subspace of dimension 1. Moreover, assuming the variability in the $\theta_j$ is larger than whatever is hiding in the “approximately equals” above, just knowing the coefficients would also explain most of the variability in the data: if we want to understand how different various $x_j$ are we could simply compare the corresponding $\theta_j$. This is illustrated in fig. 9, where we see two dimensional data that approximately lies in a one dimensional subspace.
Figure 9: An example where two dimensional data approximately lies in a one dimensional subspace.
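To make this model concrete, the following sketch (our own illustration, not part of the class demo) generates synthetic data of the form $x_j = \theta_j v + \text{noise}$ and checks that the single coefficient $\theta_j$, recovered as $v^T x_j$, explains essentially all of the variation; the variable names and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                      # ambient dimension and number of points

v = rng.standard_normal(d)
v /= np.linalg.norm(v)              # unit vector defining the 1-D subspace
theta = rng.standard_normal(n) * 5  # coefficients with substantial variability
noise = rng.standard_normal((d, n)) * 0.1

X = np.outer(v, theta) + noise      # columns are the data points x_j ≈ theta_j * v

# The single number theta_j (recovered as v^T x_j) explains almost all variability.
recovered = v @ X
print(np.corrcoef(theta, recovered)[0, 1])  # close to 1
```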
More generally, we will describe PCA from two perspectives. First, we will view PCA as finding a low-dimensional representation of the data that captures most of the interesting behavior. Here, “interesting” will be defined as variability. This is analogous to computing “composite” features (i.e., linear combinations of the entries in each $x_j$) that explain most of the variability in the data. Second, we will see how PCA can be thought of as providing a low-dimensional approximation to the data that is the best possible given the dimension (i.e., if we only allow ourselves $k$ dimensions to represent $d$ dimensional data we will do so in the best possible way).
Centering the data
Before proceeding with a mathematical formulation of PCA there is one important pre-processing step we need to perform on the data. Typically, in unsupervised learning we are interested in understanding relationships between data points and not necessarily bulk properties of the data.^13 Taking the best approximation view of PCA, this highlights a key problem of working with the raw data points. In particular, if the data has a sufficiently large mean, i.e., the sample mean of the data is sufficiently far from zero, the best one dimensional approximation of each data point is roughly a multiple of that mean. An example of this is seen in fig. 10.
Figure 10: For data with a non-zero mean the best approximation is achieved using a vector similar to the mean; in contrast, most of the interesting behavior in the data may occur in completely different directions.
While finding the mean tells us something about a data set, we already know how to compute it and doing so is not the goal of PCA. Therefore, to actually understand the relationships between features we would like to remove this complication. Fortunately, this can be accomplished by centering our data before applying PCA. Specifically, we let $$\tilde{x}_j = x_j - \bar{x} \quad \text{for } j = 1,\ldots,n,$$
where $\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$ is the sample mean of the data. We now simply work with the centered feature vectors $\tilde{x}_1,\ldots,\tilde{x}_n$ and will do so throughout the remainder of these notes. Note that in some settings it may also be important to scale the entries of the $\tilde{x}_j$ (e.g., if they correspond to measurements that have vastly different scales).
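As a minimal sketch, assuming the raw data points are stored as the columns of a NumPy array `X_raw` of shape `(d, n)`, centering amounts to subtracting the column-wise mean:

```python
import numpy as np

def center(X_raw):
    """Subtract the sample mean from every column (data point)."""
    xbar = X_raw.mean(axis=1, keepdims=True)   # sample mean, shape (d, 1)
    return X_raw - xbar, xbar

# Example usage with random data standing in for a real data set.
rng = np.random.default_rng(1)
X_raw = rng.standard_normal((5, 100)) + 10.0   # data with a large mean
X, xbar = center(X_raw)
print(np.allclose(X.mean(axis=1), 0))          # True: columns now have zero mean
```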
Maximizing the variance
The first goal we will use to describe PCA is finding a small set of composite features that capture most of the variability in the data. To illustrate this point, we will first consider finding a single composite feature that captures as much of the variability in the data set as possible. Mathematically, this corresponds to finding a vector $v \in \mathbb{R}^d$ such that the sample variance of the scalars $v^T\tilde{x}_1,\ldots,v^T\tilde{x}_n$ is as large as possible.^14 By convention we require that $\|v\|_2 = 1$; otherwise, we could artificially inflate the variance by simply increasing the magnitude of the entries in $v$. Using the definition of sample variance we can now formally define the first principal component of a data set.
First principal component: The first principal component of a data set is the vector $w_1 \in \mathbb{R}^d$ that solves $$w_1 = \operatorname*{argmax}_{v \in \mathbb{R}^d,\ \|v\|_2 = 1} \; \frac{1}{n-1}\sum_{j=1}^{n} \left(v^T \tilde{x}_j\right)^2. \tag{4}$$ (Since the data are centered, the sample mean of the $v^T\tilde{x}_j$ is zero, so this objective is exactly their sample variance.)
When we discussed k-means above we framed it as an optimization problem and then discussed how actually solving that problem was hard. In this case we do not have such a complication: the problem in eq. 4 has a known solution. It is useful to introduce a bit more notation to state the solution. Specifically, we will consider the data matrix $$X = \begin{bmatrix} \tilde{x}_1 & \tilde{x}_2 & \cdots & \tilde{x}_n \end{bmatrix} \in \mathbb{R}^{d \times n},$$ whose columns are the centered data points. This allows us to rephrase eq. 4 as $$w_1 = \operatorname*{argmax}_{v \in \mathbb{R}^d,\ \|v\|_2 = 1} \; \frac{1}{n-1}\left\|X^T v\right\|_2^2.$$ In other words, $w_1$ is the unit vector that the matrix $X^T$ maps to the vector of largest possible norm.
At this point we need to (re)introduce one of the most powerful matrix decompositions: the singular value decomposition (SVD).^15 To simplify the presentation we make the reasonable assumption that $n \geq d$.^16 In particular, we can always decompose the matrix $X$ as $$X = U \Sigma V^T,$$ where $U \in \mathbb{R}^{d \times d}$ is a matrix with orthonormal columns, $V \in \mathbb{R}^{n \times d}$ is a matrix with orthonormal columns, and $\Sigma \in \mathbb{R}^{d \times d}$ is a diagonal matrix with $\Sigma_{ii} = \sigma_i$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d \geq 0$. Letting $u_i$ and $v_i$ denote the $i^{th}$ columns of $U$ and $V$, we call the $u_i$ the left singular vectors, the $v_i$ the right singular vectors, and the $\sigma_i$ the singular values. While methods for computing the SVD are beyond the scope of this class, efficient algorithms exist that cost $\mathcal{O}(nd^2)$ and, in cases where $n$ and $d$ are large, it is possible to efficiently compute only the first few singular values and vectors (e.g., $\sigma_i$, $u_i$, and $v_i$ for $i = 1,\ldots,k$ with $k \ll d$).
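For concreteness, here is one way to compute this decomposition with NumPy; this is a sketch assuming the centered data matrix $X$ has shape `(d, n)` with $n \geq d$ (the synthetic data below just stands in for a real data set):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 100
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)      # center the columns

# Thin SVD: U is (d, d), s holds sigma_1 >= ... >= sigma_d, Vt is (d, n).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, s.shape, Vt.shape)
print(np.allclose(X, U @ np.diag(s) @ Vt))   # True: X = U Sigma V^T
```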
Under the mild assumption that $\sigma_1 > \sigma_2$, once we have the SVD the solution to eq. 4 becomes apparent: $w_1 = u_1$.^17 To see this, note that any $v$ with $\|v\|_2 = 1$ can be written as $v = \sum_{i=1}^{d} \alpha_i u_i$ where $\sum_{i=1}^{d} \alpha_i^2 = 1$ (because we want $\|v\|_2 = 1$). We now observe that $$\left\|X^T v\right\|_2^2 = \sum_{i=1}^{d} \sigma_i^2 \alpha_i^2,$$ so the quantity is maximized by setting $\alpha_1 = 1$ and $\alpha_i = 0$ for $i > 1$, i.e., by taking $v = u_1$.
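The following sketch (our own numerical check) verifies that no random unit vector produces a larger sample variance of the projections than the first left singular vector $u_1$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 200
X = rng.standard_normal((d, n)) * np.arange(1, d + 1)[:, None]  # anisotropic data
X -= X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w1 = U[:, 0]                                   # candidate first principal component

def proj_variance(v):
    return np.var(X.T @ v, ddof=1)             # sample variance of v^T x_j

best_random = max(proj_variance(u / np.linalg.norm(u))
                  for u in rng.standard_normal((1000, d)))
print(proj_variance(w1) >= best_random)        # True
```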
So, the first left singular vector of $X$ gives us the first principal component of the data. What about finding additional directions? A natural way to set up this problem is to look for the composite feature with the next most variability. Informally, we could look for $w_2$ such that the sample variance of the $w_2^T\tilde{x}_j$ is maximal. However, as stated we would simply get the first principal component again. Therefore, we need to force the second principal component to be distinct from the first. This is accomplished by forcing them to be orthogonal, i.e., requiring $w_2^T w_1 = 0$. While this may seem like a complex constraint it is actually not here. In fact, the SVD still reveals the solution: the second principal component is $w_2 = u_2$. Fig. 11 illustrates how the first two principal components look for a stylized data set. We see that they reveal directions in which the data varies significantly.
Figure 11: Two principal components for a simple data set.
More generally, we may want to consider the top $k$ principal components, in other words, the $k$ directions in which the data varies the most. We denote the $i^{th}$ principal component as $w_i$ and, to enforce that we find different directions, we require that $w_i^T w_\ell = 0$ for $i \neq \ell$. We also order them by the sample variance of the projections $w_i^T\tilde{x}_j$, i.e., the $w_i^T\tilde{x}_j$ have greater variability than the $w_\ell^T\tilde{x}_j$ for $i < \ell$.
While we will not present a full proof, the SVD actually gives us all the principal components of the data set:
Principal components: The $i^{th}$ principal component of the data set is denoted $w_i$ and satisfies $$w_i = u_i,$$
where $X = U\Sigma V^T$ is the SVD of the centered data matrix.
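In code, extracting the principal components therefore amounts to taking the leading left singular vectors; a minimal sketch, where the helper name `top_principal_components` is our own:

```python
import numpy as np

def top_principal_components(X, k):
    """Return the first k principal components of centered data X (shape (d, n)).

    The principal components are the first k left singular vectors of X.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k]

# Usage: W has orthonormal columns w_1, ..., w_k.
rng = np.random.default_rng(4)
X = rng.standard_normal((10, 300))
X -= X.mean(axis=1, keepdims=True)
W, sigmas = top_principal_components(X, k=3)
print(np.allclose(W.T @ W, np.eye(3)))   # True: the w_i are orthonormal
```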
Explaining variability in the data
Our discussion of PCA started with considering variability in the data. Therefore, we would like to consider how the principal components explain variability in the data. First, by using the SVD of $X$ we have already computed the sample variance for each principal component. In fact, we can show that the sample variance of the projections $w_i^T\tilde{x}_1,\ldots,w_i^T\tilde{x}_n$ is $\sigma_i^2/(n-1)$. In other words, the singular values reveal the sample variance of the composite features defined by the principal components.
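Under this sample variance convention (with the $1/(n-1)$ factor), the identity is easy to verify numerically on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 6, 500
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
for i in range(d):
    sample_var = np.var(X.T @ U[:, i], ddof=1)        # sample variance of w_i^T x_j
    print(np.isclose(sample_var, s[i]**2 / (n - 1)))  # True for every i
```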
Since we may want to think of principal components as providing a low-dimensional representation of the data, a reasonable question to ask is how many we should compute/use for downstream tasks. One way to address this question is to try and capture a sufficient fraction of the variability in the data. In other words, we pick $k$ such that $w_1,\ldots,w_k$ capture most of the variability in the data. To understand this we have to understand what the total variability of the data is. Fortunately, this can be easily computed as $$\frac{1}{n-1}\sum_{j=1}^{n} \|\tilde{x}_j\|_2^2.$$ In other words, up to the $1/(n-1)$ scaling, we simply sum up the squares of all the entries in all the (centered) data points.
What is true, though less apparent, is that the total variability in the data is also encoded in the singular values of $X$, since $$\sum_{j=1}^{n} \|\tilde{x}_j\|_2^2 = \sum_{i=1}^{d} \sigma_i^2.$$ Similarly, we can express the total variability of the data captured by the first $k$ principal components as $\frac{1}{n-1}\sum_{i=1}^{k} \sigma_i^2$. (This is a consequence of the orthogonality of the principal components.) Therefore, the proportion of total variance explained by the first $k$ principal components is $$\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{d} \sigma_i^2}.$$ So, we can simply pick $k$ to explain whatever fraction of the variance we want. Or, similar to k-means, we can pick $k$ by identifying when we have diminishing returns in explaining variance by adding more principal components, i.e., by looking for a knee in the plot of singular values.
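A sketch of how one might compute the proportion of variance explained and pick $k$ to reach a desired fraction (the 90% threshold below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 400)) * np.linspace(5, 0.1, 20)[:, None]
X -= X.mean(axis=1, keepdims=True)

s = np.linalg.svd(X, compute_uv=False)          # singular values sigma_1 >= ...
explained = np.cumsum(s**2) / np.sum(s**2)      # fraction explained by first k PCs

k = int(np.searchsorted(explained, 0.9) + 1)    # smallest k explaining >= 90%
print(explained[:5], k)
```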
Best approximations
We now briefly touch on how PCA also solves a best approximation problem for the centered data $\tilde{x}_1,\ldots,\tilde{x}_n$. Specifically, say we want to approximate every data point by a point in a fixed $k$ dimensional subspace. Which subspace should we pick to minimize the sum of squared approximation errors? Formally, this can be stated as finding a matrix $W \in \mathbb{R}^{d \times k}$ with orthonormal columns that solves $$\min_{W \in \mathbb{R}^{d \times k},\ W^T W = I} \; \sum_{j=1}^{n} \left\|\tilde{x}_j - W W^T \tilde{x}_j\right\|_2^2. \tag{5}$$ This is the right formulation because if we force $\tilde{x}_j \approx W\theta_j$ for some $\theta_j \in \mathbb{R}^k$ (i.e., we require the approximation of $\tilde{x}_j$ to lie in the span of the columns of $W$), the choice of $\theta_j$ that minimizes $\|\tilde{x}_j - W\theta_j\|_2$ is $\theta_j = W^T\tilde{x}_j$.
While this is starting to become a bit predictable, the SVD again yields the solution to this problem. In particular, the problem specified in eq. 5 is solved by setting the columns of $W$ to be the first $k$ left singular vectors of $X$ or, analogously, the first $k$ principal components. In other words, if we want to project our data onto a $k$ dimensional subspace the best choice is the subspace spanned by the first $k$ principal components. Notably, this fact is tied to our definition of best as the subspace where the sum of squared distances between the data points and their projections is minimized. The geometry of this is illustrated in fig. 12.
Figure 12: The best one dimensional approximation to a two dimensional data set.
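A sketch of the corresponding computation: with $W$ holding the first $k$ principal components, the best approximation of each centered data point is $WW^T\tilde{x}_j$, and the total squared error equals the sum of the discarded $\sigma_i^2$, which we can check numerically on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, k = 8, 300, 2
X = rng.standard_normal((d, n)) * np.linspace(4, 0.2, d)[:, None]
X -= X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :k]                         # first k principal components
X_proj = W @ (W.T @ X)               # best k-dimensional approximation of the data

err = np.sum((X - X_proj)**2)
print(np.isclose(err, np.sum(s[k:]**2)))   # True: error is the discarded sigma_i^2
```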
PCA for visualization
A common use of PCA is for data visualization. In particular, if we have high dimensional data that is hard to visualize we can sometimes see key features of the data by plotting its projection onto a few (1, 2, or 3) principal components. For example, if we use two principal components this corresponds to forming a scatter plot of the points $(w_1^T\tilde{x}_j, w_2^T\tilde{x}_j)$ for $j = 1,\ldots,n$. It is important to keep in mind that only a portion of the variability in the data is “included” in visualizations of this type, and any key information that is orthogonal to the first few principal components will be completely missed. Nevertheless, this way of looking at data can prove quite useful; see the class demo.
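A sketch of such a visualization with matplotlib, again on synthetic data standing in for a real data set:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
X = rng.standard_normal((30, 500)) * np.linspace(3, 0.1, 30)[:, None]
X -= X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = U[:, :2].T @ X                   # 2-D coordinates (w_1^T x_j, w_2^T x_j)

plt.scatter(Y[0], Y[1], s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```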
12. Assume that the error in this approximation is much smaller than the variation in the coefficients $\theta_j$.↩︎
13. For example, k-means is shift invariant: if we add an arbitrary vector to every data point it does not change the outcome of our algorithm/analysis.↩︎
14. We call this a composite feature because $v^T\tilde{x}_j$ is a linear combination of the features (entries) in each data vector $\tilde{x}_j$.↩︎
15. This is often taught using eigenvectors instead. The connection is through the sample covariance matrix $\frac{1}{n-1}XX^T$: the singular vectors we will use are equivalent to the eigenvectors of the sample covariance matrix, and the eigenvalues of $XX^T$ are the singular values of $X$ squared.↩︎
16. Everything works out fine if $n < d$, but we would need to write $\min(n,d)$ in a bunch of places.↩︎
17. If $\sigma_1 = \sigma_2$ the first principal component is not well defined.↩︎