Principal Component Analysis

Cornell CS 4/5780

Spring 2022

Whereas k-means clustering sought to partition the data into homogeneous subgroups, principal component analysis (PCA) will seek to find, if it exists, low-dimensional structure in the data set $\{x_i\}_{i=1}^n$ (as before, $x_i \in \mathbb{R}^d$). This problem can be recast in several equivalent ways and we will see a few perspectives in these notes. Accordingly, PCA has many uses including data compression (analogous to building concise summaries of data points), item classification, data visualization, and more.

First, we will consider a simple mathematical model for data that directly motivates the PCA problem. Assume there exists some unit vector $u \in \mathbb{R}^d$ such that $x_i \approx \alpha_i u$ for some scalar $\alpha_i$.[1] While $x_i$ is high dimensional (assuming $d$ is large), there is a sense in which it could be well approximated by a much smaller number of "features." In fact, given $u$ (which is the same for all $x_i$) we could well approximate our data using only $n$ numbers, the $\alpha_i$. More concisely, we say that the $x_i$ approximately lie in a subspace of dimension 1. Moreover, assuming the variability in the $\alpha_i$ is larger than whatever is hiding in the approximate equality above, just knowing the coefficients $\alpha_i$ would also explain most of the variability in the data: if we want to understand how different two points $x_i$ and $x_j$ are, we can simply compare $\alpha_i$ and $\alpha_j$. This is illustrated in fig. 9, where we see two dimensional data that approximately lies in a one dimensional subspace.

Figure 9: An example where two dimensional data approximately lies in a one dimensional subspace.
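To make this model concrete, here is a minimal NumPy sketch (not part of the original notes) that generates synthetic two dimensional data of the form $x_i \approx \alpha_i u$; the names `u`, `alpha`, `noise`, and `X` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2

u = np.array([2.0, 1.0])
u /= np.linalg.norm(u)                      # unit vector spanning the subspace

alpha = rng.normal(scale=3.0, size=n)       # coefficients with large variability
noise = rng.normal(scale=0.2, size=(n, d))  # small deviation from the subspace

X = alpha[:, None] * u + noise              # rows of X are the data points x_i
```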

More generally, we will describe PCA from two perspectives. First, we will view PCA as finding a low-dimensional representation of the data that captures most of the interesting behavior. Here, "interesting" will be defined as variability. This is analogous to computing "composite" features (i.e., linear combinations of the entries in each $x_i$) that explain most of the variability in the data. Second, we will see how PCA can be thought of as providing a low-dimensional approximation to the data that is the best possible given the dimension (i.e., if we only allow ourselves $k$ dimensions to represent $d$-dimensional data, we will do so in the best possible way).

Centering the data

Before proceeding with a mathematical formulation of PCA there is one important pre-processing step we need to perform on the data. Typically, in unsupervised learning we are interested in understanding relationships between data points and not necessarily bulk properties of the data.[2] Taking the best-approximation view of PCA highlights a key problem with working with raw data points. In particular, if the data has a sufficiently large mean, i.e., $\mu = \frac{1}{n}\sum_i x_i$ is sufficiently far from zero, the best approximation of each data point is roughly $\mu$. An example of this is seen in fig. 10.

Figure 10: For data with a non-zero mean the best approximation is achieved using a vector similar to the mean; in contrast, most of the interesting behavior in the data may occur in completely different directions.

While finding $\mu$ tells us something about a data set, we already know how to compute it and that is not the goal of PCA. Therefore, to actually understand the relationship between features we would like to remove this complication. Fortunately, this can be accomplished by centering our data before applying PCA. Specifically, we let $$\hat{x}_i = x_i - \mu,$$
where $\mu = \frac{1}{n}\sum_i x_i$. We now simply work with the centered feature vectors $\{\hat{x}_i\}_{i=1}^n$ and will do so throughout the remainder of these notes. Note that in some settings it may also be important to scale the entries of $\hat{x}_i$ (e.g., if they correspond to measurements that have vastly different scales).
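As a small illustration, the centering step might look like the following NumPy sketch (not from the notes); here `X` is assumed to be an $n \times d$ array whose rows are the data points.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 10))  # toy data with a non-zero mean

mu = X.mean(axis=0)        # mu = (1/n) * sum_i x_i
X_hat = X - mu             # centered data points \hat{x}_i (as rows)

# Optional: rescale features whose measurements have very different scales.
# X_hat = X_hat / X_hat.std(axis=0)
```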

Maximizing the variance

The first goal we will use to describe PCA is finding a small set of composite features that capture most of the variability in the data. To illustrate this point, we will first consider finding a single composite feature that captures as much of the variability in the data set as possible. Mathematically, this corresponds to finding a vector $\phi \in \mathbb{R}^d$ such that the sample variance of the scalars $z_i = \phi^T \hat{x}_i$ is as large as possible.[3] By convention we require that $\|\phi\|_2 = 1$; otherwise we could artificially inflate the variance by simply increasing the magnitude of the entries in $\phi$. Using the definition of sample variance we can now formally define the first principal component of a data set.

First principal component: The first principal component of a data set $\{x_i\}_{i=1}^n$ is the vector $\phi \in \mathbb{R}^d$ that solves $$\max_{\|\phi\|_2 = 1} \; \frac{1}{n}\sum_{i=1}^n \left(\phi^T \hat{x}_i\right)^2. \tag{4}$$

When we discussed k-means above we framed it as an optimization problem and then discussed how actually solving that problem was hard. In this case we do not have such a complication: the problem in eq. 4 has a known solution. It is useful to introduce a bit more notation to state the solution. Specifically, we will consider the data matrix $$\hat{X} = \begin{bmatrix} | & & | \\ \hat{x}_1 & \cdots & \hat{x}_n \\ | & & | \end{bmatrix}.$$ This allows us to rephrase eq. 4 as $$\max_{\|\phi\|_2 = 1} \; \frac{1}{n}\left\|\hat{X}^T \phi\right\|_2^2.$$ In other words, $\phi$ is the unit vector that the matrix $\hat{X}^T$ maps to a vector of the largest possible norm.

At this point we need to (re)introduce one of the most powerful matrix decompositions: the singular value decomposition (SVD).[4] To simplify the presentation we make the reasonable assumption that $n \geq d$.[5] In particular, we can always decompose the matrix $\hat{X}$ as $$\hat{X} = U \Sigma V^T,$$ where $U$ is a $d \times d$ matrix with orthonormal columns, $V$ is an $n \times d$ matrix with orthonormal columns, and $\Sigma$ is a $d \times d$ diagonal matrix with $\Sigma_{ii} = \sigma_i \geq 0$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d$. Letting $$U = \begin{bmatrix} | & & | \\ u_1 & \cdots & u_d \\ | & & | \end{bmatrix}, \quad V = \begin{bmatrix} | & & | \\ v_1 & \cdots & v_d \\ | & & | \end{bmatrix}, \quad \text{and} \quad \Sigma = \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_d \end{bmatrix},$$ we call the $u_i$ the left singular vectors, the $v_i$ the right singular vectors, and the $\sigma_i$ the singular values. While methods for computing the SVD are beyond the scope of this class, efficient algorithms exist that cost $\mathcal{O}(nd^2)$ and, in cases where $n$ and $d$ are large, it is possible to efficiently compute only the first few singular values and vectors (e.g., for $i = 1, \ldots, k$).
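As a rough illustration of how this decomposition might be obtained in practice, here is a NumPy sketch (not part of the notes); it assumes the centered data points are stored as the columns of `X_hat`, matching the definition of $\hat{X}$ above.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 200
X_hat = rng.normal(size=(d, n))             # stand-in for the centered data matrix
X_hat -= X_hat.mean(axis=1, keepdims=True)  # ensure the columns have zero mean

# Thin SVD: U is d x d, sigma holds sigma_1 >= ... >= sigma_d, Vt is d x n.
U, sigma, Vt = np.linalg.svd(X_hat, full_matrices=False)

phi_1 = U[:, 0]                                      # first left singular vector
print(np.allclose(X_hat, U @ np.diag(sigma) @ Vt))   # the factors reconstruct X_hat
```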

Under the mild assumption that $\sigma_1 > \sigma_2$, the SVD makes the solution to eq. 4 apparent: $\phi = u_1$.[6] Any $\phi$ can be written as $\phi = \sum_{i=1}^d a_i u_i$ where $\sum_i a_i^2 = 1$ (because we want $\|\phi\|_2 = 1$). We now observe that $$\left\|\hat{X}^T \phi\right\|_2^2 = \left\|V \Sigma U^T \phi\right\|_2^2 = \left\|\sum_{i=1}^d (\sigma_i a_i) v_i\right\|_2^2 = \sum_{i=1}^d (\sigma_i a_i)^2,$$ so the quantity is maximized by setting $a_1 = 1$ and $a_i = 0$ for $i \neq 1$.
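One way to sanity-check this claim numerically is to compare the objective in eq. 4 at $\phi = u_1$ against many random unit vectors. A hedged sketch (not from the notes; `X_hat` again holds centered data points as columns):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 200
X_hat = rng.normal(size=(d, n))
X_hat -= X_hat.mean(axis=1, keepdims=True)

def objective(phi):
    """Sample variance of the projections phi^T x_hat_i (the objective of eq. 4)."""
    return np.mean((X_hat.T @ phi) ** 2)

U, sigma, Vt = np.linalg.svd(X_hat, full_matrices=False)
u1 = U[:, 0]

random_dirs = rng.normal(size=(1000, d))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

best_random = max(objective(phi) for phi in random_dirs)
print(objective(u1) >= best_random)                   # expected: True
print(np.isclose(objective(u1), sigma[0] ** 2 / n))   # variance equals sigma_1^2 / n
```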

So, the first left singular vector of $\hat{X}$ gives us the first principal component of the data. What about finding additional directions? A natural way to set up this problem is to look for the composite feature with the next most variability. Informally, we could look for $\psi$ such that $y_i = \psi^T \hat{x}_i$ has maximal sample variance. However, as stated we would simply get the first principal component again. Therefore, we need to force the second principal component to be distinct from the first. This is accomplished by forcing them to be orthogonal, i.e., $\psi^T \phi = 0$. While this may seem like a complicated constraint, it is easy to handle here. In fact, the SVD still reveals the solution: the second principal component is $\psi = u_2$. Fig. 11 illustrates how the first two principal components look for a stylized data set. We see that they reveal directions in which the data varies significantly.

Figure 11: Two principal components for a simple data set.

More generally, we may want to consider the top $k$ principal components, in other words the $k$ directions in which the data varies the most. We denote the $\ell$-th principal component by $\phi_\ell$ and, to enforce that we find different directions, we require that $\phi_\ell^T \phi_{\ell'} = 0$ for $\ell \neq \ell'$. We also order them by the sample variance of $\phi_\ell^T \hat{x}_i$, i.e., $\phi_\ell^T \hat{x}_i$ has greater variability than $\phi_{\ell'}^T \hat{x}_i$ for $\ell < \ell'$.

While we will not present a full proof, the SVD actually gives us all the principal components of the data set $\{x_i\}_{i=1}^n$.

Principal components: The $\ell$-th principal component of the data set $\{x_i\}_{i=1}^n$ is denoted $\phi_\ell$ and satisfies $$\phi_\ell = u_\ell,$$
where $\hat{X} = U \Sigma V^T$ is the SVD of $\hat{X}$.
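In code, extracting the top $k$ principal components therefore amounts to taking the first $k$ left singular vectors. A minimal sketch (not from the notes), again assuming the centered data points form the columns of `X_hat`:

```python
import numpy as np

def top_principal_components(X_hat, k):
    """Return the first k principal components (as columns) of the centered
    data matrix X_hat, whose columns are the centered data points."""
    U, sigma, Vt = np.linalg.svd(X_hat, full_matrices=False)
    return U[:, :k]          # phi_1, ..., phi_k as columns

# Example usage on toy data.
rng = np.random.default_rng(4)
X_hat = rng.normal(size=(10, 200))
X_hat -= X_hat.mean(axis=1, keepdims=True)

Phi = top_principal_components(X_hat, k=3)
print(np.allclose(Phi.T @ Phi, np.eye(3)))   # the components are orthonormal
```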

Explaining variability in the data

Our discussion of PCA started with considering variability in the data. Therefore, we would like to consider how the principal components explain variability in the data. First, by using the SVD of $\hat{X}$ we have already computed the sample variance for each principal component. In fact, we can show that $$\mathrm{Var}(\phi_\ell^T \hat{x}_i) = \sigma_\ell^2 / n.$$ In other words, the singular values reveal the sample variances of the $\phi_\ell^T \hat{x}_i$.

Since we may want to think of principal components as providing a low-dimensional representation of the data, a reasonable question to ask is how many we should compute/use for downstream tasks. One way to address this question is to try and capture a sufficient fraction of the variability in the data. In other words, we pick $k$ such that $\phi_1, \ldots, \phi_k$ capture most of the variability in the data. To understand this we have to understand what the total variability of the data is. Fortunately, this can be easily computed as $$\sum_{j=1}^d \sum_{i=1}^n \left(\hat{x}_i(j)\right)^2.$$ In other words, we simply sum up the squares of all the entries in all the data points.

What is true, though less apparent, is that the total variability in the data is also encoded in the singular values of $\hat{X}$, since $$\sum_{j=1}^d \sum_{i=1}^n \left(\hat{x}_i(j)\right)^2 = \sum_{i=1}^d \sigma_i^2.$$ Similarly, we can encode the total variability of the data captured by the first $k$ principal components as $\sum_{i=1}^k \sigma_i^2$. (This is a consequence of the orthogonality of the principal components.) Therefore, the proportion of the total variance explained by the first $k$ principal components is $$\frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^d \sigma_i^2}.$$ So, we can simply pick $k$ to explain whatever fraction of the variance we want. Or, similar to k-means, we can pick $k$ by identifying when we have diminishing returns in explaining variance by adding more principal components, i.e., looking for a knee in the plot of singular values.
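A small sketch of how one might pick $k$ with this criterion (illustrative only, not from the notes; `X_hat` again holds centered data points as columns and the 90% threshold is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
X_hat = rng.normal(size=(10, 200))
X_hat -= X_hat.mean(axis=1, keepdims=True)

_, sigma, _ = np.linalg.svd(X_hat, full_matrices=False)

explained = np.cumsum(sigma**2) / np.sum(sigma**2)   # fraction explained by top k
k = int(np.searchsorted(explained, 0.90)) + 1        # smallest k explaining >= 90%
print(k, explained[k - 1])
```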

Best approximations

We now briefly touch on how PCA is also solving a best approximation problem for the centered data $\hat{x}_i$. Specifically, say we want to approximate every data point $\hat{x}_i$ by a point $w_i$ in a fixed $k$-dimensional subspace. Which subspace should we pick to minimize $\sum_{i=1}^n \|\hat{x}_i - w_i\|_2^2$? Formally, this can be stated as finding a $d \times k$ matrix $W$ with orthonormal columns that solves $$\min_{\substack{W \in \mathbb{R}^{d \times k} \\ W^T W = I}} \; \sum_{i=1}^n \left\|\hat{x}_i - W\left(W^T \hat{x}_i\right)\right\|_2^2. \tag{5}$$ This is because if we force $w_i = W z_i$ for some $z_i \in \mathbb{R}^k$ (i.e., $w_i$ lies in the span of the columns of $W$), the choice of $z_i$ that minimizes $\|\hat{x}_i - W z_i\|_2$ is $z_i = W^T \hat{x}_i$.

While this is starting to become a bit predictable, the SVD again yields the solution to this problem. In particular, the problem specified in eq. 5 is solved by setting the columns of $W$ to be the first $k$ left singular vectors of $\hat{X}$ or, equivalently, the first $k$ principal components. In other words, if we want to project our data $\hat{x}_i$ onto a $k$-dimensional subspace, the best choice is the subspace defined by the first $k$ principal components. Notably, this fact is tied to our definition of best as the subspace for which the sum of squared distances between the data points and their projections is minimized. The geometry of this is illustrated in fig. 12.

Figure 12: The best one dimensional approximation to a two dimensional data set.
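To make the projection concrete, here is a hedged sketch (not from the notes) that forms the approximation $W(W^T \hat{x}_i)$ and checks that the first $k$ principal components give a smaller total squared error than a random orthonormal basis of the same dimension; the setup mirrors the earlier toy `X_hat`.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, k = 10, 200, 3
X_hat = rng.normal(size=(d, n))
X_hat -= X_hat.mean(axis=1, keepdims=True)

def total_squared_error(W):
    """Sum of squared distances between the x_hat_i and their projections W W^T x_hat_i."""
    return np.sum((X_hat - W @ (W.T @ X_hat)) ** 2)

U, _, _ = np.linalg.svd(X_hat, full_matrices=False)
W_pca = U[:, :k]                               # first k principal components

Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # random orthonormal basis for comparison
print(total_squared_error(W_pca) <= total_squared_error(Q))  # expected: True
```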

PCA for visualization

A common use of PCA is for data visualization. In particular, if we have high-dimensional data that is hard to visualize, we can sometimes see key features of the data by plotting its projection onto a few (1, 2, or 3) principal components. For example, using two principal components corresponds to forming a scatter plot of the points $(\phi_1^T \hat{x}_i, \phi_2^T \hat{x}_i)$. It is important to keep in mind that only a portion of the variability in the data is "included" in visualizations of this type, and any key information that is orthogonal to the first few principal components will be completely missed. Nevertheless, this way of looking at data can prove quite useful; see the class demo.
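A minimal sketch of such a visualization (illustrative only, not the class demo; uses matplotlib, and `X_hat` again stores centered data points as columns):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
X_hat = rng.normal(size=(50, 300))            # toy high-dimensional centered data
X_hat -= X_hat.mean(axis=1, keepdims=True)

U, _, _ = np.linalg.svd(X_hat, full_matrices=False)
Z = U[:, :2].T @ X_hat                        # rows are phi_1^T x_i and phi_2^T x_i

plt.scatter(Z[0], Z[1], s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```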


  1. Assume that the error in this approximation is much smaller than the variation in the coefficients $\alpha_i$.

  2. For example, k-means is shift invariant: if we add an arbitrary vector $w \in \mathbb{R}^d$ to every data point it does not change the outcome of our algorithm/analysis.

  3. We call this a composite feature because $z_i$ is a linear combination of the features (entries) in each data vector $\hat{x}_i$.

  4. This is often taught using eigenvectors instead. The connection is through the sample covariance matrix $\hat{\Sigma} = \hat{X}\hat{X}^T$. The left singular vectors we will use are equivalent to the eigenvectors of $\hat{\Sigma}$, and the eigenvalues of $\hat{\Sigma}$ are the squares of the singular values of $\hat{X}$.

  5. Everything works out fine if $n < d$, but we need to write $\min(n, d)$ in a bunch of places.

  6. If $\sigma_1 = \sigma_2$ the first principal component is not well defined.