How to cross-validate PCA, clustering, and matrix decomposition models
26 Feb 2018TL;DR I cover how cross-validation is a somewhat tricky problem for matrix factorization models (including PCA & clustering as special cases) and provide some Python code snippets for fitting these models with held out data.
Cross-validation in Linear Regression
Cross-validation is a fundamental paradigm in modern data analysis. However, it is largely applied to supervised settings, such as regression and classification. Here, the procedure is simple: fit your model on, say, 90% of the data (the training set), and evaluate its performance on the remaining 10% (the test set). However, this idea does not easily extend to other unsupervised methods, such as dimensionality reduction methods or clustering.
It is easiest to see why visually. Take a simple linear regression problem:
Here, is a matrix, $\mathbf{x}$ is vector with $n$ elements, is a vector containing $m$ elements. This model has $n$ parameters (the elements of ) and datapoints. Rows of $\mathbf{A}$ correspond to independent/predictor variables, and elements of correspond to dependent variables.
The basic idea behind cross-validation (and related techniques, like bootstrapping) is to leave out datapoints and quantify the effect on the model. We can leave out rows of $\mathbf{A}$ and corresponding elements of $\mathbf{b}$. Critically, this leaves the length of $\mathbf{x}$ unchanged — there are still $n$ variables to predict, but the number of datapoints $m$ is smaller.
Procedure to hold out data for linear regression. Note that $\mathbf{x}$ does not change in length. |
To be explicit: let $\mathbf{A}_\text{tr}$ and $\mathbf{b}_\text{tr}$ respectively denote the training set of independent and dependent variables. Then fitting our model amounts to:
The beauty is that this reduces to the same optimization problem we started with (see equation 1). Furthermore, we know how to solve this particular problem very efficiently (see least-squares). After fitting the model, we can evaluate its performance on the held-out datapoints (i.e. the test set), which we denote $\mathbf{A}_\text{te}$ and $\mathbf{b}_\text{te}$. In the end we obtain an estimate of the generalization error of our model as $\lVert \mathbf{A}_\text{te} \hat{\mathbf{x}} - \mathbf{b}_\text{te} \lVert^2$.
This procedure is easily adapted to nonlinear regression models, of course.
There are important details about when and whether this estimate of generalization error is good — if you use cross-validation for model comparison (described below in a bit), then you’ll want to split your data into three sets: training set / validation set / test set. If you select your model based on the test set, then you will generally underestimate your generalization error. You will underestimate it by more the more models you compare. I’ll ignore these (important) details for the rest of the post and refer you to Chapter 7 in Elements of Statistical Learning for more discussion.
Cross-validation in PCA
So what’s the problem with cross-validating PCA? I like to think of PCA as the following optimization problem:
This is much like linear regression, except we are optimizing over both $\mathbf{A}$ and $\mathbf{X}$ (which have been renamed to $\mathbf{U}$ and $\mathbf{V}^T$ to avoid confusion between the two models). Here, $\mathbf{Y}$ is a $m \times n$ matrix of data. In large-scale studies both $m$ and $n$ can be very high-dimensional and we may seek a simple low-dimensional linear model. This model is captured by $\mathbf{U}$ which is a tall-skinny matrix and $\mathbf{V}^T$ which is a short-fat matrix. Concretely, lets define $\mathbf{U}$ to me an $m \times r$ matrix and define $\mathbf{V}^T$ to be a $r \times n$ matrix, and choose $r$ to be less than $m$ and $n$. Then our model estimate $\hat{\mathbf{Y}} = \mathbf{U} \mathbf{V}^T$ is a rank-$r$ matrix
Some of you are probably grumpy that the above problem is not “really PCA” because PCA places an additional orthogonality constraint on the columns of $\mathbf{U}$ and $\mathbf{V}$. For a more careful discussion about the connection between PCA and equation 2 see my other post on PCA, this talk/tutorial I gave, and Madeleine Udell’s thesis work. Those resources also explain how nonnegative matrix factorization, clustering, and many other unsupervised models are also closely related to equation 2. Thus, our discussion of cross-validation will quickly generalize to these other very interesting models.
Once you are on board with the matrix factorization framework, you’ll see why cross-validation is tricky. Ultimately, our problem now looks like this:
Matrix Factorization Model. In keeping with figure 1, I've colored the model parameters (here, $\mathbf{U}$ and $\mathbf{V}$) in orange, while the data (here, $\mathbf{Y}$) is in blue. We optimize over $\mathbf{U}$ and $\mathbf{V}$ to minimize reconstruction error with respect to $\mathbf{Y}$ |
Ok, so how exactly should we hold out data in this setting? You might think we could leave out rows of $\mathbf{Y}$, however this would mean that you would have to leave out the corresponding row of $\mathbf{U}$. Thus, you couldn’t fit all of your model parameters! Likewise, leaving out a column of $\mathbf{A}$ would mean that you’d have to leave out a column of $\mathbf{V}^T$.
Not so great ideas for cross-validating matrix factorization. |
Maybe you only care about evaluating held-out/test error on our estimate of $\mathbf{V}$. That is, you don’t care about cross-validating $\mathbf{U}$ — you only care about cross-validating $\mathbf{V}$. Thus, you might suggest a two-step procedure in which we leave out rows of $\mathbf{Y}$ and fit $\mathbf{V}$ along with a subset of the rows of $\mathbf{U}$ on this training set. Then, to evaluate the performance of $\mathbf{V}$, we bend the rules of cross-validation ever so slightly and use the test set (held out rows) to fill out our estimate of $\mathbf{U}$.
Even this is not a good idea! It violates a core tenet of cross-validation that you don’t get to touch the the held out data. Your estimate of $\mathbf{V}$ could be horribly overfit, but by fitting $\mathbf{U}$ to on the test data, you could set these rows to zero and essentially blunt the effects of overfitting. Note that we have to fit this held out row of $\mathbf{U}$ in order to evaluate model performance along the corresponding row in the data matrix $\mathbf{Y}$ (which is the whole point of cross-validation).
Problems with holding out a whole column (or row) of the data matrix are discussed in more detail by Bro et al. (2008) and Owen & Perry (2009). (Just in case you don’t believe me.)
In a moment, I’ll describe a very simple cross-validation procedure that draws a connection between matrix factorization models and the well-studied matrix completion problem. But I should mention at the outset that there is much more detailed published work on this topic, which I touch on briefly at the end. Some of these articles discuss ways of leaving out entire rows and columns of the data matrix, but those procedures require a bit more care than what we will focus on.
Why cross-validate PCA (and related methods)?
Before jumping into a solution, let’s remind ourselves why cross-validation is great and why it would be great to apply it to PCA. In the case of regression and other supervised learning techniques, the goal of cross-validation is to monitor overfitting and calibrate hyperparameters. If we have a large number of regression parameters (i.e., $\mathbf{x}$ in equation 1 is a very long vector) then we’d commonly add regularization to the model. To take a very simple example, consider LASSO regression:
which will tend to produce an parameter estimate, $\hat{\mathbf{x}}$, that is sparse (containing many zeros). The scalar hyperparameter $\lambda > 0$ tunes how sparse our parameter estimate will be. If $\lambda$ is too large, then $\hat{\mathbf{x}}$ will very sparse and do a poor job of predicting $\mathbf{y}$. If $\lambda$ is too small, then $\hat{\mathbf{x}}$ will be not sparse at all, and our model could be overfit.
We can use cross-validation to estimate a good value for $\lambda$, by finding the value which minimizes the error of our model on the held-out test data. This procedure for model selection is machine learning 101 (see Chap. 7 in ESL). The basic picture looks like this:
Cross-validation schematic for LASSO dashed vertical line denotes the best value for $\lambda$, which achieves lowest error on the held-out test set. |
PCA also has an important hyperparameter that people worry about — the number of components in the model. People have published a lot of papers on this (e.g. here, here, & here). Similarly, many clustering models require the user to choose the number of clusters prior to fitting the model. Choosing the number of clusters has also received a lot of attention. Viewed from the perspective of matrix factorization, these are the exact same problem – i.e. how to choose $r$, the width of $\mathbf{U}$ and the height of $\mathbf{V}^T$.
A cross-validation procedure for matrix decomposition
Without further ado, here are some plots that demonstrate how cross-validation can help you choose the number of components in PCA, NMF, and K-means clustering. In each example, I generated data from a ground truth model with $r=4$ and then added noise. That is, for PCA, the correct number of PCs was 4. For k-means clustering, the correct number of clusters was 4. From the test error, you can see that all models begin to overfit when $r>4$. From the training data alone, the cutoff is maybe not so clear.
Cross-validation plots for some matrix decomposition models. Note that the k-means implementation is a bit noisy so I averaged across multiple optimization runs. The code to generate these plots is posted here. |
To make these plots I used a “speckled” holdout pattern (Wold, 1978). For simplicity and demonstration, I left out a small number of elements of $\mathbf{Y}$ at random:
A good solution is to hold out data at random. Importantly, we can still fit all parameters in $\mathbf{U}$ and $\mathbf{V}$ as long as no column or row of $\mathbf{Y}$ is fully removed. Intuitively, we should keep at least $r$ observations in each row or column. |
Formally, we define a binary matrix $\mathbf{M}$, which acts as a mask over our data. That is every element in $\mathbf{M}$ is either $m_{ij} = 0$, indicating a left out datapoint, or $m_{ij}=1$, indicating a datapoint in the training set. We are left with the following optimization problem:
Where $ \circ $ denotes the Hadamard product. After solving the above optimization problem, we can then evaluate the error of our model on the held-out datapoints as: $\left \lVert (1-\mathbf{M}) \circ \left ( \mathbf{U} \mathbf{V}^T - \mathbf{Y} \right ) \right \lVert_F^2$.
We need not choose $\mathbf{M}$ to be random! We can select whatever holdout pattern we like, and indeed to ensure that all columns and rows are left out at equal rates we can leave out pseudo-diagonals of the matrix. In the interest of brevity and simplicity we will sweep these choices under the rug, but see Fig.1 in Wold (1978) for more discussion.
So how do we solve this optimization problem? As I mentioned before, equation 3 amounts to the well-studied low-rank matrix completion problem. However many papers that I’ve looked at do not provide algorithms for solving the problem in the particular form that we care about. In particular, they tend to optimize over a single matrix, call it $\hat{\mathbf{Y}}$, rather than jointly optimize over a factorized representation where $\hat{\mathbf{Y}} = \mathbf{U} \mathbf{V}^T$ (which is what we’d like to do). For example, Candes and Recht (2008) consider:
where $\lVert \cdot \lVert_*$ denotes the nuclear norm of a matrix. Minimizing the nuclear norm is a good surrogate for minimizing the rank of $\hat{\mathbf{Y}}$ directly, which is more computationally challenging. The advantage of this approach is that the optimization problem is convex and therefore comes with really nice guarantees and mathematical analysis.
The problem with this is that it solves the matrix completion problem without giving us $\mathbf{U}$ and $\mathbf{V}$, which are of direct interest to us in the context of PCA and clustering. Furthermore, I don’t think this approach scales particularly well to very large matrices or tensor datasets since you are optimizing over a very large number of variables $mn$, as opposed to $mr + nr$ variables in the factorized representation.
A very simple and effective procedure for fitting matrix decomposition is the alternating minimization algorithm, which I’ve blogged and talked about in the past. For the case of PCA, this amounts to:
Algorithm: Alternating minimization:
1 initialize $\mathbf{U}$ randomly
2 while not converged
3 $\mathbf{V} \leftarrow \underset{\tilde{\mathbf{V}}}{\text{argmin}} \quad \left \lVert \mathbf{M} \circ \left ( \mathbf{U} \tilde{\mathbf{V}}^T - \mathbf{Y} \right ) \right \lVert_F^2$
4 $\mathbf{U} \leftarrow \underset{\tilde{\mathbf{U}}}{\text{argmin}} \quad \left \lVert \mathbf{M} \circ \left ( \tilde{\mathbf{U}} \mathbf{V}^T - \mathbf{Y} \right ) \right \lVert_F^2$
5 end while
While other papers (e.g. Hardt 2013) use alternating minimization to solve matrix completion problems, I haven’t come across many good sources that explain how to implement this idea in practice (please email me if you find good ones!). So I’ll give some guidance below.
Implementation Notes: PCA
Each parameter update (lines 3 and 4) of the alternating minimization algorithm boils down to a least-squares problem with missing data. In this interest of brevity, I derived the solution to least squares under missing data in a separate post. The answer is kinda cool and involves tensors! The basic solution we arrive at is this:
def censored_lstsq(A, B, M):
"""Least squares of M * (AX - B) over X, where M is a masking matrix.
"""
rhs = np.dot(A.T, M * B).T[:,:,None] # n x r x 1 tensor
T = np.matmul(A.T[None,:,:], M.T[:,:,None] * A[None,:,:])
return np.linalg.solve(T, rhs).reshape(A.shape[1], B.shape[1])
Wow — only three lines thanks to the magic of numpy broadcasting! And we can use this subroutine to cross-validate PCA as follows:
def cv_pca(data, rank, p_holdout=.1):
"""Fit PCA while holding out a fraction of the dataset.
"""
# create masking matrix
M = np.random.rand(data.shape) < p_holdout
# fit pca
U = np.random.randn(data.shape[0], rank)
for itr in range(20):
Vt = censored_lstsq(U, data, M)
U = censored_lstsq(Vt.T, data.T, M.T).T
# We could orthogonalize U and Vt and then rotate to align
# with directions of maximal variance, but we won't bother.
# return result and test/train error
resid = np.dot(U, Vt) - data
train_err = np.mean(resid[M]**2)
test_err = np.mean(resid[~M]**2)
return train_err, test_err
A few disclaimers about the above code, which is only meant to give the gist of a solution:
- In the spirit of brevity, I don’t check for convergence. Obviously, in real code, you should either monitor the reconstruction error or the change in
U
andVt
and break the loop when these converge. - The
censored_lstsq
function is also not optimized. See discussion in my other post. - If there are too many missing values (e.g. an entire row or column is left out) then the
censored_lstsq
function will fail due to a singular/non-invertible matrix.
Implementation Notes: NMF and K-means
NMF is very similar to PCA when viewed from the perspective of matrix factorization. Each subproblem becomes a nonnegative least-squares problem with missing data in the dependent variable. Nonnegative least-squares is a very well-studied problem, and the methods discussed for classic least squares can be generalized without that much effort. Jingu Kim has a really nice Python library called nonnegfac-python that handles these kind of things.
Another possibility is to modify the classic multiplicative update rules for NMF. These rules are originally:
Under missing data, these rules become:
K-means clustering was a bit more finicky (as you can see I averaged over many random initializations in the figure above). But the basic cross-validation principles are the same. See Chi et al. (2016) for fitting K-means clustering with missing data. There is also a really short and simple implementation of this idea in a Stack Overflow answer.
Conclusions and References
Everything here is a brief overview of a topic that has been studied more deeply in the literature. The take-home message is that cross-validation is a bit tricky for unsupervised learning, but there is a simple approach that generalizes to many methods. I just showed you three (PCA, NMF, and $k$-means clustering), but the basic idea likely applies elsewhere.
How can the ideas I covered here be improved upon? From a computational standpoint, leaving out data at random made our lives difficult. A nicer choice may have been to hold out some of the data from a subset of rows and columns. This means that our holdout pattern partitions the data matrix into four blocks, and without loss of generality we can rearrange the rows and columns so that the upper left block is held out as shown below:
Bi-cross-validation holdout pattern. . |
This holdout pattern was considered by Owen & Perry (2009) who attribute the basic idea to Gabriel (although he only proposed holding out a single entry of the matrix, rather than a block). A neat thing about this approach is that one can fit a PCA or NMF model to the (fully observed) bottom right block, $\mathbf{Y}_{(2,2)}$, and then use that model to predict the (held-out test) block $\mathbf{Y}_{(1,1)}$ based on the off-diagonal blocks. Understanding how this works takes a bit of effort, but the take-home message is that their approach offers significant speed ups to what I outlined in this post.
Fu & Perry (2017) recently extended the above idea to K-means clustering. And I’m still digging through it but Perry’s PhD thesis seems like a good reference as well.
Predating the above work, there is a nice review of these topics by Bro et al. (2008). They show that many previous algorithms for cross-validating PCA don’t work that well in practice and aren’t very well-motivated. So be careful if you are reading older papers on this subject!