# How to cross-validate PCA, clustering, and matrix decomposition models

Contributed by Alex Williams

Cross-validation is a fundamental paradigm in modern data analysis. However, it is largely applied to supervised settings, such as regression and classification. Here, the procedure is simple: fit your model on, say, 90% of the data (the training set), and evaluate its performance on the remaining 10% (the test set). However, this idea does not easily extend to other unsupervised methods, such as dimensionality reduction methods or clustering.

TL;DR I cover how cross-validation is a somewhat tricky problem for matrix factorization models (including PCA & clustering as special cases) and provide some Python code snippets for fitting these models with held out data.

# Solving Least-Squares Regression with Missing Data

Contributed by Alex Williams

I recently got interested in figuring out how to perform cross-validation on PCA and other matrix factorization models. The way I chose to solve the cross-validation problem (see my other post) revealed another interesting problem: how to fit a linear regression model with missing dependent variables. Since I did not find too many existing resources on this material, I decided to briefly document what I learned in this blog post.

# Everything you did and didn't know about PCA

Contributed by Alex Williams

Many scientists are familiar with organizing and handling data in 2D tables. For example, we might record the mRNA expression level of $p$ genes in $n$ tissue samples. We might store these data in a $n \times p$ matrix, where each row corresponds to a sample, and each column corresponds to a gene. Principle components analysis (PCA) is a standard way to reduce the dimension $p$ (which can be quite large) to something more manageable.

While it is quite common for biologists to apply PCA to their data, it is less common for them to really understand the mechanics and assumptions implicit in this analysis. Opening up the black box on a statistical technique is worthwhile in and of itself, but the real reason I'm motivated to write this is the number of seriously cool and super useful extensions/variations of PCA (e.g., Non-negative matrix factorization, Sparse PCA, Tensor Decompositions), which will have a growing impact on modern neuroscience and biology. I want to blog about techniques of this flavor for the next few posts.

# Highlights of NIPS2015

Contributed by Alex Williams

I was warned that NIPS is an overwhelming conference, but I didn’t listen because I’ve gotten used to SfN, which is several times larger. But for what NIPS lacks in size (nearly 4,000 attendees, still no joke) it more than makes up for in it’s energy. It feels like I haven’t talked about anything other than statistics and machine learning for the last 7 days, and I don’t even remember what a good night’s sleep feels like anymore. I’m writing this up on the bus home, physically and emotionally defeated. But my boss told me to consolidate some brief notes from the conference, so here is my attempt.

# Clustering is hard, except when it's not

Contributed by Alex Williams

The previous two posts (part 1, part 2) on clustering have been somewhat depressing and pessimistic. However, the reality is that scientists use simple clustering heuristics all the time, and often find interpretable results. What gives? Is the theoretical hardness of clustering flawed? Or have we just been deluding ourselves? Have we been fooled into believing results that are in some sense fundamentally flawed?

This post will explore a more optimistic possibility, which has been referred to as the “Clustering is Only Difficult When It Does Not Matter” hypothesis. Proponents argue that, while we can construct worst-case scenarios that cause algorithms to fail, clustering techniques work very well in practice because real-world datasets often have characteristic structure that more-or-less guarantees the success of these algorithms. Put differently, Daniely et al. (2012) say that “clustering is easy, otherwise it is pointless” — whenever clustering fails, it is probably because the data in question were not amenable to clustering in the first place.

# Is clustering mathematically impossible?

Contributed by Alex Williams

In the previous post, we saw intuitive reasons why clustering is a hard,[1] and maybe even ill-defined, problem. In practice, we are often stuck using heuristics that can sometimes perform quite badly when their assumptions are violated (see No free lunch theorem). Is there a mathematical way of expressing all of these difficulties? This post will cover some theoretical results of Kleinberg (2002) related to this question.

Notation. Suppose we have a set of $N$ datapoints $x^{(1)}, x^{(2)}, …, x^{(N)}$. A clustering function produces a partition (i.e. a set of clusters), based on the pairwise distances between datapoints. The distance between two points $x^{(i)}$ and $x^{(j)}$ is given by $d(x^{(i)},x^{(j)})$, where $d$ is the distance function. We could choose different ways to measure distance,[2] for simplicity you can imagine we are using Euclidean distance, $\sqrt{ (x^{(i)}-x^{(j)}) \cdot (x^{(i)}-x^{(j)})}$.

### An axiomatic approach to clustering

There are many possible clustering functions we could come up with. Some are stupid — randomly split the data into two groups — and others are useful in practice. We would like to precisely define what it means for a clustering function to be “useful in practice.”

Kleinberg (2002) proposed that the ideal clustering function would achieve three properties: scale-invariance, consistency, richness. The idea is that these principles should align with your intuitive notion of what a “good clustering function” is: