# Everything you did and didn't know about PCA

Contributed by Alex Williams

Many scientists are familiar with organizing and handling data in 2D tables. For example, we might record the mRNA expression level of $p$ genes in $n$ tissue samples. We might store these data in an $n \times p$ matrix, where each row corresponds to a sample, and each column corresponds to a gene. Principal components analysis (PCA) is a standard way to reduce the dimension $p$ (which can be quite large) to something more manageable.
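To make this concrete, here is a minimal sketch (not from the post) of PCA on such a matrix via the singular value decomposition, using randomly generated data as a stand-in for real expression levels:

```python
import numpy as np

# Hypothetical data: expression of p genes in n tissue samples.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.normal(size=(n, p))

# Center each column (gene); PCA assumes mean-centered data.
Xc = X - X.mean(axis=0)

# SVD of the centered data: rows of Vt are the principal axes
# (directions in gene space), and U * S are the scores
# (coordinates of each sample along those axes).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                               # keep the top two components
scores = U[:, :k] * S[:k]           # n x k reduced representation
print(scores.shape)                 # (100, 2)

# Fraction of variance captured by each component.
var_explained = S**2 / np.sum(S**2)
```

The scores give each sample a low-dimensional coordinate, which is typically what gets plotted when biologists "run PCA" on their data.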

While it is quite common for biologists to apply PCA to their data, it is less common for them to really understand the mechanics and assumptions implicit in this analysis. Opening up the black box on a statistical technique is worthwhile in and of itself, but the real reason I'm motivated to write this is the number of seriously cool and super useful extensions/variations of PCA (e.g., Non-negative matrix factorization, Sparse PCA, Tensor Decompositions), which will have a growing impact on modern neuroscience and biology. I want to blog about techniques of this flavor for the next few posts.

# Highlights of NIPS2015

Contributed by Alex Williams

I was warned that NIPS is an overwhelming conference, but I didn’t listen because I’ve gotten used to SfN, which is several times larger. But what NIPS lacks in size (nearly 4,000 attendees, still no joke) it more than makes up for in its energy. It feels like I haven’t talked about anything other than statistics and machine learning for the last 7 days, and I don’t even remember what a good night’s sleep feels like anymore. I’m writing this up on the bus home, physically and emotionally defeated. But my boss told me to consolidate some brief notes from the conference, so here is my attempt.

# Clustering is hard, except when it's not

Contributed by Alex Williams

The previous two posts (part 1, part 2) on clustering have been somewhat depressing and pessimistic. However, the reality is that scientists use simple clustering heuristics all the time, and often find interpretable results. What gives? Is the theoretical hardness of clustering overstated? Or have we just been deluding ourselves, believing results that are in some sense fundamentally flawed?

This post will explore a more optimistic possibility, which has been referred to as the “Clustering is Only Difficult When It Does Not Matter” hypothesis. Proponents argue that, while we can construct worst-case scenarios that cause algorithms to fail, clustering techniques work very well in practice because real-world datasets often have characteristic structure that more-or-less guarantees the success of these algorithms. Put differently, Daniely et al. (2012) say that “clustering is easy, otherwise it is pointless” — whenever clustering fails, it is probably because the data in question were not amenable to clustering in the first place.

# Is clustering mathematically impossible?

Contributed by Alex Williams

In the previous post, we saw intuitive reasons why clustering is a hard,[1] and maybe even ill-defined, problem. In practice, we are often stuck using heuristics that can sometimes perform quite badly when their assumptions are violated (see No free lunch theorem). Is there a mathematical way of expressing all of these difficulties? This post will cover some theoretical results of Kleinberg (2002) related to this question.

Notation. Suppose we have a set of $N$ datapoints $x^{(1)}, x^{(2)}, …, x^{(N)}$. A clustering function produces a partition (i.e. a set of clusters), based on the pairwise distances between datapoints. The distance between two points $x^{(i)}$ and $x^{(j)}$ is given by $d(x^{(i)},x^{(j)})$, where $d$ is the distance function. We could choose different ways to measure distance,[2] but for simplicity you can imagine we are using Euclidean distance, $\sqrt{ (x^{(i)}-x^{(j)}) \cdot (x^{(i)}-x^{(j)})}$.
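In code, this setup is simple. Here's a sketch (my own toy example, not from the paper) of the Euclidean distance above, plus one possible clustering function: merge any two points whose distance falls below a fixed radius `r`.

```python
import numpy as np

# Euclidean distance between two datapoints, matching the formula above.
def d(xi, xj):
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.sqrt(diff @ diff)

# A clustering function, in Kleinberg's sense, maps the pairwise
# distances of N points to a partition of {1, ..., N}. Toy example:
# put two points in the same cluster whenever their distance is
# below a threshold r, then take connected components.
def threshold_clustering(points, r):
    N = len(points)
    labels = list(range(N))  # each point starts in its own cluster
    for i in range(N):
        for j in range(i + 1, N):
            if d(points[i], points[j]) < r:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return labels

pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(threshold_clustering(pts, r=2))  # [0, 0, 2, 2]
```

Note that the output depends on the pairwise distances alone, which is exactly the abstraction Kleinberg's framework works with.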

### An axiomatic approach to clustering

There are many possible clustering functions we could come up with. Some are stupid — randomly split the data into two groups — and others are useful in practice. We would like to precisely define what it means for a clustering function to be “useful in practice.”

Kleinberg (2002) proposed that the ideal clustering function would achieve three properties: scale-invariance, consistency, and richness. The idea is that these principles should align with your intuitive notion of what a “good clustering function” is:
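To make the first of these concrete, here is a small sketch (my own, with a made-up dataset) of a clustering rule that is scale-invariant: merge the two closest clusters until some target number $k$ remain. Multiplying all distances by a positive constant never changes which pair is closest at each step, so the resulting partition is unchanged.

```python
import numpy as np

def single_linkage(points, k):
    """Repeatedly merge the two closest clusters (closest = smallest
    distance between any pair of member points) until k clusters
    remain. This stopping rule is scale-invariant: scaling every
    distance by a constant alpha > 0 preserves the merge order."""
    pts = [np.asarray(p, dtype=float) for p in points]
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(np.linalg.norm(pts[i] - pts[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return sorted(tuple(sorted(c)) for c in clusters)

pts = [[0, 0], [0, 1], [5, 5], [5, 6]]
# Scaling the data by 10 leaves the partition unchanged.
assert single_linkage(pts, 2) == single_linkage(
    [[10 * x, 10 * y] for x, y in pts], 2)
```

By contrast, a rule that merges points closer than some fixed radius would fail this test, since rescaling the data changes which pairs fall under the threshold.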

# What is clustering and why is it hard?

Contributed by Alex Williams

I’ve been working on some clustering techniques to identify cell types from DNA methylation data. When you dive into the literature on clustering, two things become immediately apparent: first, clustering is fundamental to many scientific questions, and second, there is “distressingly little general theory” on how it works or how to apply it to your particular data.

This was surprising to me. I imagine that most biologists and neuroscientists come across k-means clustering, hierarchical clustering, and similar techniques all the time in papers related to their work. Given how commonplace these techniques are, one would think that we have a solid handle on how they work and what can go wrong.

This will be the first post in a short series on clustering techniques. I will try to explain why clustering is hard from a high-level, intuitive perspective. The next post will cover some more technical theoretical results. I’ll focus on Jon Kleinberg’s paper which precisely defines an ideal clustering function, but then proves that no such function exists and that there are inevitable tradeoffs that must be made. The final few posts will cover other theoretical work and some current projects of mine.

Contributed by Alex Williams

I’ve been searching for a good way to organize my thoughts/research interests for the past couple of years. This website/blog has gone through a few iterations as a result, but I feel like I’ve finally converged on something that works and is manageable for me to maintain.

I will update this post periodically as my research evolves and my interests shift. For now, I plan to write quick posts roughly once a month on a few different subjects/categories. Here is an outline:

## Tutorials:

• FORCE learning
• Independent Components Analysis

## Small stuff that I’m working on:

• Optimization and Matrix Factorization Tools in Julia

## Stuff I want to publish soon:

• Models of mRNA and protein transport in dendrites
• Clustering algorithms and factor analysis of DNA methylation patterns (and other epigenetic datasets)