Highlights of NIPS2015
14 Dec 2015
I was warned that NIPS is an overwhelming conference, but I didn’t listen because I’ve gotten used to SfN, which is several times larger. But what NIPS lacks in size (nearly 4,000 attendees, still no joke) it more than makes up for in its energy. It feels like I haven’t talked about anything other than statistics and machine learning for the last 7 days, and I don’t even remember what a good night’s sleep feels like anymore. I’m writing this up on the bus home, physically and emotionally defeated. But my boss told me to consolidate some brief notes from the conference, so here is my attempt.
High-level (probably misinformed) thoughts on deep learning
It is pretty much impossible for me to write this post without weighing in on the deep learning craze. From my perspective, it seems like we’ve pretty much solved pattern recognition modulo some edge cases. In fact, we’ve solved it insanely well in certain domains. Take DeepMind’s Atari-playing deep net. Without a doubt, it is an incredible feat of engineering. But does it really “learn” to play the games on a conceptual level?
Consider what would happen if you flipped a game like space invaders upside down, or even simply switched the red and blue color channels on the pixels. A human could probably adjust to these changes nearly instantaneously. The deep net would be completely confused and unable to play the game. Sure, you could retrain it, and you might even argue that relearning just requires changes to the early layers of the network. However, a naïve retraining procedure could potentially modify the weights in deeper layers, destroying useful high-level abstractions that the network had already learned.
The Brains, Minds, and Machines Symposium shed great light on these issues. For me, the talks by Josh Tenenbaum and Gary Marcus were particularly insightful. The basic idea I left with (from these talks and other conversations) was that deep learning is very well-suited for pattern recognition and dealing with high-dimensional inputs. But we need to combine this with other frameworks (for example, probabilistic inference and simulation) to solve many problems that humans solve with ease. Prior to NIPS I had the misconception that the Bayesian/probabilistic viewpoint couldn’t scale to hard problems. What I failed to realize was that many hard problems aren’t high-dimensional problems. Josh gave the example of inferring the position of a person’s occluded limb in a crowded photo (something deep nets can’t do… yet). Put simply, we don’t have that many limbs, and there aren’t that many possible configurations for them to be in, so having a cognitive engine that simulates all the possibilities is feasible.
The take-home message was that cognitive science is actually pretty awesome and (potentially) has a lot to offer. Coming from a molecular/cellular neurobiology research background, this was an incredibly refreshing perspective.
Neuroscience at NIPS
High-dimensional neural spike train analysis with generalized count linear dynamical systems.
Yuanjun Gao, Lars Büsing, Krishna V. Shenoy, John P. Cunningham
Extends previous work (Kulkarni & Paninski, 2007; Pfau et al., 2013) on extracting low-dimensional dynamics from network recordings. Previous work has assumed Poisson observations, which constrains the variance of the spike counts to equal their mean. Real spike counts are often over-dispersed (variance greater than the mean), so something like a negative binomial distribution may be more accurate. The authors show how to extend traditional GLMs to accommodate this class of models (and all others in the exponential family). They also apply their method to some interesting data to demonstrate the advantages of this approach.
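To make the over-dispersion point concrete, here is a minimal Python sketch (mine, not taken from the paper) that compares the Fano factor (variance divided by mean) of Poisson counts with that of negative binomial counts of the same mean. The mean count, the shape parameter `r`, and all helper names are illustrative assumptions; the negative binomial is drawn as a gamma-Poisson mixture, for which the variance is mean + mean²/r.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method (fine for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mean, r, rng):
    """Draw a negative binomial count as a Poisson whose rate is gamma distributed."""
    rate = rng.gammavariate(r, mean / r)  # shape r, scale mean / r, so E[rate] = mean
    return sample_poisson(rate, rng)


def fano_factor(counts):
    """Variance-to-mean ratio; equals 1 for an ideal Poisson process."""
    return statistics.pvariance(counts) / statistics.mean(counts)


rng = random.Random(0)
mean_count, r, n_trials = 5.0, 2.0, 20000  # assumed values, chosen only for illustration

poisson_counts = [sample_poisson(mean_count, rng) for _ in range(n_trials)]
negbin_counts = [sample_negative_binomial(mean_count, r, rng) for _ in range(n_trials)]

poisson_fano = fano_factor(poisson_counts)
negbin_fano = fano_factor(negbin_counts)
print(f"Poisson Fano factor:           {poisson_fano:.2f}")
print(f"Negative binomial Fano factor: {negbin_fano:.2f}")
```

With these assumed settings the Poisson Fano factor should sit near 1, while the negative binomial one should sit near 1 + mean/r, which is the kind of extra variability a Poisson observation model cannot express.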
Convolutional spike-triggered covariance analysis for neural subunit models
Anqi Wu, Il Memming Park, Jonathan W. Pillow
The latest in the quest to fit good phenomenological models of early sensory neurons. The authors examine how to efficiently fit the parameters of subunit models, in which an output neuron is activated by a layer of “subunits” with shifted linear filters and individual nonlinearities. Fitting these models is generally hard (see earlier work by Vintch et al., 2012). However, the authors show that it can be easy under certain assumptions. They derive an estimator based on the spike-triggered average and covariance; it seems to work well even when the assumptions are violated.
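For readers who haven’t met the two statistics that the estimator is built from, here is a rough Python sketch of computing a spike-triggered average (STA) and spike-triggered covariance (STC). It is not the authors’ convolutional estimator: it simulates a plain linear-nonlinear-Poisson neuron with no subunits, and the filter, offset, and function names are all made up for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method (fine for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_lnp(filter_weights, n_bins, rng, offset=-1.0):
    """Gaussian white-noise stimuli and Poisson counts with rate exp(w . x + offset)."""
    stimuli, counts = [], []
    for _ in range(n_bins):
        x = [rng.gauss(0.0, 1.0) for _ in filter_weights]
        drive = sum(w * xi for w, xi in zip(filter_weights, x)) + offset
        stimuli.append(x)
        counts.append(sample_poisson(math.exp(drive), rng))
    return stimuli, counts


def spike_triggered_moments(stimuli, counts):
    """Mean and covariance of the stimuli, weighting each stimulus by its spike count."""
    dim = len(stimuli[0])
    n_spikes = sum(counts)
    sta = [sum(c * x[j] for x, c in zip(stimuli, counts)) / n_spikes for j in range(dim)]
    stc = [[0.0] * dim for _ in range(dim)]
    for x, c in zip(stimuli, counts):
        if c == 0:
            continue
        dev = [xj - mj for xj, mj in zip(x, sta)]
        for i in range(dim):
            for j in range(dim):
                stc[i][j] += c * dev[i] * dev[j]
    stc = [[value / n_spikes for value in row] for row in stc]
    return sta, stc


rng = random.Random(1)
true_filter = [0.5, -0.3, 0.2]  # hypothetical filter, for illustration only
stimuli, counts = simulate_lnp(true_filter, 20000, rng)
sta, stc = spike_triggered_moments(stimuli, counts)
print("STA:", [round(v, 2) for v in sta])
print("STC diagonal:", [round(stc[i][i], 2) for i in range(len(sta))])
```

In this toy setting (unit-variance Gaussian stimulus, exponential nonlinearity) the spike-triggered stimuli are themselves Gaussian with the mean shifted by the filter, so the STA should land close to `true_filter` and the STC close to the identity. Subunit models are more interesting precisely because the STC then carries structure that the STA alone misses.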
Synaptic Sampling: A Bayesian Approach to Neural Network Plasticity and Rewiring
David Kappel, Stefan Habenschuss, Robert Legenstein, Wolfgang Maass
I really like the motivation of this poster: synapses are highly dynamic biological units that grow, retract, and change in size and strength. Despite this indisputable fact, almost all modeling work considers synaptic weights that are stable, noiseless, and deterministically updated by learning rules. The authors construct and examine a framework where the synaptic weights stochastically explore a probability distribution.
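As a loose illustration of what “stochastically exploring a probability distribution” can look like, here is a generic Langevin-dynamics sketch in Python. To be clear, this is a textbook sampler and not the authors’ plasticity rule: the Gaussian target over a single weight, the step size, and the function names are all assumptions I picked for the example.

```python
import math
import random


def langevin_samples(grad_log_p, theta0, step, n_steps, rng):
    """Unadjusted Langevin dynamics: drift up the log-density gradient plus Gaussian noise."""
    theta, samples = theta0, []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        theta += step * grad_log_p(theta) + math.sqrt(2.0 * step) * noise
        samples.append(theta)
    return samples


# Hypothetical target: a Gaussian distribution over one synaptic weight.
target_mean, target_sd = 1.0, 0.5


def grad_log_target(theta):
    """Gradient of the log-density of N(target_mean, target_sd^2)."""
    return -(theta - target_mean) / target_sd ** 2


rng = random.Random(2)
samples = langevin_samples(grad_log_target, theta0=0.0, step=0.01, n_steps=100000, rng=rng)
kept = samples[1000:]  # discard an initial burn-in period

sample_mean = sum(kept) / len(kept)
sample_sd = math.sqrt(sum((s - sample_mean) ** 2 for s in kept) / len(kept))
print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
```

The weight never settles on a single value. It keeps wandering, and the histogram of where it has been approximates the target distribution (up to a small discretization bias from the finite step size).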
Enforcing balance allows local supervised learning in spiking recurrent networks
Ralph Bourdoukan, Sophie Denève
In line with previous work from Denève’s group, the paper describes how to train spiking neural networks to implement a linear dynamical system. It seems like the key advance is that the rule described here is purely local (and therefore perhaps more biologically plausible).
Some talented friends from the Columbia NeuroTheory group have started a new company called Cognescent. They’re just getting started, but keep an eye on them! They want to expand their team, so get in touch if you are looking for a job.
Accelerated Proximal Gradient Methods for Nonconvex Programming
Huan Li, Zhouchen Lin
Seminal work by Nesterov produced simple methods for smooth, convex problems that achieved fast convergence (error decaying as O(1/k²) after k iterations) using only gradient information about the objective function (see Sutskever et al., 2013 for a review). This work was extended to nonsmooth, convex problems, producing “accelerated” proximal methods (Beck & Teboulle, 2009; Parikh & Boyd, 2014). This paper by Li and Lin takes it a step further to nonsmooth, nonconvex functions. Like previous work, their algorithm produces iterative updates based on the gradient (subgradient for nonsmooth cases) and an extrapolation term (which is similar to, but not the same as, momentum). The critical insight is that these extrapolations are potentially quite bad for nonconvex problems, so they extend Beck & Teboulle’s method by monitoring each step and correcting it when it goes off in a bad direction.
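Here is a small Python sketch of the “extrapolate, then monitor” idea as I understood it. It is a simplified safeguard in the spirit of the paper rather than the authors’ exact algorithm: it runs a FISTA-style extrapolation on a tiny (convex) L1-regularized least-squares problem, and falls back to a plain proximal gradient step whenever the accelerated candidate has a worse objective. The matrix `A`, the vector `b`, the penalty `lam`, and the function names are all illustrative assumptions.

```python
import math


def residual(A, b, x):
    return [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]


def objective(A, b, x, lam):
    """F(x) = 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    r = residual(A, b, x)
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(xj) for xj in x)


def prox_gradient_step(A, b, point, step, lam):
    """Gradient step on the smooth term, then soft-thresholding (the prox of the L1 norm)."""
    r = residual(A, b, point)
    grad = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(point))]
    moved = [p - step * g for p, g in zip(point, grad)]
    return [math.copysign(max(abs(m) - step * lam, 0.0), m) for m in moved]


def monitored_apg(A, b, lam, n_iters=200):
    # 1 / ||A||_F^2 never exceeds 1 / L, so plain proximal steps cannot increase F.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * len(A[0])
    x_prev = list(x)
    t_prev, t = 1.0, 1.0
    history = [objective(A, b, x, lam)]
    for _ in range(n_iters):
        # Extrapolate past x along the previous direction of travel.
        weight = (t_prev - 1.0) / t
        y = [xi + weight * (xi - xp) for xi, xp in zip(x, x_prev)]
        z = prox_gradient_step(A, b, y, step, lam)  # accelerated candidate
        v = prox_gradient_step(A, b, x, step, lam)  # plain descent candidate
        x_prev = x
        # Monitor: keep the accelerated candidate only if it is at least as good.
        x = z if objective(A, b, z, lam) <= objective(A, b, v, lam) else v
        t_prev, t = t, (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        history.append(objective(A, b, x, lam))
    return x, history


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [1.0, 2.0, 3.0]
x_hat, history = monitored_apg(A, b, lam=0.1)
print("solution:", [round(v, 3) for v in x_hat])
print(f"objective: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because the fallback candidate is an ordinary proximal gradient step with a conservative step size, the objective can never go up from one iterate to the next, no matter how badly the extrapolation misfires.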
Tackling Nonconvex Optimization by Complexity Progression
Hossein Mobahi
The convex envelope of a nonconvex function can be optimized efficiently, and has a unique solution that is equal to the global minimum of the original function. The problem is that it is generally intractable to compute. Mobahi and Fisher (2015) show that, for certain problems, a Gaussian smoothing of the nonconvex function can be found in closed form and is the best affine approximation of the convex envelope. You can first solve a very smoothed version of the nonconvex problem, and then solve progressively less smoothed versions of the problem (i.e. use continuation methods). The basic idea is that solving each smoothed problem gives you a very good warm start on the next, more difficult optimization problem, so you arrive at a good solution. Unfortunately, when the Gaussian smoothing can’t be computed in closed form, it is expensive to compute numerically.
Update: Check out the discussion below. Closed form smoothing is possible for common nonlinearities in deep networks! Thanks to Hossein for his comments.
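The continuation idea is easy to try on a toy problem. The Python sketch below is my own example, not one from the talk: it uses the one-dimensional polynomial f(x) = x⁴ − 2x² + 0.5x, whose Gaussian smoothing has a closed form, and compares plain gradient descent against descent on progressively less smoothed versions. The starting point, step size, and smoothing schedule are arbitrary choices.

```python
def f(x):
    """Toy nonconvex objective: global minimum near x = -1.06, local minimum near x = 0.93."""
    return x ** 4 - 2.0 * x ** 2 + 0.5 * x


def smoothed_gradient(x, sigma):
    """Derivative of E[f(x + sigma * eps)] with eps ~ N(0, 1).

    Using E[eps^2] = 1 and E[eps^4] = 3, the smoothed objective is
    x^4 + (6 sigma^2 - 2) x^2 + 0.5 x + const, which is convex once sigma^2 >= 1/3.
    """
    return 4.0 * x ** 3 + 2.0 * (6.0 * sigma ** 2 - 2.0) * x + 0.5


def gradient_descent(x, sigma, step=0.01, n_steps=2000):
    for _ in range(n_steps):
        x -= step * smoothed_gradient(x, sigma)
    return x


x_start = 2.0

# Plain gradient descent on the original (unsmoothed) function.
x_plain = gradient_descent(x_start, sigma=0.0)

# Continuation: solve a heavily smoothed problem first, then warm-start each easier-to-trust one.
x_cont = x_start
for sigma in [1.0, 0.75, 0.5, 0.25, 0.0]:
    x_cont = gradient_descent(x_cont, sigma)

print(f"plain descent: x = {x_plain:.3f}, f(x) = {f(x_plain):.3f}")
print(f"continuation:  x = {x_cont:.3f}, f(x) = {f(x_cont):.3f}")
```

Starting from the same point, plain descent gets stuck in the shallow well on the right, while the continuation path follows the minimizer of the smoothed problems into the deeper well on the left. Nothing guarantees this in general, of course; it works here because the smoothed minimizer moves continuously toward the global one.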
There were also a couple of cool posters at this workshop:
Jacob Abernethy, Alex Kulesza, & Matus Telgarsky presented some simple intuitions for why deep networks are typically more powerful than wide, shallow networks. The case they examine is very simple, but it is nevertheless illuminating. There is a paper on arXiv that covers the material on the poster.
Miscellaneous things I thought were cool
Andrew Gelman gave an excellent, and thoroughly entertaining, talk on how experiments can suffer from the well-known problem of multiple comparisons, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
One of the best paper award winners, Competitive Distribution Estimation: Why is Good-Turing Good, was pretty interesting to me.
Sanjeev Arora announced a new blog on nonconvex optimization that he will write with a few colleagues.
Il Memming Park’s blog has some nice notes on the conference.
Women in Machine Learning
I’ll end with some very short comments about gender balance. First, as I already noted on Twitter, the Women in Machine Learning (WiML) poster session was fantastic. I thought it was unfortunate that it wasn’t better advertised; I only ended up there by pure accident, since I found the tutorials kind of dull. Unlike the main poster session, which was swamped with a frustrating number of people, I got to have a couple of really nice in-depth conversations about things very relevant to my interests. (I even found out about a cool way of doing factor analysis on RNA expression datasets!) It would be nice if the WiML meeting were integrated into the main program, perhaps still as a parallel track, but with more general participation.
In terms of women in the general meeting, the numbers are pretty abysmal:
I’m optimistic that the insane growth in NIPS attendance will bring greater attention and pressure for the organizers to address this issue. I found Ten Simple Rules to Achieve Conference Speaker Gender Balance a thought-provoking read, both about why this is an important topic and what we can do about it.
 This brings up an interesting research question: is there a general way of identifying layers that should be retrained, and of identifying layers that are working fine and should not be?
 Namely, they assume that (a) the stimulus is Gaussian, (b) the subunit nonlinearity is a second-order polynomial, and (c) the final nonlinearity is exponential.
 When the error landscape can be represented by polynomials or Gaussian radial basis functions, then the convolution/smoothing can be solved in closed form.
 This is another minor criticism I have of the conference. The ratio of posters to attendees is much too low. It became impossible to reach the front of the line and talk with the presenter. Having concurrent sessions/talks is seemingly inevitable given the rapid growth of the conference. The workshops were probably my favorite part for this reason.