Thursday 12 January 2017

On expressiveness, learnability and generalizability of deep learning

[Image: a Turing machine (aturingmachine.com)]
It is perhaps no coincidence that Big Data and Deep Learning popped up at roughly the same time, around 2012. And it is often said that data is to deep learning what fuel is to rockets (a line often attributed to Andrew Ng, co-founder of Coursera and Chief Scientist at Baidu).

It is true that current deep learning flourishes as it leverages big, complex data better than existing techniques. Equipped with advances in hardware (GPU, HPC), deep learning applications are more powerful and useful than ever. However, without theoretical advances, big data might have remained a big pile of junk artifacts.

Let us examine three key principles of any learning system: expressiveness, learnability and generalizability, and see how deep learning fits in.

Expressiveness

This requires a learning system that can:

  • Represent the complexity of the world. It was proved in the early 1990s that feedforward nets are universal function approximators: any well-behaved function can be approximated arbitrarily well by a suitable neural network. Note that convolutional nets are also feedforward nets, representing functions that map an image to target values (a toy sketch of the approximation idea follows below).
  • Compute anything computable. At roughly the same time, it was proved that recurrent nets are Turing-complete: any program that runs on a standard computer can be represented by a suitable recurrent neural network (RNN). It has even been suggested that Turing machines (and perhaps human brains) are themselves RNNs.
These two theoretical guarantees are powerful enough to cover any computable application, from object recognition to video understanding to automated translation to conversational agents to automated programmers. For example, one of the biggest challenges set out by OpenAI is to write a program that wins all programming challenges.
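To make the universal-approximation point concrete, here is a minimal sketch in plain NumPy (layer size, learning rate and the target function are all illustrative choices, not a prescription): a one-hidden-layer feedforward net fitted to sin(x) by gradient descent. With enough hidden units, such a net can approximate the target arbitrarily well on the training interval.

```python
import numpy as np

# Toy illustration of universal approximation: a one-hidden-layer
# feedforward net (tanh units) fitted to y = sin(x) on [-pi, pi].
rng = np.random.RandomState(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

n_hidden = 50                        # more units -> better approximation
W1 = rng.randn(1, n_hidden) * 0.5    # input-to-hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.randn(n_hidden, 1) * 0.5    # hidden-to-output weights
b2 = np.zeros(1)

lr = 0.01
for step in range(5000):
    # Forward pass
    h = np.tanh(x @ W1 + b1)         # hidden activations
    y_hat = h @ W2 + b2              # network output
    err = y_hat - y                  # residuals

    # Backward pass (gradients of the mean squared error)
    dW2 = h.T @ err / len(x)
    db2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h ** 2)   # backprop through tanh
    dW1 = x.T @ dh / len(x)
    db1 = dh.mean(axis=0)

    # Plain gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", float((err ** 2).mean()))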


Learnability

But merely proving that a neural net exists to do a job does not mean we can find that net within a reasonable time budget, unless there are efficient ways to do so. In a supervised learning setting, learnability means at least three things:

  • Have a correct computational graph that enables effective and efficient passing of information and gradients between inputs and outputs. Finding a near-optimal graph is the job of architecture engineering, which is more of an art than a science, because the space of architectures is exponentially large, if not infinite. A right graph helps in at least two ways: (i) essential information is captured, and (ii) information passing is much easier. For example, convolutional nets exploit translation invariance, which is common in images, speech and other signals. With parameter sharing and the pyramid structure, training signals are distributed evenly between layers, and even weak signals at each image patch can accumulate, enabling easier learning. And recent skip-connections allow much easier passing of information across hundreds of layers (see the residual-block sketch after this list).
  • Have flexible optimizers to navigate the rugged landscape of objective functions. Complex computational graphs are generally non-convex, meaning it is usually impossible to find the global optimum in limited time. Fortunately, adaptive stochastic gradient descent methods such as Adam, AdaDelta and RMSProp are fairly efficient: they can find good local minima in fewer than a hundred passes through the data (a minimal Adam update is sketched after this list).
  • Have enough data to statistically cover all the small variations in reality. In practice this means hundreds of thousands of data points for moderate problems, and millions for complex problems. An immediate corollary is the need for very powerful compute, which usually means lots of GPUs, RAM, time and patience.
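As a concrete illustration of the skip-connection idea from the first point, here is a minimal NumPy sketch of a residual block (the feature dimension, batch size and weight scales are hypothetical). The identity path lets information and gradients bypass the learned transformation, which is what makes very deep stacks trainable.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): the identity path gives information a shortcut."""
    h = np.maximum(0, x @ W1)        # first layer + ReLU
    return x + h @ W2                # skip connection adds the input back

rng = np.random.RandomState(0)
d = 64                               # feature dimension (hypothetical)
x = rng.randn(32, d)                 # a batch of 32 feature vectors
W1 = rng.randn(d, d) * 0.01
W2 = rng.randn(d, d) * 0.01

y = x
for _ in range(100):                 # stacking many blocks stays stable:
    y = residual_block(y, W1, W2)    # the identity path dominates with small weights
print(y.shape)                       # (32, 64)
```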
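The adaptive optimizers mentioned in the second point all follow a similar pattern. Below is a minimal sketch of the Adam update rule with its standard default hyper-parameters, applied to a toy quadratic objective; the function name and the toy problem are only illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum plus a per-parameter adaptive step size."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)   # close to the minimum at [0, 0]
```
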
Generalizability

Having a capacity to learn any function or program is not enough. The learnt program must be able to generalize to unseen data as expected. Overfitting easily occurs in modern models, where millions of parameters are common. Fortunately, with lots of data, overfitting is less of a problem. Recent advances have also introduced Dropout (and its cousins such as Maxout, DropConnect and stochastic layers) and Batch Normalization, which together help reduce overfitting significantly.
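Here is a minimal sketch of (inverted) dropout, one of the regularizers mentioned above; the function name, dropout rate and shapes are illustrative. At training time each unit is zeroed with probability p and the survivors are rescaled, so no change is needed at test time.

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=np.random):
    """Inverted dropout: randomly zero units and rescale the rest."""
    if not training or p == 0.0:
        return h                          # identity at test time
    mask = (rng.rand(*h.shape) >= p)      # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale so the expected value is unchanged

h = np.ones((4, 8))                       # a batch of hidden activations
print(dropout(h, p=0.5))                  # roughly half the entries zeroed, the rest scaled to 2.0
print(dropout(h, training=False))         # unchanged at test time
```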

This is evidenced by deep net systems that work in the wild (self-driving cars, Google Translate, Google Voice, AlphaGo).

Of course, these three concepts are not enough to make deep learning work in practice. There are hundreds of models, techniques, and programming frameworks out there to make things happen.