Wednesday, 8 March 2017

The Matrix Encoded: Matrices as first-class citizens

One thing as popular as hydrogen in the universe is the vector. Most mathematical and data analysis asks for this fundamental structure of the world. PCA, ICA, SVM, GMM, t-SNE and neural nets, to name a few, all implicitly assume a vector representation of data. The power of vectors should not be underestimated. The so-called distributed representation, which is rocking the machine learning and cognitive science worlds, is nothing but a vector representation of thought (in Geoff Hinton's words, referring to Skip-Thought vectors).

The current love for distributed representation of things (yes, THINGS, as in Internet-of-Things) has gone really far. There is a huge line of work on [X]2vec, where you can substitute [X] with [word], [sentence], [paragraph], [document], [node], [tweet], [edge] and [subgraph]. I won't be surprised to see thing2vec very soon.

But can you really compress structured things like sentences into vectors? I bet you could, given that the vector is long enough. After all, although the space of all possible sentences in a language is theoretically infinite, the majority of language usage is tightly packed, and in practice the sentence space can be mapped into a linear space of thousands of dimensions.

However, compressing data raises the question of decompressing it, e.g., to generate a target sentence in another language, as in machine translation. Surprisingly, the simplistic seq2seq trick works well in translation. But since the linguistic structures have been lost to vectorization, language generation from a single vector is more difficult. A better way is to treat each sentence as a matrix, where each column is a word embedding. This gives rise to the attention scheme in machine translation, which has turned out to be a huge success, as in Google's current Neural Machine Translation system.
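
To make the "sentence as a matrix" idea concrete, here is a minimal sketch of dot-product attention in plain Python. The function name `attend`, the toy embeddings and the simple dot-product score are all assumptions for illustration; this is not the scoring function of any particular translation system.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(sentence, query):
    """Summarise a sentence matrix with respect to a query vector.

    sentence: list of word embeddings, one per word (one column of the matrix)
    query:    a vector of the same dimension (hypothetical decoder state)
    """
    # Score each word by its dot product with the query.
    scores = [sum(w_i * q_i for w_i, q_i in zip(word, query)) for word in sentence]
    weights = softmax(scores)
    # The context vector is the weighted average of the word embeddings.
    context = [sum(weights[j] * sentence[j][i] for j in range(len(sentence)))
               for i in range(len(query))]
    return weights, context

# Toy 3-word sentence with 2-dimensional embeddings (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(sentence, query=[2.0, 0.0])
```

The point of the sketch is that the decoder never has to squeeze the whole sentence into one fixed vector: every word stays available, and the weights decide which ones matter at each step.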

Indeed, it has been well recognized that vectors alone are not enough to memorize long-distance events. The idea is to augment a vector-based RNN with an external memory, giving rise to the recent memory-augmented RNNs. The external memory is nothing but a matrix.

Enter the world of matrices

Matrices in vector space are used for linear transformation, that is, to map a vector in one space to a vector in another (possibly different) space. As mathematical objects, matrices have a life of their own, just like vectors, e.g., matrix calculus.

In NLP, it has been suggested that a noun is a vector and an adjective is really a matrix. The idea is cute, because an adjective "acts" on a noun and transforms its meaning.
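
A toy sketch of the "adjective acts on noun" picture, in plain Python. The two-dimensional meaning space (size, friendliness) and all the numbers are made up for illustration; real models learn these vectors and matrices from data.

```python
def apply(matrix, vector):
    """Matrix-vector product: each row of the matrix gives one output coordinate."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]

# Hypothetical 2-d meaning space: (size, friendliness).
dog = [1.0, 0.5]

# "big" as a matrix: doubles the size coordinate, leaves friendliness alone.
big = [[2.0, 0.0],
       [0.0, 1.0]]

big_dog = apply(big, dog)
```

The same matrix `big` can be applied to any noun vector, which is exactly what makes it a linear transformation rather than just another point in the space.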

Matrices also form a basis for parameterization of neural layers. Hence a space of multilayered neural nets is a joint space of matrices.

Our recent paper titled "Matrix-centric neural networks" (co-authored with my PhD student, Kien Do, and my boss, Professor Svetha Venkatesh) pushes the line of matrix thinking to the extreme. That is, matrices are first-class citizens. They are no longer a collection of vectors. The input, hidden layers, and the output are all matrices. An RNN is now a model of a sequence of input matrices and a sequence of output matrices. The internal memory (as in LSTM) is also a matrix, making it resemble the memory-augmented RNNs.
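
To give a feel for what a matrix-in, matrix-out layer could look like, here is a minimal sketch. The bilinear form H = tanh(U X V + B) is one plausible choice assumed here for illustration; it is not taken from the paper, and the names `matrix_layer`, `u`, `v` and `bias` are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matrix_layer(x, u, v, bias):
    """Map an input matrix to a hidden matrix: H = tanh(U X V + B)."""
    pre = matmul(matmul(u, x), v)
    return [[math.tanh(pre[i][j] + bias[i][j]) for j in range(len(pre[0]))]
            for i in range(len(pre))]

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 0.0, 1.0]]  # 4x3 input
u = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]  # 2x4: mixes the rows of x
v = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]          # 3x2: mixes the columns of x
bias = [[0.0, 0.0], [0.0, 0.0]]
h = matrix_layer(x, u, v, bias)                    # 2x2 hidden matrix
```

Under this assumed form, the layer uses 8 + 6 = 14 weights, whereas flattening the 4x3 input into a 12-vector and mapping it to 4 outputs with a dense layer would need 48. The structure of the input is kept rather than thrown away.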

To rephrase Geoff Hinton, we want a matrix representation of thought. Somehow, our neocortex looks like a matrix -- it is really a huge thin sheet of grey matter.

Maybe one day we will live in the space created by matrices.


Saturday, 25 February 2017

Column bundle: a single model for multiple multi-[X]

Supervised machine learning has a few recurring concepts: data instance, feature set and label. Often, a data instance has one feature set and one label. But there are situations when you have multi-[X], where X = instance, view (feature subset), or label. For example, in multiple instance learning, you have more than one instance, but only one label.

Things get interesting when you have multiple instances, multiple views and multiple labels at the same time. For example, a video clip can be considered as a set of video segments (instances), each of which has views (audio, visual frames and maybe textual subtitles), and the clip has many tags (labels).

Enter Column Bundle (CLB), the latest invention in my group.

CLB makes use of the concept of columns in the neocortex. In the brain, neurons are arranged in thin mini-columns, each of which is thought to cover a small sensory area called a receptive field. Mini-columns are bundled into super-columns, which are inter-connected to form the entire neocortex. In our previous work, this cute concept has been exploited to build a network of columns for collective classification. For CLB, columns are arranged in a special way:

  • There is one central column that serves as the main processing unit (CPU).
  • There are input mini-columns to read inputs for multiple parts (Input).
  • There are output mini-columns to generate labels (Output).
  • Mini-columns are only connected to the central column.

Columns are recurrent neural nets with skip-connections (e.g., Highway Net, Residual Net or LSTM). Input parts can be instances or views. The only difference is in the feature mapping: different views are first mapped into the same space.
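
The wiring described above can be sketched in a few lines of plain Python. This only illustrates the star-shaped layout (input mini-columns, one central column, output mini-columns); the element-wise weights, the mean pooling at the centre and the names `highway_step`, `column`, `column_bundle` and `params` are all simplifying assumptions, not the actual CLB model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_step(x, w_h, w_t):
    """One skip-connected step: a gate t mixes a transformed value with the input itself."""
    out = []
    for x_i, wh_i, wt_i in zip(x, w_h, w_t):
        h = math.tanh(wh_i * x_i)        # candidate transformation
        t = sigmoid(wt_i * x_i)          # transform gate in (0, 1)
        out.append(t * h + (1.0 - t) * x_i)
    return out

def column(x, w_h, w_t, steps=3):
    """A column: the same skip-connected step applied repeatedly (weights shared over steps)."""
    for _ in range(steps):
        x = highway_step(x, w_h, w_t)
    return x

def column_bundle(parts, params):
    # Input mini-columns: one per part (instance or view), connected only to the centre.
    messages = [column(p, params["in_h"], params["in_t"]) for p in parts]
    # Central column: pools the messages (mean pooling is an assumption) and processes them.
    dim = len(messages[0])
    pooled = [sum(m[i] for m in messages) / len(messages) for i in range(dim)]
    centre = column(pooled, params["mid_h"], params["mid_t"])
    # Output mini-columns: one per label, each reading only the central state.
    return [sigmoid(sum(w_i * c_i for w_i, c_i in zip(w, centre)))
            for w in params["out"]]

params = {
    "in_h": [0.5, 0.5], "in_t": [1.0, 1.0],
    "mid_h": [0.8, 0.8], "mid_t": [0.5, 0.5],
    "out": [[1.0, -1.0], [0.3, 0.7]],    # two labels
}
parts = [[0.5, -1.0], [1.5, 0.2], [0.0, 0.3]]   # three instances, already in a shared space
probs = column_bundle(parts, params)
```

Because the parts are pooled at the centre, the same function accepts any number of instances or views and always returns one probability per label.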

In a sense, it looks like a neural computer without RAM.


Sunday, 19 February 2017

Living in the future: AI for healthcare

In a not-so-distant future, it will be routine to chat with a machine and receive medical advice from it. In fact, many of us have already done something like this: seeking advice from healthcare sites, asking questions online and being recommended known answers by algorithms. The current wave of AI will only accelerate this trend.

Medicine is by and large a discipline of information, where knowledge is distributed very asymmetrically between doctors and patients. Doctors do the job well because humans are all alike, so that cases can be documented in medical textbooks and findings can be shared in journal articles and validated by others. In other words, medical knowledge is statistical, leading to the so-called evidence-based medicine (EBM). And this is exactly the reason why the current breed of machine learning, deep learning, will do well in the majority of cases.

Predictive medicine

In Yann LeCun's words, the future of AI rests on predictive learning, which is basically another way of saying unsupervised learning. Technically, this is the capability to fill in the missing slots. For those who are familiar with probabilistic graphical models, it is akin to computing pseudo-likelihood, or estimating the values of some variables given the rest.
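
For readers who have not met it, the pseudo-likelihood replaces the joint likelihood with a product of conditionals, each of which "fills a slot" given everything else. The notation is assumed here rather than taken from the original post: x = (x_1, ..., x_N) are the variables, x_{-i} denotes all of them except x_i, and θ are the model parameters.

```latex
\mathrm{PL}(\theta; \mathbf{x}) = \prod_{i=1}^{N} P_\theta\left(x_i \mid \mathbf{x}_{-i}\right),
\qquad
\log \mathrm{PL}(\theta; \mathbf{x}) = \sum_{i=1}^{N} \log P_\theta\left(x_i \mid \mathbf{x}_{-i}\right)
```

Each factor only needs a normalisation over the single variable x_i, which is what makes it much cheaper to compute than the full likelihood.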

A significant part of medicine is inherently predictive. One part is diagnosis, finding out what is happening now; the other is prognosis, figuring out what will happen if an action is taken (or not taken). While it is fair to say diagnosis is quite advanced, prognosis has a long way to go.

To my surprise as a machine learning practitioner, doctors are unreasonably poor at predicting the future, especially when it comes to mental health and genomics. Doctors are, however, excellent at explaining results after the fact. In machine learning terms, their models can practically fit anything but do not generalize well. This must come from the culture of know-it-all, where medical knowledge is limited to only a handful of people, and doctors are obliged to explain what has happened to the poor patients.

Physical health

The human body is a physical (and to some extent, a statistical) system. Hence it follows physical laws. Physiological processes, in theory, can be fully understood and predicted, at least in a closed environment. What is hard to predict is the (results of) interactions with the open environment. For example, viral infections and car accidents are hardly predictable. Hence, physical health is predictable up to an accuracy limit, beyond which computers have no hope of predicting. So don't expect the performance to be close to what we have seen in object recognition.

Mental health

Mental health is hard. No one can really tell what happens inside your brain, even if you have it opened. With roughly 86 billion neurons and on the order of a hundred trillion connections between them that give rise to mental processes, the complexity of the brain is beyond human reach at present. But mental health never goes alone. It goes hand-in-hand with physical health. A poor physical condition is likely to worsen a mental condition, and vice versa.

A good sign is that mental health is going computational. There is an emerging field called Computational Psychiatry. Its practitioners are surprisingly open to new technological ideas.

The future

AI is also eating the healthcare space, with hundreds of startups popping up each month around the world. So what can we expect in the near future, within 5 years?
  • Medical imaging diagnosis. This is perhaps the most ready space due to the availability of affordable imaging options (CT scan, ultrasound, fMRI, etc.) and recent advances in computer vision, thanks to convolutional nets. One interesting form is microscopy imaging diagnosis since getting images from microscopes can be quite cheap. Another one is facial diagnosis -- it turns out that many diseases manifest through facial expression.
  • Medical text to be better understood. There are several types of text: doctor narrative in medical records, user-generated medical text online, social health places, and medical research articles. This field will take more time to take off, but given the high concentration of talents in NLP at present, we have a reason to hope.
  • Cheap, fast sequencing techniques. Sequencing cost has recently come down to a historic milestone of $1,000, and we still have reasons to believe that it will go down to $100 in the not-too-distant future. For example, nanopore sequencing is emerging, and sequencing using signal processing will improve significantly.
  • Faster and better understanding of genomics. Once sequencing reaches a critical mass, the understanding of it will be accelerated by AI. Check out, for example, the work of Toronto professor Brendan Frey.
  • Clinical data sharing will remain a bottleneck for the years to come. Unless we have access to a massive clinical database, things will move very slowly in clinical settings. But machine learning will have to work in data-efficient regimes, too.

Beyond 5 years, it is far more difficult to predict. Some are still in the realm of sci-fi.
  • Automation of drug discovery. The chemical and biological properties of drugs will be estimated accurately by machine. The search for a drug with a desired function will be accelerated a hundredfold.
  • A full dialog system for diagnosis and treatment recommendation. You don't need to see a doctor for a $100 consultation that lasts just 10 minutes. You want a thorough consultation for free.
  • M-health, with remote robotic surgery.
  • Brain-machine interfacing, where humans will rely on machine for high bandwidth communication. This idea is from my favorite technologist Elon Musk.
  • Nano chips will enter the body in millions and kill the nasty bugs, fix the damage and get out without being kicked out by the immune system. This idea is from the 2005 book The Singularity Is Near by my favorite futurist Ray Kurzweil.
  • Robot doctors will be licensed, just like self-driving cars now.
  • Patients will be in control. No more know-it-all doctors. Patients will have full knowledge of their own health. This implies that things must be explainable, and patients must be educated about their own biological and mental health.

However, like everything else, it is easier imagined than done. Don't forget that AI in Medicine (AIIM) is a very old journal, and nothing really magic has happened yet.

What we do

At PRaDA (Deakin University, Australia), we have our own share in this space. Some of our most recent contributions are:
  • Symbolic ICU (2017), where we figure out a way to deal with ICU time-series, which are irregular and mostly missing. The work will be in the public domain soon.
  • Matrix-LSTM (2017) for EEG, where we capture the tensor-like nature of EEG signals over time. Again, the work will be in the public domain soon.
  • DeepCare (2016), where we model the course of health trajectory, which is occasionally intervened at irregular time.
  • Deepr (2016), where we aim to discover explainable predictive motifs through CNN.
  • Anomaly detection (2016), where we discover outliers in healthcare data, which is inherently mixed-type.
  • Stable risk discovery through Autoencoder (2016), where we discover structure among risk factors.
  • Generating stable prediction rules (2016), where we demonstrate that simple, and statistically stable rules can be uncovered from lots of administrative data for preterm-birth prediction at 25 weeks of gestation.
  • eNRBM (2015): understanding the group formation of medical concepts through competitive learning and prior medical knowledge.

Thursday, 12 January 2017

On expressiveness, learnability and generalizability of deep learning

[Figure: Turing machine]

It is a coincidence that Big Data and Deep Learning popped up at the same time, roughly around 2012. It is said that data is to deep learning what fuel is to rockets (this line is often attributed to Andrew Ng, co-founder of Coursera and Chief Scientist at Baidu).

It is true that current deep learning flourishes as it leverages big, complex data better than existing techniques. Equipped with advances in hardware (GPU, HPC), deep learning applications are more powerful and useful than ever. However, without theoretical advances, big data might have remained a big pile of junk artifacts.

Let us examine three key principles of any learning system: expressiveness, learnability and generalizability, and see how deep learning fits in.


Expressiveness

This requires a learning system that can:

  • Represent the complexity of the world. It was proved around 1990 that feedforward nets are universal function approximators. It means that any continuous function on a bounded domain can be approximated arbitrarily well by a suitable neural network. Note that convolutional nets are also feedforward nets, representing a function that maps an image to any target values.
  • Compute anything computable. At roughly the same time, it was proved that recurrent nets are Turing-complete. It says that any program written for a standard computer can be represented by a suitable recurrent neural network (RNN). It has even been suggested that Turing machines (and even human brains) are in fact RNNs.

These two theoretical guarantees are powerful enough to enable any computable application, from object recognition to video understanding to automated translation to conversational agents to automated programmers. For example, one of the biggest challenges set out by OpenAI is to write a program that wins all programming challenges.
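
The universal approximation claim can be made tangible with a small constructive sketch: a one-hidden-layer ReLU net whose weights are written down by hand (no training) so that it interpolates a target function at evenly spaced knots. The helper name `build_relu_net` and the choice of sin on [-π, π] are assumptions for illustration, and this shows only that a suitable net exists on a bounded interval, not how one would learn it.

```python
import math

def relu(z):
    return z if z > 0.0 else 0.0

def build_relu_net(f, lo, hi, n_hidden):
    """Build a one-hidden-layer ReLU net that interpolates f at n_hidden + 1 evenly spaced knots."""
    step = (hi - lo) / n_hidden
    knots = [lo + i * step for i in range(n_hidden + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(n_hidden)]
    # The output weight of hidden unit i is the change of slope at knot i.
    out_w = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    bias = f(lo)

    def net(x):
        # Hidden unit i computes relu(x - knots[i]); the output layer is linear.
        return bias + sum(w * relu(x - k) for w, k in zip(out_w, knots))
    return net

net = build_relu_net(math.sin, -math.pi, math.pi, n_hidden=64)
grid = [-math.pi + i * (2 * math.pi / 1000) for i in range(1001)]
max_err = max(abs(net(x) - math.sin(x)) for x in grid)
```

Doubling `n_hidden` halves the knot spacing and cuts the worst-case error of this piecewise-linear fit by roughly a factor of four, so the error can be pushed as low as desired by adding hidden units.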


Learnability

But merely proving that there exists a neural net to do a job does not mean that we can find the net within a budget of time, unless there are efficient ways to do so. In a supervised learning setting, learnability means at least three things:

  • Have a correct computational graph that enables effective and efficient passing of information and gradient between inputs and outputs. Finding a near-optimal graph is the job of architecture engineering, which is more an art than a science. This is because the space of architectures is exponentially large, if not infinite. A right graph helps with at least two things: (i) essential information is captured, and (ii) information passing is much easier. For example, convolutional nets allow translation invariance, which is often seen in images, speech and signals. With parameter sharing and the pyramid structure, training signals are distributed evenly between layers, and even weak signals at each image patch can multiply, enabling easier learning. And current skip-connections allow much easier passing of information across hundreds of layers.
  • Have flexible optimizers to navigate the rugged landscape of objective functions. The objective functions of complex computational graphs are generally non-convex, meaning it is usually impossible to find the global optimum in limited time. Fortunately, adaptive stochastic gradient descent methods, including Adam, AdaDelta and RMSProp, are fairly efficient. They can find good local minima in fewer than a hundred passes through the data.
  • Have enough data to statistically cover all small variations in reality. Practically it means hundreds of thousands of data points for moderate problems, and millions for complex problems. An immediate corollary is the need to have very powerful compute, which usually means lots of GPUs, RAM, time and patience.
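
As an illustration of the adaptive optimizers mentioned in the list above, here is a minimal sketch of the Adam update rule on a one-dimensional toy problem. The function name `adam_minimize`, the step count and the quadratic objective are assumptions chosen for clarity; real objectives are high-dimensional and non-convex, and the gradient would come from a mini-batch rather than a closed form.

```python
import math

def adam_minimize(grad, x0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimise a 1-d function given its gradient, using the Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g           # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g       # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)              # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step size adapts to the gradient scale
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3); convex, chosen only for clarity.
x_star = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Dividing by the running root-mean-square of the gradient is what makes the method "adaptive": early steps have roughly the size of `lr` regardless of how large the raw gradient is.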

Generalizability

Having the capacity to learn any function or program is not enough. The learnt program must be able to generalize to unseen data as expected. Overfitting easily occurs in modern models where millions of parameters are common. Fortunately, with lots of data, overfitting is less of a problem. Also, recent advances have introduced Dropout (and its cousins such as Maxout, DropConnect and stochastic layers) and Batch-Norm, which together help reduce overfitting significantly.

This is evidenced in deep net systems that work in the wild (self-driving cars, Google Translate/Voice, AlphaGo).
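
For readers who have not seen Dropout in code, here is a minimal sketch of the "inverted dropout" variant, which is a common way to implement the idea (the original formulation rescales at test time instead). The function name `dropout` and the toy activations are assumptions for illustration.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop during training and
    rescale the survivors so that the expected activation is unchanged."""
    if not training:
        return list(activations)           # the full network is used at test time
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                     # fixed seed so the sketch is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(hidden, p_drop=0.5, training=True, rng=rng)
test_out = dropout(hidden, p_drop=0.5, training=False, rng=rng)
```

Because a different random subset of units is silenced on every training pass, no single unit can rely on a particular partner being present, which discourages the co-adaptation that leads to overfitting.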

Of course, these three concepts are not enough to make deep learning work in practice. There are hundreds of models, techniques, and programming frameworks out there to make things happen.