Machine learns as data speak

Saturday 25 November 2017

Healthcare as an algebraic system

Throughout the history of homo sapiens, wars, famine and diseases are the main killing machines. For the first time, we have recently defeated famine and lost appetite for wars. We now turn our attention to the most basic need of all: health. Who doesn't want to live a healthy life and die in peace?

But how do understand a healthcare system, from a modeler point of view? Healthcare is a complex business. A battle field that can determine outcome of election, which may change the course of history.

I've always speculated that healthcare is an algebraic system. Medical objects such as disease, treatment, medication, theater, department and even doctor can be represented as algebraic objects. Processes, protocols, rules and the like can be defined as algebraic operators.

One simplest object representation is vector. And this is very powerful, given the fact that most data manipulation machineries are developed for vectors. Matrices, tensors, graphs and other more sophisticated representations are much less developed.

The algebraic view of things in the world has been there for long. Recently it comes in "embedding" of the objects, from word to sentence to document to graph to anything2vec.

In our recent work "Finding Algebraic Structure of Care in Time: A Deep Learning Approach", we take this view to its full. Diseases and treatments are embedded into vectors. A hospital visit is also a vector, which is computed as a difference of illness vector and treatment vector. This gives rise to simple modelling of disease-treatment motifs, and visit-visit transitions. The entire temporal progression can be modelled as recurrence of simple linear matrix-vector multiplications, or a more sophisticated LSTM.

Once the health state at each visit can be determined, decision making can be acted upon, based on the entire trajectory, with attention to the recent events, or in absence of any time-specific knowledge, using content-based addressing.

After all, healthcare is like a computer execution of a health program, which is jointly determined by the three processes: the illness, the care and the recording. It is time for a Turing machine.

Stay tuned. The future is so bright that you need to wear sun glasses.

Our recent contributions

Dual memory neural computer for asynchronous two-view sequential learning, H Le, T Tran, S Venkatesh, KDD'18
Dual control memory augmented neural networks for treatment recommendations, H Le, T Tran, S Venkatesh, PAKDD'18
Resset: A recurrent model for sequence of sets with applications to electronic medical records, P Nguyen, T Tran, S Venkatesh, IJCNN'18.
Graph Memory Networks for Molecular Activity Prediction, Trang Pham, Truyen Tran, Svetha Venkatesh, ICPR'18
Finding Algebraic Structure of Care in Time: A Deep Learning Approach, Phuoc Nguyen, Truyen Tran, Svetha Venkatesh, NIPS Workshop on Machine Learning for Health (ML4H), 2017.
Deep Learning to Attend to Risk in ICU, Phuoc Nguyen, Truyen Tran, Svetha Venkatesh, IJCAI'17 Workshop on Knowledge Discovery in Healthcare II: Towards Learning Healthcare Systems (KDH 2017).
Graph Classification via Deep Learning with Virtual Nodes Trang Pham, Truyen Tran, Hoa Dam, Svetha Venkatesh, Third Representation Learning for Graphs Workshop (ReLiG 2017).
Learning Recurrent Matrix Representation, Kien Do, Truyen Tran, Svetha Venkatesh.Third Representation Learning for Graphs Workshop (ReLiG 2017), also: arXiv preprint arXiv: 1703.01454.
Predicting healthcare trajectories from medical records: A deep learning approach,Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh, Journal of Biomedical Informatics, April 2017, DOI: 10.1016/j.jbi.2017.04.001.
Deepr: A Convolutional Net for Medical Records, Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, Svetha Venkatesh, IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 22–30, Jan. 2017, Doi: 10.1109/JBHI.2016.2633963.
Stabilizing Linear Prediction Models using Autoencoder, Shivapratap Gopakumara, Truyen Tran, Dinh Phung, Svetha Venkatesh, International Conference on Advanced Data Mining and Applications (ADMA 2016).
DeepCare: A Deep Dynamic Memory Model for Predictive Medicine, Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh, PAKDD'16, Auckland, NZ, April 2016.
Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (e-NRBM), Truyen Tran, Tu D. Nguyen, D. Phung, and S. Venkatesh, Journal of Biomedical Informatics, 2015, doi:10.1016/j.jbi.2015.01.012.
Tensor-variate Restricted Boltzmann Machines, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, AAAI 2015.
Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), Gold Coast, Australia, April 2013.

Monday 20 November 2017

Deep learning for biomedicine: A tutorial

I have dreamed big about AI for the future of healthcare.

Now, after just 9 months, it is happening at a fast rate. At the Asian Conference on Machine Learning this year (Nov, 2017) held in Seoul, Korea, I delivered a tutorial covering latest developments on the intersection at the most exciting topic of the day (Deep learning), and the most important topic of our time (Biomedicine).

The tutorial page with slides and references is here.

The time has come. Stay tuned.

Wednesday 8 March 2017

The Matrix Encoded: Matrices as first-class citizen

One thing as popular as hydrogen in the universe, is vector. Most mathematical and data analytical analysis asks for this fundamental structure of the world. PCA, ICA, SVM, GMM, t-SNE, neural nets to name a few, all implicitly assume vector representation of data. The power of vector should not be underestimated. The so-called distributed representation, which is rocking the machine learning and cognitive science worlds, is nothing but vector representation of thought (in Geoff Hinton's words, referring to Skip-Thought vectors).

The current love for distributed representation of things (yes, THINGS, as in Internet-of-Things) has gone really far. There is a huge line of work on [X]2vec, where you can substitute [X] by [word], [sentence],[paragraph], [document], [node], [tweet], [edge] and [subgraph]. I won't be surprised to see thing2vec very soon.

But can you really compress structured things like sentences into vectors? I bet you could, given that the vector is long enough. After all, although the space of all possible sentences in a language is theoretically infinite, the majority of language usage is tightly packed, and in practice the sentence space can be mapped into a linear space of thousands of dimensions.

However, compressing a data begs a question of decompressing it, e.g., to generate a target sentence in another language, as in machine translation. Surprisingly, the simplistic seq2seq trick works well in translation. But since the linguistic structures have been lost to vectorization, language generation from vector will be more difficult. A better way is to treat each sentence as a matrix, where each column is a word embedding. This gives rise to the attention scheme in machine translation, which turns out to a huge success, as in the current Google's Neural Machine Translation system.

Indeed, it has been well-recognized that vectors alone are not enough to memorize long-distant events. The idea is to augment vector-based RNN with an external memory, giving rise to the recent Memory-augmented RNNs. The external memory is nothing but a matrix.

Enter the world of matrices

Matrices in vector space are used for linear transformation, that is, to map a vector from one space, to another vector in a different space. As a mathematical object, matrices have their own life, just like vectors, e.g., matrix calculus.

In NLP, it has been suggested that noun is a vector and adjective is really a matrix. The idea is cute, because adjective "acts" on noun, which will transform the meaning of the noun.

Matrices also form a basis for parameterization of neural layers. Hence a space of multilayered neural nets is a joint space of matrices.

Our recent paper titled "Matrix-centric neural networks" (co-authored with my PhD student, Kien Do and my boss, Professor Svetha Venkatesh) pushes the line of matrix thinking to the extreme. That is, matrices are fist-class citizen. They are no longer a collection of vectors. The input, hidden layers, and the output are all matrices. The RNNs is now a model of a sequence of input matrices and a sequence of output matrices. The internal memory (as in LSTM) is also a matrix, making it resemble the Memory-augmented RNNs.

To rephrase Geoff Hinton, we want a matrix representation of thought. Somehow, our neocortex looks like a matrix -- it is really a huge thin sheet of grey matter.

May be one day we will live in the space created by matrices.

References

Learning Deep Matrix Representations, Kien Do, Truyen Tran, Svetha Venkatesh. arXiv preprint arXiv: 1703.01454. The updated version (as of August 2017) covers graphs as special case!

Saturday 25 February 2017

Column bundle: a single model for multiple multipe

Supervised machine learning has a few recurring concepts: data instance, feature set and label. Often, a data instance has one feature set and one label. But there are situations when you have multi-[X], where X = instance, view (feature subset), or label. For example, in multiple instance learning, you have more then one instance, but only one label.

Things are getting interesting when you have multiple instances, multiple views and multiple labels at the same time. For example, a video clip can be considered as a set of video segments (instances), each of which has views (audio, visual frames and may be textual subtitle), and the clip has many tags (labels).

Enter Column Bundle (CLB), the latest invention in my group.

CLB makes use of the concept of columns in neocortex. In brain, neurons are arranged in thin mini-columns, each of which is thought to cover a small sensory area called receptive field. Mini-columns are bundled into super-columns, which are inter-connected to form the entire neocortex. In our previous work, this cute concept has been exploited to build a network of columns for collective classification. For CLB, columns are arranged in a special way:

There is one central column that serves as the main processing unit (CPU).
There are input mini-columns to read inputs for multiple parts (Input)
There are output mini-columns to generate labels (Output)
Mini-columns are only connected to the central column.

Columns are recurrent neural nets with skip-connections (e.g., Highway Net, Residual Net or LSTM). Input parts can be instances, or views. The difference is only at the feature mapping: different views are first mapped into the same space.

In a sense, it looks like a neural computer without a RAM.

References

On Size Fit Many: Column Bundle for Multi-X Learning, Trang Pham, Truyen Tran, Svetha Venkatesh. arXiv preprint, arXiv:1702.07021
Column Networks for Collective Classification, T Pham, T Tran, D Phung, S Venkatesh, AAAI'17

Sunday 19 February 2017

Living in the future: AI for healthcare

In a not-so-distant future, it will be a routine to chat to a machine and receive medical advice from it. In fact, many of us have done this - seeking advice from healthcare sites, asking questions online and being recommended for known answers by algorithms. The current wave of AI will only accelerate this trend.

Medicine is by large a discipline of information, where the knowledge power is very asymmetric between doctors and patients. Doctors do the job well because humans are all alike, so that cases can be documented in medical textbooks and findings can be shared in journal articles and validated by others. In other words, medical knowledge is statistical, leading to the so-called evidence-based medicine (EBM). And this is exactly the reason why the current breed of machine learning - deep learning - will do well in majority of cases.

Predictive medicine

In Yann LeCun's words, the future of AI rests on predictive learning, which is basically an alternative way to say unsupervised learning. Technically, this is the capability to fill the missing slots. For those who are familiar with probabilistic graphical models, it is akin to computing pseudo-likelihood, or estimating values of some variables given the rest.

A significant part of medicine is inherently predictive. One is diagnosis - finding out what is happening now, and the other prognosis - figuring out what will be happening if an action (or absence of action) is done. While it is fair to say diagnosis is quite advanced, prognosis has a long way to go.

To my surprise as a machine learning practitioner, doctors are unreasonably poor at prediction into the future, especially when it comes to mental health and genomics. Doctors are, however, excellent in explaining the results after-the-fact. In machine learning's terms, their models can practically fit anything but do not generalize well. This must come from the culture of know-it-all, where medical knowledge is limited to only a handful of people, and doctors are obliged to explain what has happened to the poor patients.

Physical health

Human body is a physical (and to some extent, a statistical) system. Hence it follows physical laws. Physiological processes, in theory, can be fully understood and predictable - at least in a close environment. What are hard to predict, are the (results of) interactions with the open environment. For example, virus infection and car accidents are those hardly predictable. Hence, physical health is predictable up to an accuracy limit, beyond which computers have no hope in predicting. So don't expect the performance to be close to that we have seen in object recognition.

Mental health

Mental health is hard. No one can really tell what happens inside your brain, even if you have it opened. With hundreds of billions neurons and tens of trillions connections between them that give rise to mental processes, the complexity of the brain is beyond human reach at present. But mental health never goes alone. It goes hand-in-hand with physical health. A poor physical condition is likely to worsen a mental condition, and vice versa.

A good sign is that mental health is going computational. There is an emerging field called Computational Psychiatry. They are surprisingly open to new technological ideas.

The future

AI is also eating the healthcare stage with hundreds of startups popping up each month around the world. So what to expect in the near future within 5 years?

Medical imaging diagnosis. This is perhaps the most ready space due to the availability of affordable imaging options (CT-Scan, ultra-sound, fMRI, etc) and recent advances in computer vision, thanks to convolutional nets. One interesting form is microscopy imaging diagnosis since getting images from microscopes can be quite cheap. Another one is facial diagnosis -- It turns out, many diseases manifest through facial expression.
Medical text to be better understood. There are several types of text: doctor narrative in medical records, user-generated medical text online, social health places, and medical research articles. This field will take more time to take off, but given the high concentration of talents in NLP at present, we have a reason to hope.
Cheap, fast sequencing techniques. Sequencing cost has come down to a historic milestone of $1,000 recently, and we still have reasons to believe that it will go down to $100 in a not far future. For example, nanopore sequencing is emerging, and the sequencing using signal processing will be improved significantly.
Faster and better understanding of genomics. Once the sequencing reaches a critical mass, the understanding of it will be accelerated by AI. Check out, for example, the work of this Toronto professor, Brendan Frey.
Clinical data sharing will remain a bottleneck for the years to come. Unless we have access to a massive clinical database, things will move very slowly in clinical settings. But machine learning will have to work in data efficiency regimes, too.

Beyond 5 years, it is far more difficult to predict. Some are still in the realm of sci-fi.

Automation of drug discovery. Drug chemical and biological properties will be estimated accurately by machine. The search for a drug given a desirable function will be accelerated by hundred times.
A full dialog system for diagnosis and treatment recommendation. You don't need to see doctor for a $100 consultation for just 10 mins. You want a thorough consultation for free.
M-health, with distant robotic surgery.
Brain-machine interfacing, where humans will rely on machine for high bandwidth communication. This idea is from my favorite technologist Elon Musk.
Nano chips will enter the body in millions and kill the nasty bugs, fix the damages and get out without being kicked out by the immune system. This idea is from the 2006 book The Singularity is Near by my favorite futurist Ray Kurzweil.
Robot doctors will be licensed, just like self-driving cars now.
Patients will be in control. No more know-it-all doctors. Patients will have a full knowledge of their own health. This implies that things must be explainable, and patients must be educated about their own bio & mental.

However, like everything else, it is easy to imagine than done. Don't forget that AI in Medicine (AIIM) is a very old journal, and nothing really magic has happened yet.

What we do

At PRaDA (Deakin University, Australia), we have our own share in this space. Some most recent contributions are (as of 07/09/2018):

Healthcare processes as sequence of sets (2018), where we model the dynamic interaction between diseases and treatments.
Healthcare as Turing computational (2018), where we show that health processes can be modeled as a probabilistic Turing machine. Also see here.
Drug multiple repurposing (2018), where we predict the effect of a drug on multiple targets (proteins and diseases).
Predicting drug response from molecular structure (2017), where we use molecular structure to compute a drug representation, which is then used for predicting its bioactivity given a disease. UPDATE (07/09/2018): new version is here.
Attend to temporal ICU risk (2017), where we figure out a way to deal with ICU time-series, which are irregular and mostly missing. Again, the work will be in the public domain soon.
Matrix-LSTM (2017) for EEG, where we capture the tensor-like nature of EEG signals over time.
DeepCare (2016), where we model the course of health trajectory, which is occasionally intervened at irregular time.
Deepr (2016), where we aim to discover explainable predictive motifs though CNN.
Anomaly detection (2016), where we discover outliers in healthcare data, which is inherently mixed-type.
Stable risk discovery through Autoencoder (2016), where we discover structure among risk factors.
Generating stable prediction rules (2016), where we demonstrate that simple, and statistically stable rules can be uncovered from lots of administrative data for preterm-birth prediction at 25 weeks of gestation.
eNRBM (2015): understanding the group formation of medical concepts through competitive learning and prior medical knowledge.

Thursday 12 January 2017

On expressiveness, learnability and generalizability of deep learning

Turing machine (aturingmachine.com)

It is a coincidence that Big Data and Deep Learning popped up at the same time, roughly around 2012. And it is told that data to deep learning is fuel to rockets (this line is often attributed to Andrew Ng, co-founder of Coursera and Chief Scientist at Baidu).

It is true that current deep learning flourishes as it leverages big, complex data better than existing techniques. Equipped with advances in hardware (GPU, HPC), deep learning applications are more powerful and useful than ever. However, without theoretical advances, big data might have remained a big pile of junk artifacts.

Let us examine three key principles to any learning system: expressiveness, learnability and generalizability, and see how deep learning fits in.

Expressiveness

This requires learning system that can:

Represent the complexity of the world. It was proved in early 1990s that feedforward nets are universal function approximator. It means that any function imaginable can be represented by a suitable neural network. Note that convolutional nets are also feedforward net which represents a function that maps an image to any target values.
Compute anything computable. Roughly the same time, it was proved that recurrent nets are Turing-complete. It says that any program written down in a standard computer can be represented by a suitable recurrent neural network (RNN). It is even suggested that Turing machines (and even human brains) are indeed RNN.

These two theoretical guarantees are powerful enough to enable any computable applications, from object recognition to video understanding to automated translation to conversational agents to automated programmers. For example, one biggest challenge set out by OpenAI is to write a program that wins all programming challenges.

Learnability

But merely proving that there exists a neural net to do a job does not mean that we can find the net within a budget of time, unless there are efficient ways to do so. In a supervised learning setting, learnability means at least three things:

Have a correct computational graph that enables effective and efficient passing of information and gradient between inputs and outputs. Finding a near-optimal graph is the job of architecture engineering, which is rather an art than a science. This is because the space of architectures are exponentially large, if not infinite. A right graph helps at least two things: (i) essential information is captured, and (ii) information passing is much easier. For example, convolutional nets allow translation invariance, which is often seen in images, speech and signals. With parameter sharing and the pyramid structure, training signals are distributed evenly between layers, and even weak signals at each image patch can multiply, enabling easier learning. And current skip-connections allow much easier passing of information across hundreds of layers.
Have flexible optimizers to navigate the rugged landscape of objective functions. Complex computational graphs are generally non-convex, meaning it is usually impossible to find the global optima in limited time. Fortunately, adaptive stochastic gradient descents are fairly efficient, including Adam, AdaDelta, RMSProp, etc. They can find good local minima in less than a hundred of passes through data.
Have enough data to statistically cover all small variations in reality. Practically it means hundreds of thousand data points for moderate problems, and millions for complex problems. An immediate corollary is the need to have very powerful compute, which usually means lots of GPUs, RAM, time and patience.

Generalizability

Having a capacity to learn any function or program is not enough. The learnt program must be able to generalize to unseen data as expected. Overfitting easily occurs in modern models where millions of parameters are common. Fortunately, with lots of data, overfitting is less a problem. Also recent advances have introduced Dropout (and its cousin like Maxout, DropConnect, stochastic layers) and Batch-Norm, and they together help reduce overfitting significantly.

This is evidenced in deep nets systems that work in the wild (Self-driving cars, Google Translate/Voice, AlphaGo).

Of course, these three concepts are not enough to make deep learning work in practice. There are hundreds of models, techniques, and programming frameworks out there to make things happen.

Tuesday 27 December 2016

Deep learning as new electronics

It is hard to imagine a modern life without electronics: radios, TVs, microwaves, mobile phones and many more gadgets. Dump or smart, they are all based on the principles of semi-conducting and electromagnetism. Now we are using these devices for granted without worrying about these underlying laws of physics. Most people do not care about circuits that run in chips and carry out most functions of the devices.

For the past 5 years, a new breed of human-like functionalities has emerged through advances of a new field called deep learning: self-driving cars, voice command in mobile phone, translation in hundreds of language pairs and a new kind of art. In 2016, ten years after its revival, deep learning has taken over the Internet. People have used deep learning-powered products in daily life without worrying about how the underlying neural nets work.

These two fields free us from many physical and psychological constraints:

Electronic devices give us freedom of communication over distance, a new kind of experiences with augmented reality and many more.
Deep learning enables freedom from having to make tedious and incorrect decisions (e.g., driving a car), freedom of information access (personalization), of hand (e.g., voice command), of finance (automated trading), of feature extraction (through representation learning), and many more.

It is worth noting that electronics and deep learning are different in principles.

Electronics devices are designed with great precision for specific functions in mind. Imprecision comes from the quantum uncertainty principle and thermal fluctuations.
Neural nets on the other hand, are designed to learn to perform a function of its own, where data (and sometimes model) uncertainty is built in.

However, it is also striking that they are quite similar in many ways.

Super-city of interconnected simple parts

Modern electronic devices are truly super-cities built out of just few kinds of primitive building blocks. The same holds for deep neural nets:

Electronic primitives: resistor, capacitor, transistor, coil, diode, logic gate and switch.
Neural net primitives: integrate-and-fire neuron, multiplicative gating, differentiable logic gate, switch and attention module. Interestingly, one of the most recent idea is called "Highway networks", borrowing the idea that highway traffic is free of traffic lights.

These primitives are connected in graphs:

Electronic devices works by moving electrons in correct order and number. The force that makes them move is potential difference. A design circuit captures all necessary information.
In neural nets, the activation function is like the electronic current. The main difference is that the magnitude of "current" in neural nets can be learnt. A computational graph is what is needed for model execution.

Not just analogy: A two-way relationship

Electronics → deep learning: At present, advances in electronics have given huge boost in efficiency of deep learning with GPU, TPU and other initiatives. It is interesting to see if we can learn from electronics in designing deep nets? For example, will something analogous to integrated-circuits in deep architectures?
Deep learning → electronics: I predict that soon the reverse will hold true: deep learning will play a great role in improving efficiency and functionalities of electronic devices. Stay tuned.