Monday, 20 November 2017
I have dreamed big about AI for the future of healthcare.
Now, after just 9 months, it is happening at a fast rate. At the Asian Conference on Machine Learning this year (November 2017), held in Seoul, Korea, I delivered a tutorial covering the latest developments at the intersection of the most exciting topic of the day (deep learning) and the most important topic of our time (biomedicine).
The tutorial page with slides and references is here.
The time has come. Stay tuned.
Wednesday, 8 March 2017
One thing as abundant as hydrogen in the universe is the vector. Most mathematical and data analysis asks for this fundamental structure of the world. PCA, ICA, SVM, GMM, t-SNE and neural nets, to name a few, all implicitly assume a vector representation of data. The power of vectors should not be underestimated. The so-called distributed representation, which is rocking the machine learning and cognitive science worlds, is nothing but a vector representation of thought (in Geoff Hinton's words, referring to Skip-Thought vectors).
The current love for distributed representation of things (yes, THINGS, as in Internet-of-Things) has gone really far. There is a huge line of work on [X]2vec, where you can substitute [X] with [word], [sentence], [paragraph], [document], [node], [tweet], [edge] and [subgraph]. I won't be surprised to see thing2vec very soon.
But can you really compress structured things like sentences into vectors? I bet you could, given that the vector is long enough. After all, although the space of all possible sentences in a language is theoretically infinite, the majority of language usage is tightly packed, and in practice the sentence space can be mapped into a linear space of thousands of dimensions.
However, compressing data raises the question of decompressing it, e.g., to generate a target sentence in another language, as in machine translation. Surprisingly, the simplistic seq2seq trick works well in translation. But since the linguistic structures have been lost to vectorization, language generation from a vector will be more difficult. A better way is to treat each sentence as a matrix, where each column is a word embedding. This gives rise to the attention scheme in machine translation, which has turned out to be a huge success, as in Google's current Neural Machine Translation system.
Indeed, it has been well recognized that vectors alone are not enough to memorize long-distance events. The idea is to augment a vector-based RNN with an external memory, giving rise to the recent Memory-augmented RNNs. The external memory is nothing but a matrix.
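To make the sentence-as-matrix idea concrete, here is a minimal sketch of dot-product attention over the columns of a sentence matrix. The toy embeddings, the query vector and the function names are illustrative assumptions, not taken from any particular translation system.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(word_vectors, query):
    """Score every word embedding against the query, then average them.

    word_vectors: list of word embeddings, i.e. the columns of the sentence matrix.
    query: a vector of the same dimension (e.g. the current decoder state).
    """
    scores = [sum(w_d * q_d for w_d, q_d in zip(word, query)) for word in word_vectors]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(a * word[d] for a, word in zip(weights, word_vectors)) for d in range(dim)]
    return weights, context

# Hypothetical three-word sentence with 2-d embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 5.0]
weights, context = attend(sentence, query)
```

Unlike a single fixed-length vector, the matrix keeps every word around, and the weights decide which columns matter at each generation step.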
Enter the world of matrices
Matrices in vector space are used for linear transformation, that is, to map a vector from one space, to another vector in a different space. As a mathematical object, matrices have their own life, just like vectors, e.g., matrix calculus.
In NLP, it has been suggested that a noun is a vector and an adjective is really a matrix. The idea is cute, because an adjective "acts" on a noun, transforming the meaning of the noun.
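A toy illustration of an adjective acting on a noun, assuming made-up 2-d embeddings. The numbers and names below are hypothetical and only show the matrix-times-vector mechanics.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical embeddings: the noun is a vector, the adjective is a matrix.
noun_house = [1.0, 2.0]
adj_big = [[2.0, 0.0],
           [0.0, 1.0]]  # stretches the first dimension, leaves the second alone

big_house = matvec(adj_big, noun_house)
```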
Matrices also form a basis for parameterization of neural layers. Hence a space of multilayered neural nets is a joint space of matrices.
Our recent paper titled "Matrix-centric neural networks" (co-authored with my PhD student, Kien Do and my boss, Professor Svetha Venkatesh) pushes the line of matrix thinking to the extreme. That is, matrices are first-class citizens. They are no longer a collection of vectors. The input, hidden layers, and the output are all matrices. The RNN is now a model of a sequence of input matrices and a sequence of output matrices. The internal memory (as in LSTM) is also a matrix, making it resemble the Memory-augmented RNNs.
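As a rough sketch of what a matrix-in, matrix-out layer could look like, the snippet below assumes a bilinear map H = tanh(U X V + B). The exact form, the shapes and the names are illustrative assumptions and should not be read as the formulation used in the paper.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matrix_layer(x, u, v, bias):
    """Map an input matrix to a hidden matrix: tanh(U X V + B) (assumed form)."""
    z = matmul(matmul(u, x), v)
    return [[math.tanh(z[i][j] + bias[i][j]) for j in range(len(z[0]))]
            for i in range(len(z))]

x = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]                      # 2x3 input matrix
u = [[1.0, 1.0],
     [0.0, 1.0]]                           # 2x2, mixes the rows
v = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]                           # 3x2, mixes the columns
bias = [[0.0, 0.0],
        [0.0, 0.0]]

h = matrix_layer(x, u, v, bias)            # 2x2 hidden matrix
```

Note the parameter count: U and V together hold 10 numbers here, whereas flattening the 2x3 input and mapping it to 4 hidden units with a dense layer would need 24.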
To rephrase Geoff Hinton, we want a matrix representation of thought. Somehow, our neocortex looks like a matrix -- it is really a huge thin sheet of grey matter.
Maybe one day we will live in the space created by matrices.
- Matrix-Centric Neural Networks, Kien Do, Truyen Tran, Svetha Venkatesh. arXiv preprint arXiv:1703.01454.
Saturday, 25 February 2017
Supervised machine learning has a few recurring concepts: data instance, feature set and label. Often, a data instance has one feature set and one label. But there are situations when you have multi-[X], where X = instance, view (feature subset), or label. For example, in multiple instance learning, you have more than one instance, but only one label.
Things get interesting when you have multiple instances, multiple views and multiple labels at the same time. For example, a video clip can be considered as a set of video segments (instances), each of which has views (audio, visual frames and maybe textual subtitles), and the clip has many tags (labels).
Enter Column Bundle (CLB), the latest invention in my group.
CLB makes use of the concept of columns in the neocortex. In the brain, neurons are arranged in thin mini-columns, each of which is thought to cover a small sensory area called a receptive field. Mini-columns are bundled into super-columns, which are inter-connected to form the entire neocortex. In our previous work, this cute concept has been exploited to build a network of columns for collective classification. For CLB, columns are arranged in a special way:
- There is one central column that serves as the main processing unit (CPU).
- There are input mini-columns to read inputs for multiple parts (Input)
- There are output mini-columns to generate labels (Output)
- Mini-columns are only connected to the central column.
Columns are recurrent neural nets with skip-connections (e.g., Highway Net, Residual Net or LSTM). Input parts can be instances, or views. The difference is only at the feature mapping: different views are first mapped into the same space.
In a sense, it looks like a neural computer without a RAM.
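To illustrate what a skip-connection inside a column does, here is a minimal highway-style update with per-unit scalar weights. It is a generic sketch under that simplifying assumption, not the actual CLB equations, and all names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_step(h, w_transform, w_gate, gate_bias):
    """One highway-style update: mix a transformed state with the old state.

    new_h = gate * tanh(w_t * h) + (1 - gate) * h, computed unit by unit.
    """
    new_h = []
    for h_i, wt_i, wg_i in zip(h, w_transform, w_gate):
        candidate = math.tanh(wt_i * h_i)
        gate = sigmoid(wg_i * h_i + gate_bias)
        new_h.append(gate * candidate + (1.0 - gate) * h_i)
    return new_h

h = [0.5, -1.0, 2.0]
ones = [1.0, 1.0, 1.0]
closed = highway_step(h, ones, ones, gate_bias=-50.0)  # gate shut: state is carried over
opened = highway_step(h, ones, ones, gate_bias=50.0)   # gate open: state is transformed
```

When the gate is shut the state passes through untouched, which is what lets information (and gradients) travel across many steps.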
- One Size Fits Many: Column Bundle for Multi-X Learning, Trang Pham, Truyen Tran, Svetha Venkatesh. arXiv preprint arXiv:1702.07021.
- Column Networks for Collective Classification, T Pham, T Tran, D Phung, S Venkatesh, AAAI'17
Sunday, 19 February 2017
In the not-so-distant future, it will be routine to chat with a machine and receive medical advice from it. In fact, many of us have done this already - seeking advice from healthcare sites, asking questions online and being recommended known answers by algorithms. The current wave of AI will only accelerate this trend.
Medicine is by and large a discipline of information, where the knowledge power is very asymmetric between doctors and patients. Doctors do the job well because humans are all alike, so that cases can be documented in medical textbooks and findings can be shared in journal articles and validated by others. In other words, medical knowledge is statistical, leading to the so-called evidence-based medicine (EBM). And this is exactly the reason why the current breed of machine learning - deep learning - will do well in the majority of cases.
In Yann LeCun's words, the future of AI rests on predictive learning, which is basically an alternative way to say unsupervised learning. Technically, this is the capability to fill the missing slots. For those who are familiar with probabilistic graphical models, it is akin to computing pseudo-likelihood, or estimating values of some variables given the rest.
A significant part of medicine is inherently predictive. One part is diagnosis - finding out what is happening now - and the other is prognosis - figuring out what will happen if an action is taken (or not taken). While it is fair to say diagnosis is quite advanced, prognosis has a long way to go.
To my surprise as a machine learning practitioner, doctors are unreasonably poor at predicting the future, especially when it comes to mental health and genomics. Doctors are, however, excellent at explaining the results after the fact. In machine learning terms, their models can practically fit anything but do not generalize well. This must come from the culture of know-it-all, where medical knowledge is limited to only a handful of people, and doctors are obliged to explain what has happened to the poor patients.
The human body is a physical (and to some extent, a statistical) system. Hence it follows physical laws. Physiological processes, in theory, can be fully understood and predicted - at least in a closed environment. What is hard to predict is the (results of) interactions with the open environment. For example, viral infections and car accidents are hardly predictable. Hence, physical health is predictable up to an accuracy limit, beyond which computers have no hope of predicting. So don't expect the performance to be close to what we have seen in object recognition.
Mental health is hard. No one can really tell what happens inside your brain, even if you have it opened. With roughly a hundred billion neurons and on the order of a hundred trillion connections between them giving rise to mental processes, the complexity of the brain is beyond human reach at present. But mental health never goes alone. It goes hand-in-hand with physical health. A poor physical condition is likely to worsen a mental condition, and vice versa.
A good sign is that mental health is going computational. There is an emerging field called Computational Psychiatry. They are surprisingly open to new technological ideas.
AI is also eating the healthcare stage, with hundreds of startups popping up each month around the world. So what can we expect in the near future, within 5 years?
- Medical imaging diagnosis. This is perhaps the most ready space due to the availability of affordable imaging options (CT scan, ultrasound, fMRI, etc.) and recent advances in computer vision, thanks to convolutional nets. One interesting form is microscopy imaging diagnosis, since getting images from microscopes can be quite cheap. Another one is facial diagnosis -- it turns out that many diseases manifest through facial expression.
- Medical text to be better understood. There are several types of text: doctor narrative in medical records, user-generated medical text online, social health places, and medical research articles. This field will take more time to take off, but given the high concentration of talent in NLP at present, we have a reason to hope.
- Cheap, fast sequencing techniques. Sequencing cost has recently come down to a historic milestone of $1,000, and we still have reasons to believe that it will go down to $100 in the not-too-distant future. For example, nanopore sequencing is emerging, and sequencing based on signal processing will improve significantly.
- Faster and better understanding of genomics. Once the sequencing reaches a critical mass, the understanding of it will be accelerated by AI. Check out, for example, the work of this Toronto professor, Brendan Frey.
- Clinical data sharing will remain a bottleneck for years to come. Unless we have access to a massive clinical database, things will move very slowly in clinical settings. But machine learning will have to work in data-efficient regimes, too.
Beyond 5 years, it is far more difficult to predict. Some are still in the realm of sci-fi.
- Automation of drug discovery. The chemical and biological properties of drugs will be estimated accurately by machine. The search for a drug given a desirable function will be accelerated a hundredfold.
- A full dialog system for diagnosis and treatment recommendation. You don't need to see a doctor and pay $100 for a consultation of just 10 minutes. You want a thorough consultation for free.
- M-health, with distant robotic surgery.
- Brain-machine interfacing, where humans will rely on machine for high bandwidth communication. This idea is from my favorite technologist Elon Musk.
- Nano chips will enter the body in millions and kill the nasty bugs, fix the damage and get out without being kicked out by the immune system. This idea is from the 2005 book The Singularity Is Near by my favorite futurist Ray Kurzweil.
- Robot doctors will be licensed, just like self-driving cars now.
- Patients will be in control. No more know-it-all doctors. Patients will have full knowledge of their own health. This implies that things must be explainable, and patients must be educated about their own biological and mental health.
However, like everything else, it is easier imagined than done. Don't forget that AI in Medicine (AIIM) is a very old journal, and nothing really magic has happened yet.
What we do
At PRaDA (Deakin University, Australia), we have our own share in this space. Some of our most recent contributions are:
- Predicting drug response from molecular structure (2017), where we use molecular structure to compute a drug representation, which is then used for predicting its bioactivity given a disease.
- Attend to temporal ICU risk (2017), where we figure out a way to deal with ICU time-series, which are irregular and mostly missing. The work will be in the public domain soon.
- Matrix-LSTM (2017) for EEG, where we capture the tensor-like nature of EEG signals over time.
- DeepCare (2016), where we model the course of health trajectory, which is occasionally intervened at irregular time.
- Deepr (2016), where we aim to discover explainable predictive motifs through a CNN.
- Anomaly detection (2016), where we discover outliers in healthcare data, which is inherently mixed-type.
- Stable risk discovery through Autoencoder (2016), where we discover structure among risk factors.
- Generating stable prediction rules (2016), where we demonstrate that simple and statistically stable rules can be uncovered from lots of administrative data for preterm-birth prediction at 25 weeks of gestation.
- eNRBM (2015): understanding the group formation of medical concepts through competitive learning and prior medical knowledge.
Thursday, 12 January 2017
Turing machine (image: aturingmachine.com)
It is true that current deep learning flourishes as it leverages big, complex data better than existing techniques. Equipped with advances in hardware (GPU, HPC), deep learning applications are more powerful and useful than ever. However, without theoretical advances, big data might have remained a big pile of junk artifacts.
Let us examine three key principles of any learning system: expressiveness, learnability and generalizability, and see how deep learning fits in.
Expressiveness requires a learning system that can:
- Represent the complexity of the world. It was proved by the early 1990s that feedforward nets are universal function approximators. This means that any reasonably well-behaved function (e.g., any continuous function on a bounded domain) can be approximated to arbitrary accuracy by a suitable neural network. Note that convolutional nets are also feedforward nets, each representing a function that maps an image to target values.
- Compute anything computable. At roughly the same time, it was proved that recurrent nets are Turing-complete. This says that any program written for a standard computer can be represented by a suitable recurrent neural network (RNN). It has even been suggested that Turing machines (and even human brains) are themselves RNNs.
These two theoretical guarantees are powerful enough to cover, in principle, any computable application, from object recognition to video understanding to automated translation to conversational agents to automated programmers. For example, one of the biggest challenges set out by OpenAI is to write a program that wins all programming challenges.
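A tiny illustration of expressiveness (not a proof of universal approximation): XOR cannot be represented by a single linear threshold unit, but one hidden layer with two units is enough. The weights below are set by hand for illustration, not learnt.

```python
def step(z):
    """Hard threshold activation."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Two hidden threshold units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # OR but not AND

truth_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

Knowing that such weights exist is the easy part. Finding them from data is the learnability question.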
But merely proving that there exists a neural net to do a job does not mean that we can find the net within a budget of time, unless there are efficient ways to do so. In a supervised learning setting, learnability means at least three things:
- Have a correct computational graph that enables effective and efficient passing of information and gradient between inputs and outputs. Finding a near-optimal graph is the job of architecture engineering, which is more an art than a science. This is because the space of architectures is exponentially large, if not infinite. A right graph helps with at least two things: (i) essential information is captured, and (ii) information passing is much easier. For example, convolutional nets allow translation invariance, which is often seen in images, speech and signals. With parameter sharing and the pyramid structure, training signals are distributed evenly between layers, and even weak signals at each image patch can multiply, enabling easier learning. And current skip-connections allow much easier passing of information across hundreds of layers.
- Have flexible optimizers to navigate the rugged landscape of objective functions. The objective functions of complex computational graphs are generally non-convex, meaning it is usually impossible to find the global optimum in limited time. Fortunately, adaptive stochastic gradient descent methods, including Adam, AdaDelta and RMSProp, are fairly efficient. They can find good local minima in fewer than a hundred passes through the data.
- Have enough data to statistically cover all small variations in reality. Practically it means hundreds of thousands of data points for moderate problems, and millions for complex problems. An immediate corollary is the need to have very powerful compute, which usually means lots of GPUs, RAM, time and patience.
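To give a feel for what an adaptive optimizer does, here is a minimal sketch of the Adam update rule on a one-dimensional toy objective f(x) = (x - 3)^2. The toy problem, the step size and the function name are assumptions chosen for illustration. Real training uses noisy mini-batch gradients over millions of parameters.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-d function given its gradient, using the Adam update."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return x

# Gradient of f(x) = (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The division by the running root-mean-square of the gradient is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.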
Having the capacity to learn any function or program is not enough. The learnt program must be able to generalize to unseen data as expected. Overfitting easily occurs in modern models where millions of parameters are common. Fortunately, with lots of data, overfitting is less of a problem. Also, recent advances have introduced Dropout (and its cousins such as Maxout, DropConnect and stochastic layers) and Batch-Norm, and together they help reduce overfitting significantly.
This is evidenced in deep net systems that work in the wild (self-driving cars, Google Translate/Voice, AlphaGo).
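For readers who have not seen it, Dropout is simple to state. The sketch below shows the common "inverted" variant, assuming a plain list of activations and a seeded random generator. The function name and arguments are illustrative.

```python
import random

def dropout(values, drop_prob, training, rng):
    """Randomly zero activations during training; do nothing at test time.

    Survivors are scaled by 1 / (1 - drop_prob) so the expected value is unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(values)
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

activations = [1.0] * 1000
train_out = dropout(activations, drop_prob=0.5, training=True, rng=random.Random(0))
test_out = dropout(activations, drop_prob=0.5, training=False, rng=random.Random(0))
```

Because each training pass sees a different random sub-network, units cannot rely too heavily on one another, which is one intuition for why overfitting is reduced.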
Of course, these three concepts are not enough to make deep learning work in practice. There are hundreds of models, techniques, and programming frameworks out there to make things happen.
Tuesday, 27 December 2016
It is hard to imagine modern life without electronics: radios, TVs, microwaves, mobile phones and many more gadgets. Dumb or smart, they are all based on the principles of semiconductors and electromagnetism. Now we take these devices for granted without worrying about the underlying laws of physics. Most people do not care about the circuits that run in chips and carry out most functions of the devices.
For the past 5 years, a new breed of human-like functionalities has emerged through advances in a new field called deep learning: self-driving cars, voice commands on mobile phones, translation between hundreds of language pairs and a new kind of art. In 2016, ten years after its revival, deep learning has taken over the Internet. People have used deep learning-powered products in daily life without worrying about how the underlying neural nets work.
These two fields free us from many physical and psychological constraints:
- Electronic devices give us freedom of communication over distance, new kinds of experience with augmented reality, and much more.
- Deep learning enables freedom from having to make tedious and incorrect decisions (e.g., driving a car), freedom of information access (personalization), of hand (e.g., voice command), of finance (automated trading), of feature extraction (through representation learning), and many more.
It is worth noting that electronics and deep learning are different in principle:
- Electronic devices are designed with great precision for specific functions in mind. Imprecision comes from the quantum uncertainty principle and thermal fluctuations.
- Neural nets, on the other hand, are designed to learn to perform a function on their own, where data (and sometimes model) uncertainty is built in.
Super-city of interconnected simple parts
Modern electronic devices are truly super-cities built out of just few kinds of primitive building blocks. The same holds for deep neural nets:
- Electronic primitives: resistor, capacitor, transistor, coil, diode, logic gate and switch.
- Neural net primitives: integrate-and-fire neuron, multiplicative gating, differentiable logic gate, switch and attention module. Interestingly, one of the most recent ideas is called "Highway networks", borrowing the idea that highway traffic is free of traffic lights.
These primitives are connected in graphs:
- Electronic devices work by moving electrons in the correct order and number. The force that makes them move is the potential difference. A circuit design captures all necessary information.
- In neural nets, the activation function is like the electronic current. The main difference is that the magnitude of "current" in neural nets can be learnt. A computational graph is what is needed for model execution.
Not just analogy: A two-way relationship
- Electronics → deep learning: At present, advances in electronics have given a huge boost to the efficiency of deep learning with GPU, TPU and other initiatives. It will be interesting to see whether we can learn from electronics in designing deep nets. For example, will there be something analogous to integrated circuits in deep architectures?
- Deep learning → electronics: I predict that soon the reverse will hold true: deep learning will play a great role in improving efficiency and functionalities of electronic devices. Stay tuned.
Sunday, 25 December 2016
Neil Lawrence had an interesting observation about the current state of machine learning, and linked it to fast ball games:
“[…] the dynamics of the game will evolve. In the long run, the right way of playing football is to position yourself intelligently and to wait for the ball to come to you. You’ll need to run up and down a bit, either to respond to how the play is evolving or to get out of the way of the scrum when it looks like it might flatten you.”
Neil Lawrence is known for his work on Gaussian Processes and is a proponent of data efficiency. He used to be a professor at the University of Sheffield and is now with Amazon. Apparently the strategy works. The ball has come to him.
I once heard about a professor who said he would come to top conferences just to learn what others were busy doing and tried to do something else.
I also read somewhere from a top physicist that students who applied to work with him often expressed the wish to study shiny-and-clean fields. Some other fields were too messy and seemed unsexy. The professor insisted that the messy fields were exactly the best to work on.
In "Letters to a Young Scientist", Edward Osborne Wilson tells his life story. He spent his entire life, since childhood, cataloging ants, at a time when ant ecology wasn't a shiny field. He is considered the father of biodiversity.
Wonder what to do in deep learning now?
It is an extremely fast ball game with thousands of top players. You will be either crushed with ideas being stolen weekly, or out of steam pretty quickly.
It looks like most of the low-hanging fruit has been picked.
Then ask yourself, what is your unique position? What are your strengths and advantages that people do not have? Can you move faster than others? It may be by having access to data, access to expertise in the neighborhood, or borrowing angles outside the field. Sometimes digging up old ideas is highly beneficial, too.
Alternatively, just calm down, and do boring-but-important stuff. Important problems are like the goal areas in ball games. The ball will surely come.