Machine learns as data speak: July 2012

This post series is about analysing multiple data types. By types, we mean primitive natures such as real-valued, interval, counts, rank, binary, ordinal, categorical and multicategorical types. It is more detailed than simply saying data is numerical or nominal. For each of these types, we can still subdivide into finer sub-types, depending on the nature of the generating processes. For example, real-valued variables can be Gaussian or logistic (mean-value), Gumbel (extreme-value), or rate (exponential), etc.

Type handling is indeed important in many statistical application areas such as healthcare, business and engineering. There are good reasons for performing adequate type modelling and analysis. First, interpretation and insights are often important goals, sometimes more important than final (predictive) performance. Thus we need to understand the generative process that governs the generation of data types. For example, modelling a queuing process is critical in many applications, ranging from networking to manufacturing to supermarket scheduling. The number of arrivals within a period of time is often approximated by a Poisson distribution while the time between two arrivals by an exponential distribution. In collaborative filtering, ordinal modelling of ratings should be preferred to Gaussian treatments, because it offers insights of why and how a particular rating is chosen for a particular item by a particular user.

Second, even if predictive performance is the main goal, it is often the case that accurate modelling of data types would make it easier to achieve the goal.

Type-specific phenomena have been very well-studied, and now it is a good time to pay more attention to mixed types because real world processes are rarely isolated and they are likely to contain more than one type. Indeed, the problem is old, and there is a rich literature on modelling mixed data, but they often involve two or three types (e.g., continuous and categorical). Handling arbitrarily many types seems not to be adequately addressed in the existing literature. This series of post aim to partly fill that gap.

As an example of the situation when multiple data types appear simultaneously, let us take a typical multimedia object, where we can always have multiple ways of viewing it:

If it is described by a set of tags, then the type can be binary (we care about the presence and appearance of a word) or categorical (we care about why a particular word is present and not the others),
If it is described by a bag of visual words, then the type can be counts, or repeated categorical.
If it is described by a vector of histograms, then the type is real-valued.

Indeed, in the publicly available dataset like NUS-WIDE, all of these types are present, and I am not aware of any work in multimedia or machine learning making use of these types explicitly.

In healthcare, a patient can be characterised by various clinical and background information: number of previous admissions (count), age (continuous, or ordinal as in interval), height (continuous), test results (binary outcome or continuous measures), etc. A full account should call for an effective treatment of all the types at the same time. However, it seems that the development is till in an early stage.

In social sciences and business, surveys are one of the primary tools to measure the opinions and interests of people. The scale and impact are huge: Millions of surveys are conducted on millions of people each year. In a typical survey, a mixed bag of question types is used, and the answers often contain stated facts, preferences and choices. These can then be translated into the types such as binary (yes or no), counts (previous admissions), rank ( of political candidates), ordinal (strongly dislike, dislike, like), pairwise preferences, etc. Unfortunately, the literature of survey analysis is fully of advice about statistics for each data type and not so much about simultaneous modelling and analysis of all of these types.

Type-specific analysis in machine learning

Type modelling seems like an unsexy topic to many machine learning people. In the early days, types are explicitly handled in decision trees (e.g., popular in the 80s and early 90s): Nodes are split depending critically on their types, e.g., real-valued versus nominal. However, decision trees seem to be the problem of the day before yesterday (this is not to say they are not important anymore -- on the contrary, ensemble of tress is still one of the most competitive methods in predictive arts and they are widely used in practical data mining). There was not much interest in other developments, e.g., with neural networks, perceptrons and linear models or the more recent kernel methods.

While it is OK in practice to convert multiple data types into some unified form such as real-valued or binary (the process is also known as data coding), it may be more desirable to deal with data types directly. When the types are in the input, the coding may or may not reduce the amount of predictive information because coding can sometimes help to linearize the data, making easier for linear algorithms to do their job. However, if we want to understand the structure of the input (e.g., as in semi-supervised, self-taught and transfer learning), ignoring the data types can be problematic because the coding process partially destroys the type-specific information. For example, in a bag-of-word representation of a document, it may be more natural to deal with the counting using Poisson models than converting counts into real-values (which are an approximation), binary indicators (which lose information) or replicated binary indicators (which increase the model size).

When the types are in the output, there have been two directions: One of to model the type directly, and the reductionist approach is to convert type-specific problems into the more popular binary problems.

The former case is quite interesting, and in doing so, ML would invent a few more terminologies along the way:

For continuous outputs, the natural solution is using some regression techniques, such as ridge regression. However, this assumes the error structure is normal while it is not always necessarily so.
For binary outputs, this is really the canonical problem in machine learning.
For ordinal outputs, the problem is often called ordinal regression.
For counting outputs, a standard solution is Poisson regression.
For categorical outputs, this is often called multiclass problems, and now there are many solutions, under different names, e.g., multiclass SVM, maximum entropy, multinomial logit, multinomial probit.
For rank outputs, the problem is called label ranking.
For multiple binary outputs, the problem is often known as multilabel learning.

The main contribution from ML is the introduction of the notion of margins and the analysis of generalization and stability of algorithms, as well as specific algorithms such as decision trees, neural networks, boosting, random forests.

In the latter case, where we transform the complex problems into simpler ones, we may convert ordinal regression and ranking into a series of binary classifications from which existing methods such as SVMs can be reused. This can be an useful way since we don't have to invent new ML algorithms, this may also make the problem harder than necessary because we need to recombine binary predictors into the original form.

With regard to mixed-type analysis, its presence in machine learning is very limited, and I am only aware of a couple of attempts: here and here. The latter is ours, and we will cover in the subsequent posts.

Some research challenges

An important goal of mixed type modelling is to derive a joint distribution of all types. Ultimately, we would want a representation scheme that aids understanding of interaction among types, and at the same time, supports efficient inference (e.g., estimating marginals or expectations) and learning (e.g., structure and parameters). Probabilistic graphical models would be an excellent candidate here.

Sometimes, a full joint distribution is not needed if we care about prediction of some output variables given the rest. In machine learning, mixed-type outputs would be an instance of multitask learning where each output is a task. Thus, estimating a shared structure between output types (given arbitrary input types) would be an interesting direction here.

In some applications, on the other hand, joint distribution modelling might not be the ultimate goal. For example, we would need just a similarity measure between two mixed-type instances (e.g., in k-nearest neighbour classification, k-means clustering and kernel-methods), or a good visualisation of all instances in 2D (this often boils down to similarity estimation). This is a challenging problem because each type would lead to a measure which may not compatible with another in scale or in interpretation.

And finally, often we want to make use of some handy data analysis tools such as PCA/SVD/ICA/CCA/SVM/GLM on top of mixed data. Since these tools do not work on mixed data, the problem now is to turn mixed data into some real-valued vector representations which preserve as much as information of the original mixed-data as possible. In other words, we now have a problem of representation learning from mixed-data.

Can these challenges be overcome by a unified framework? In the subsequent posts, we would attempt to provide some early clues. Stay tuned.

Updated references:

Mixed-Variate Restricted Boltzmann Machines, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 3rd Asian Conference on Machine Learning (ACML2011), Taoyuan, Taiwan, Nov 2011.
Embedded Restricted Boltzmann Machines for Fusion of Mixed Data Types and Applications in Social Measurements Analysis, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 15-th International Conference on Information Fusion (FUSION-12), Singapore, July 2012.
Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), Gold Coast, Australia, April 2013.
Thurstonian Boltzmann machines: Learning from multiple inequalities, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of 30th International Conference in Machine Learning (ICML’13), Atlanta, USA, June, 2013.
Learning sparse latent representation and distance metric for image retrieval, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of IEEE International Conference on Multimedia and Expo (ICME), San Jose, California, USA, July 2013.
Outlier Detection on Mixed-Type Data: An Energy-based Approach, K Do, T Tran, D Phung, S Venkatesh,International Conference on Advanced Data Mining and Applications (ADMA 2016).
Multilevel Anomaly Detection for Mixed Data, K Do, T Tran, S Venkatesh, arXiv preprint arXiv: 1610.06249.

Machine learns as data speak

Saturday 7 July 2012

Mixed-type data analysis I: An overview