Machine learns as data speak: 2012

Saturday 7 July 2012

Mixed-type data analysis I: An overview

This post series is about analysing multiple data types. By types, we mean primitive natures such as real-valued, interval, counts, rank, binary, ordinal, categorical and multicategorical types. It is more detailed than simply saying data is numerical or nominal. For each of these types, we can still subdivide into finer sub-types, depending on the nature of the generating processes. For example, real-valued variables can be Gaussian or logistic (mean-value), Gumbel (extreme-value), or rate (exponential), etc.

Type handling is indeed important in many statistical application areas such as healthcare, business and engineering. There are good reasons for performing adequate type modelling and analysis. First, interpretation and insights are often important goals, sometimes more important than final (predictive) performance. Thus we need to understand the generative process that governs the generation of data types. For example, modelling a queuing process is critical in many applications, ranging from networking to manufacturing to supermarket scheduling. The number of arrivals within a period of time is often approximated by a Poisson distribution while the time between two arrivals by an exponential distribution. In collaborative filtering, ordinal modelling of ratings should be preferred to Gaussian treatments, because it offers insights of why and how a particular rating is chosen for a particular item by a particular user.

Second, even if predictive performance is the main goal, it is often the case that accurate modelling of data types would make it easier to achieve the goal.

Type-specific phenomena have been very well-studied, and now it is a good time to pay more attention to mixed types because real world processes are rarely isolated and they are likely to contain more than one type. Indeed, the problem is old, and there is a rich literature on modelling mixed data, but they often involve two or three types (e.g., continuous and categorical). Handling arbitrarily many types seems not to be adequately addressed in the existing literature. This series of post aim to partly fill that gap.

As an example of the situation when multiple data types appear simultaneously, let us take a typical multimedia object, where we can always have multiple ways of viewing it:

If it is described by a set of tags, then the type can be binary (we care about the presence and appearance of a word) or categorical (we care about why a particular word is present and not the others),
If it is described by a bag of visual words, then the type can be counts, or repeated categorical.
If it is described by a vector of histograms, then the type is real-valued.

Indeed, in the publicly available dataset like NUS-WIDE, all of these types are present, and I am not aware of any work in multimedia or machine learning making use of these types explicitly.

In healthcare, a patient can be characterised by various clinical and background information: number of previous admissions (count), age (continuous, or ordinal as in interval), height (continuous), test results (binary outcome or continuous measures), etc. A full account should call for an effective treatment of all the types at the same time. However, it seems that the development is till in an early stage.

In social sciences and business, surveys are one of the primary tools to measure the opinions and interests of people. The scale and impact are huge: Millions of surveys are conducted on millions of people each year. In a typical survey, a mixed bag of question types is used, and the answers often contain stated facts, preferences and choices. These can then be translated into the types such as binary (yes or no), counts (previous admissions), rank ( of political candidates), ordinal (strongly dislike, dislike, like), pairwise preferences, etc. Unfortunately, the literature of survey analysis is fully of advice about statistics for each data type and not so much about simultaneous modelling and analysis of all of these types.

Type-specific analysis in machine learning

Type modelling seems like an unsexy topic to many machine learning people. In the early days, types are explicitly handled in decision trees (e.g., popular in the 80s and early 90s): Nodes are split depending critically on their types, e.g., real-valued versus nominal. However, decision trees seem to be the problem of the day before yesterday (this is not to say they are not important anymore -- on the contrary, ensemble of tress is still one of the most competitive methods in predictive arts and they are widely used in practical data mining). There was not much interest in other developments, e.g., with neural networks, perceptrons and linear models or the more recent kernel methods.

While it is OK in practice to convert multiple data types into some unified form such as real-valued or binary (the process is also known as data coding), it may be more desirable to deal with data types directly. When the types are in the input, the coding may or may not reduce the amount of predictive information because coding can sometimes help to linearize the data, making easier for linear algorithms to do their job. However, if we want to understand the structure of the input (e.g., as in semi-supervised, self-taught and transfer learning), ignoring the data types can be problematic because the coding process partially destroys the type-specific information. For example, in a bag-of-word representation of a document, it may be more natural to deal with the counting using Poisson models than converting counts into real-values (which are an approximation), binary indicators (which lose information) or replicated binary indicators (which increase the model size).

When the types are in the output, there have been two directions: One of to model the type directly, and the reductionist approach is to convert type-specific problems into the more popular binary problems.

The former case is quite interesting, and in doing so, ML would invent a few more terminologies along the way:

For continuous outputs, the natural solution is using some regression techniques, such as ridge regression. However, this assumes the error structure is normal while it is not always necessarily so.
For binary outputs, this is really the canonical problem in machine learning.
For ordinal outputs, the problem is often called ordinal regression.
For counting outputs, a standard solution is Poisson regression.
For categorical outputs, this is often called multiclass problems, and now there are many solutions, under different names, e.g., multiclass SVM, maximum entropy, multinomial logit, multinomial probit.
For rank outputs, the problem is called label ranking.
For multiple binary outputs, the problem is often known as multilabel learning.

The main contribution from ML is the introduction of the notion of margins and the analysis of generalization and stability of algorithms, as well as specific algorithms such as decision trees, neural networks, boosting, random forests.

In the latter case, where we transform the complex problems into simpler ones, we may convert ordinal regression and ranking into a series of binary classifications from which existing methods such as SVMs can be reused. This can be an useful way since we don't have to invent new ML algorithms, this may also make the problem harder than necessary because we need to recombine binary predictors into the original form.

With regard to mixed-type analysis, its presence in machine learning is very limited, and I am only aware of a couple of attempts: here and here. The latter is ours, and we will cover in the subsequent posts.

Some research challenges

An important goal of mixed type modelling is to derive a joint distribution of all types. Ultimately, we would want a representation scheme that aids understanding of interaction among types, and at the same time, supports efficient inference (e.g., estimating marginals or expectations) and learning (e.g., structure and parameters). Probabilistic graphical models would be an excellent candidate here.

Sometimes, a full joint distribution is not needed if we care about prediction of some output variables given the rest. In machine learning, mixed-type outputs would be an instance of multitask learning where each output is a task. Thus, estimating a shared structure between output types (given arbitrary input types) would be an interesting direction here.

In some applications, on the other hand, joint distribution modelling might not be the ultimate goal. For example, we would need just a similarity measure between two mixed-type instances (e.g., in k-nearest neighbour classification, k-means clustering and kernel-methods), or a good visualisation of all instances in 2D (this often boils down to similarity estimation). This is a challenging problem because each type would lead to a measure which may not compatible with another in scale or in interpretation.

And finally, often we want to make use of some handy data analysis tools such as PCA/SVD/ICA/CCA/SVM/GLM on top of mixed data. Since these tools do not work on mixed data, the problem now is to turn mixed data into some real-valued vector representations which preserve as much as information of the original mixed-data as possible. In other words, we now have a problem of representation learning from mixed-data.

Can these challenges be overcome by a unified framework? In the subsequent posts, we would attempt to provide some early clues. Stay tuned.

Updated references:

Mixed-Variate Restricted Boltzmann Machines, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 3rd Asian Conference on Machine Learning (ACML2011), Taoyuan, Taiwan, Nov 2011.
Embedded Restricted Boltzmann Machines for Fusion of Mixed Data Types and Applications in Social Measurements Analysis, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 15-th International Conference on Information Fusion (FUSION-12), Singapore, July 2012.
Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), Gold Coast, Australia, April 2013.
Thurstonian Boltzmann machines: Learning from multiple inequalities, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of 30th International Conference in Machine Learning (ICML’13), Atlanta, USA, June, 2013.
Learning sparse latent representation and distance metric for image retrieval, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of IEEE International Conference on Multimedia and Expo (ICME), San Jose, California, USA, July 2013.
Outlier Detection on Mixed-Type Data: An Energy-based Approach, K Do, T Tran, D Phung, S Venkatesh,International Conference on Advanced Data Mining and Applications (ADMA 2016).
Multilevel Anomaly Detection for Mixed Data, K Do, T Tran, S Venkatesh, arXiv preprint arXiv: 1610.06249.

Sunday 24 June 2012

Markov random fields for recommender systems III: Embedding ordinal matrix factorisation

In the previous posts I have described how Markov random fields (MRFs) can be used to represent the local structures and latent spaces in recommendation. The advantage of MRFs is in its disciplined interpretation and inference. However, the employment of log-linear modelling in MRFs does not enable a simple treatment of ordinal ratings, which we have claimed to be of great importance.

The goal of this post is to develop a MRF-based model that achieves both the representational power of the probabilistic graphical models and and the ease of modelling ordinal rating by matrix factorisation. In short, our new model combines a MRF and an ordinal matrix factorisation (OMF) method. For simplicity, we shall focus only on the MRF with local structures. This is because the latent spaces have been captured by the OMF. Of course, nothing will prevent us from modelling the latent spaces by the MRF itself because the two approaches (OMF and MRF) are very different, one is linear and another is non-linear.

The OMF

Recall that the OMF aims to model an ordinal distribution \( Q(r|u,i) \). The standard way is to assume that there exists a latent utility function \( f(u,i) \) that captures how much value the user \( u \) judges the item \( i \) . For simplicity we assume that the utility has the linear, parametric form \( f(u,i) = a_u + b_i + W_u' H_i \), although nonlinear and nonparametric options can be possible. The assumption by McCullagh is that the rating is chosen based on the interval to which the utility belongs:
\[ r_{ui} = l \mbox{ if } f(u,i) \in (\theta_{l-1},\theta_{l}] \]
for \( l < L \) and
\[ r_{ui} = L \mbox{ if } f(u,i) > \theta_{L-1} \]
where \( L \) is the number of ordinal levels. Here usually we fix \( \theta_{0} = -\infty \). Other assumptions are also possible but this is by far the most popular. The probability of receiving a rating is therefore
\[ Q(r=l|u,i) = \int_{\theta_{l-1}}^{\theta_l}P(f(u,i)| \theta) = F(\theta_l) - F(\theta_{l-1}) \]
where \( F(\theta_l) \) is the cumulative distribution evaluated at \( \theta_l \). The thresholds can be prameterised to depend on user and item but care must be taken to ensure that they are monotonic in \( l \), e.g., \( \theta_{ui,l} = \theta_{ui,l-1} + e^{\eta_{u,l} + \gamma_{i,l}} \).

Depending on the choice of the distribution over \( f(u,i) \) we may end up with the logistic form of \( F(\theta_l) \) or its probit alternatives. See our recent AAAI'12 paper for a comparison.

The hybrid model of MRF and OMF

Now we wish to build a MRF model by multiplying the point-wise OMF and the item-item interactions:
\[ P(\mathbf{r}|u) \propto \Psi_u(\mathbf{r}) \prod_{i \in I(u)} Q(r_{ui}|u,i) \]
where \( \Psi_u(\mathbf{r}) > 0 \) is the potential function capturing the interaction among items and \( \mathbf{r} \) is the set of items rated by user \( u \). In essence this model promotes the agreement between the latent aspects discovered by the OMF and the local structures in the data. The usual form of the potential function is factorised into pairwise potentials
\[ \Psi_u(\mathbf{r}) = \exp \left(\sum_i\sum_{j>i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right) \]
where \( \beta_{ijm} \) is parameter and \( g_m(r_i,r_j) \) is feature function. For example, the feature function can be just indicators
\[ g_{lk}(r_i,r_j) = \delta(r_i,l) \delta(r_j,k) \]
as in standard treatment of categorical variables; or bilinear association
\[ g(r_i,r_j) = (r_i - \bar{r}_i)( r_j - \bar{r}_j)\]
as in usual Gaussian assumption where \( \bar{r}_i \) is the mean rating for item \( i \); or the metric cost
\[ g(r_i,r_j) = |r_i - r_j| ^ p \]
for \( p > 0\).

Model estimation

Like most MRFs we cannot estimate the parameters easily using the standard maximum likelihood approach. Currently, there are two most effective alternatives: The pseudo-likelihood of Besag and the stochastic gradient using truncated MCMC. The pseudo-likelihood leads to exact computation of the loss function and its gradient with respect to parameters, and thus may be preferred by some practitioners. The MCMC-based methods may, on the other hand, lead to better estimation given enough time.

Let us present here the case for pseudo-likelihood. We need to estimate the local conditional distribution
\[ P(r_i | \mathbf{r}_{\neg i}) \propto Q(r_{ui}|u,i) \exp \left(\sum_{j\in I(u), j \ne i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right) \]
Maximising the log of product of all local conditional distributions with respect to parameters will give us the result.

Rating prediction

Predicting the new rating is easy, we just need to compute \( P(r_j | \mathbf{r}) \) in the same way as we have done in the pseudo-likelihood and search for the most probable rating (for metrics such as MAE), or its expectation (for metrics such as RMSE).

Joining user-based models and item-based models

As we have always argued, each item should has its own life, much in the same way as the users. Thus we can build item-based model with user-user interactions. Since user and item are complementary entities, the two model types can be fused into one. The idea is quite simple: We just need to multiply all models together and remove duplicates (because the OMF components will appear twice):
\[ P(\mathbf{R}) \propto \left[\prod_u\Psi_u(\mathbf{r}_u)\right] \left[\prod_i\Phi_i(\mathbf{r}_i)\right] \prod_{u,i} Q(r_{ui}|u,i) \]
where \( \mathbf{R} \) is the collection of all seen ratings.

Connection to our AAAI'12 paper

In our AAAI'12 paper we suggest the following form of the utility function
\[ f(u,i) = a_u + b_i + W_u' H_i + \sum_{j \in I(u)} w_{ij} g(r_j) \]
This appears to be quite similar to our local (log) probability in the pseudo-likelihood method. Of course the fine details will be different but in essence the information captured by both the approaches may be similar. Another subtle difference is that in the MRF treatment, the pairwise interactions are symmetric (i.e., \( w_{ij} = w_{ji} \)), while in this model, it is asymmetric \( w_{ij} \ne w_{ji} \).

Previous posts: Part 1: Modelling local dependency | Part 2: Discovering latent spaces.

Thursday 21 June 2012

Markov random fields for recommender systems II: Discovering latent space

In the previous post we talked about how Markov random fields (MRFs) can be used to model local structure in the recommendation data. Local structures are powerful enough to make our MRF work, but they model only second-order interactions. We now explore the option of discovering latent spaces in the data. The idea is to capture weaker but higher-order interactions globally.

Restricted Boltzmann machines for user latent space

Restricted Boltzmann machines (RBMs) offer an interesting framework: observed ratings can be encoded into the input layer, and the latent factors are represented by the hidden layer. More precisely a RBM is a special type of MRFs where only connections across input nodes and hidden nodes are allowed. As each hidden unit (latent factor) is connected to every input units, high-order correlations among input units are nicely modelled.

Now we proceed to build RBMs for users, assuming for the moment that users are independent in their choices. And thus we build one RBM per user. This is because each user rates only a handful number of items, and thus building the full model of all items is not well justified. For now let us assume that we include only seen items into the model for estimation. At test time, we will introduce unseen items into the model assuming that the model won't change.

For simplicity, assume that latent factors are binary. Denote by \( \mathbf{r} = (r_1,r_2,...,r_N) \) the vector of input variables, and \( \mathbf{h} = (h_1,h_2,..,h_K) \) the vector of latent vectors, where \( h_k \in \{0,1\}\). The energy of the RBM reads
\[ E = - \left ( \sum_i \sum_m\alpha_{im} f_m(r_i) + \sum_k \beta_k h_k + \sum_i\sum_m\sum_k w_{ikm} f_m(r_i)h_k \right ) \]
where \( f_m(r_i) \) is feature function and \( \alpha_{im}, \beta_k, w_{ikm} \) are parameters.

Depending on your argument of the nature of the ratings, we may end up different feature functions, and thus different models such as Gaussian, categorical or ordinal. We can also incorporate item contents and user profiles into the model. For example, profile aspects can be treated as some kind of items which we always know and never have to predict. For this, perhaps a conditional RBM would make sense: We may want to model \( P(\mathbf{r} | \mathbf{p}) \) instead of \( P(\mathbf{r}, \mathbf{p}) \) where \( \mathbf{p} \) is the profile vector.

The model can now be estimated, either by deterministic methods such as pseudo-likelihood or stochastic methods such as Contrastive Divergence or persistent MCMC. Note that for the pseudo-likelihood method, we need to integrate over the hidden units, so it is not entirely the same as the original Besag's proposal, but the principle is the same.

At test time, we basically look for
\[ r_{j}^{*} = \arg\max_{r_{j}}P(r_{j}|\mathbf{r}) = \arg\max_{r_{j}}\sum_{\mathbf{h}}P(r_{j},\mathbf{h}|\mathbf{r}) \]
Thus we need to sum over the hidden layer, which is efficient to do because conditioned on the seen ratings, the graphical model is now a tree. Alternatively, we may resort to mean-field approximation: the stochastic hidden unit is replaced by its deterministic conditional expectation given the seen ratings, i.e., \(h_k \leftarrow P(h_k=1|\mathbf{r}) \). The big O complexity is the same, but it is numerically faster since we do not need to deal with the computation of expensive exponential functions.

A more subtle question is that what if we don't care about exact rating prediction, but any numerical approximation would do? This is often the case in evaluation metrics such as MAE or RMSE. The problem with the RMSE is it penalises zero-one errors to much, while it may be the case that the RBM would make arbitrary zero-one errors. A smoother version of the rating prediction would definitely help here:
\[ \bar{r}_{j}=\sum_{S=1}^{L}P(r_{j}=S|\mathbf{r})S \]
for the case of \( L \) rating scale.

Finally, the choice of number of hidden unit \( K \) is hard to estimate directly. The random projection theory (e.g., see this KDD paper) suggests that \( K \) is about the same order as the log of the number of users. In practice, however, we can use cross-validation to choose the best value. My experience suggests that once it reaches some reasonable large (e.g., 50-100), the size doesn't matter much, and we don't have to worry too much about overfitting. Perhaps the stochastic nature of the model plays the role here.

A variant of this RBM with categorical assumption of rating (which largely ignores the ordinal constraints) was introduced in ICML'07. It was demonstrated for the first time that RBMs can be a serious competitor for large-scale collaborative filtering. Since then, RBMs are an integrated part of the so-called "ensemble methods" for high performance systems.

Not-so-restricted Boltzmann machines

Recall that in the RBMs, the direct interactions between items are not considered. This is orthogonal to the standard MRF we consider in the previous post. A natural idea is therefore to combine the two approaches: We model both the direct, local interactions and the indirect, global counterparts. Hopefully this would strengthen each method and lead to better predictions.

With energy-based model such as MRFs, this is fairly easy: We just need to add up the energy function and removing the duplicate biases! The results so far are encouraging -- the combination actually works suggesting that the two approaches capture different type of data aspects.

The joint model of everything

For now we have a semi-restricted Boltzmann machine for each user. However, it is also plausible that each item deserves the same treatment! Besides, the original assumption that the user's choices are independent is not correct: Users only choose those items which are reachable, and it is known that users are influenced by the current choices of their peers.

Thus, we need to integrate all the user-specific models and item-specific models together. Fortunately, this is not so difficult for MRFs. Again, we just need to add energy functions and remove duplicates. The thing we need to worry about is learning for this complicated model. But a closer look that the model structure would suggest a way out: Conditioned on user-specific units, item-specific models are independent. The same holds if we flip the roles of users and items. Thus, an efficient method would be structured pseudo-likelihood (and mean-fields): We look for a block of units that the same time while conditioning on the rest.

Until next time, I wish you enjoy the giant model with a lot of connections and hidden units! See our UAI'09 paper for more technical details.

Previous post: Part 1: learning local dependency | Next post: Part 3: embedding ordinal matrix factorisation

Saturday 16 June 2012

Markov random fields for recommender systems I: Modelling local dependency

As I have mentioned, there are two general approaches to collaborative filtering: the latent space (e.g., using matrix factorisation) and the local structures (e.g., using graph to estimate user-user similarity or item-item similarity). A natural question is then can the two approaches be combined? Numerous evidences have suggested that the answer is yes: The latent space assumption and the local structures assumption seem to be somewhat complementary, or at least do not overlap entirely. I guess they do partly overlap because by using spectral decomposition of the similarity matrices, we can also discover a latent space, but it does not guarantee to be the same as those obtained by factorising the rating matrix.

The only issue is that the combination of the two methods so far appears a bit heuristic, or at least the two approaches do not come from the same unified representation. The quest for such a unification may be more of theoretical interest rather than practical ones: After all with ensemble methods, we can combine any different methods as long as they are fairly diverse.

It turns out, probabilistic graphical models (PGMs) nicely offer such a representation so that the ratings/preferences data can be seen as a probabilistic database from which many queries can be answered probabilistically. For example, rating prediction can be cast as an inference problem where we need to infer the probability that an user will assign a rating value to a particular item. More importantly, it separates out the issue of model building (e.g., structure specification and parameter estimation) from inference (e.g., rating prediction or item ranking).

There are two general ways to specify a PGM: using directed models and using undirected models. Existing directed models for collaborative filtering appeared earlier, and they include Bayesian networks and those variants of the so-called topic models: the PLSA and LDA. The main problem with directed models is that we need to be really careful in our model design to make sure that the structure is a DAG (there will be no loops), and every link from a parent to a child makes some sense. Incorporating features and domain knowledge can be hard and generally requires great design skills to make the model correct.

Undirected models, also known as Markov random fields, Markov networks or their variants such as Botlzmann machines, restricted Boltzmann machines, relational Markov networks, and Markov logic networks are more flexible: They allow loops and can incorporate arbitrary features extracted from data or side constraints. This post will solely focus on this undirected class.

Markov random fields for user-specific local structures

Let us start with the local structures (or the neighbourhood). Similarity scores such as Pearson's correlation or cosine are perfectly OK on their own (parameter estimation stage) but their combination to make prediction does not seem to be theoretically guided (inference stage). Markov random fields (MRFs), on the other hand, allow both the stages to be guided in a principled way.

First we need to specify the model structure. For clarity, let us first start with the assumption that users are largely independent in their choice. We will argue later that this assumption is usually wrong in practice, but let the assumption hold for a while.

Our goal is to build graphs, one per user, so that ratings are modelled as node variables, and interaction between items is modelled as edges. Here we depart from the standard use of MRFs by building multiple graphs instead of just one. This is because users only rate only a handful number of items, and building the full graph for all items is not very attractive: First, the graph will be too big for efficient handling. Second, most of the variables will be missing, making learning and inference cumbersome. And finally, assuming most missing variables to be 'just' missing does not seem right: Most items shouldn't be there, or at best, are irrelevant to the taste of the user.

The key here is that although graphs are different, they all share the same set of parameters. Otherwise, learning will not happen robustly, and prediction will be impossible with unidentified parameters.

So we end up including only those seen items into the user-specific graph. Note that it is certainly not optimal, because they will be surely other unseen items which will be highly relevant to the user. For example, at test time we need to include unseen items to make inference. But it doesn't seem to be an issue: At test time, we just need to plug one test item at a time, assuming that the user-specific MRF is unchanged in their parameters, and then we do inference \[ r^* = \arg\max_r P(r | \mathbf{r}(u)) \] where\( r \) denotes the rating, \( \mathbf{r}(u) \) denotes the set of seen ratings by user \( u \) .

The next question is how dense the graph should be? Ideally, the graph should be fully connected because all the items will interact with each other under the same user. However, this is not efficient: The number of edges is quadratic in number of items. Second, it makes learning overfit to weakly interacting pairs.

A better way should be making the graph sparse, and it we would eliminate weakly interacting items. If we do the graph structure estimation automatically, this would be the so-called structure learning in probabilistic graphical models. A more heuristic way is to use some correlation measures: If the two items are strongly positively or negatively correlated, their interaction should be kept. However, it is not easy to determine the right amount of interaction strength. An easy way is to keep only K most strongly interacting neigbours for each item.

A hidden question is how should we compute the correlation measure. There are number of information sources we can exploit here: From the ratings alone, we can look for the common users between two items. From the content, we can match two items if their descriptions are similar (e.g., two movies directed by the same directors or played by the same lead actor/actress). Thus the use of content can be implicit.

Are we done with the structure yet? Not quite. MRFs are undirected, and thus the connections are symmetric: if node j is connected to i then i must also be connected to j. However, the edges discovered from the K-nearest neighours are asymmetric: if j is a neighbour of i, it does not necessarily mean i is a neighbour of j. Thus from the K-nearest neighbours per node we need to add more edges to make the neighbourhood consistent.

The next step is parameterisation. We would use the following energy function
\[ E (\mathbf{r}_u) = -\Big ( \sum_i\sum_m \alpha_{im} g_m(r_i) + \sum_i\sum_{j>i} \sum_m\beta_{ijm} f_m(r_i,r_j) \Big ) \] where \( g_m(r_i) \) and \( f_m(r_i,r_j) \) are feature functions capturing the properties of the ratings and the interaction between two items; \( \alpha_{im} \) and \( \beta_{ijm} \) are parameters.

We will not go into the details of feature design as it is more an art than science. We wish to comment that we can come up with Gaussian random fields, categorical random fields or ordinal random fields, depending on the choice of these functions. See our UAI'09 paper for the comparison of these three model types.

Note that we can easily incorporate item-content and user-profile into these feature functions, making the model an attractive hybrid network which will help combating against the so-called cold-start problem. This is a clear advantage of undirected models where we can just integrate diverse sources of information with little extra cost.

Once the model parameters have been estimated, the prediction for unseen item j is quite easy:
\[ r_j^* = \arg\min_{r_j} E (\mathbf{r}_u, r_j) \]
From user-model to item-model to integrated model

Let us return to the assumption we made earlier: Users are independent in making their choice. This is not always the case: Users are generally influenced by their peers and the media. From the data representation point of view, the rating matrix does not distinguish the role of columns and rows -- by a transpose we still have the same matrix!

Thus it is plausible that we can build a model for each item in the same way that we did earlier with the users. The only difference is that now we need to model user-user interactions.

Given the fact that we now have either user-specific models or item-specific models, the natural question is can these two approaches be unified?

The answer is yes, and in a simple way: We just need to add the two energies functions together. Now we have a giant MRF connecting every user and every item.

From rating-prediction to item-ranking

Recall that the real goal in recommender systems is less about rating prediction but more about producing a rank list of unseen items for a particular user. It turns out, there is a way: If an new item causes more expected reduction in model energy, the item is likely to be more influential, and it should be ranked higher.

To sum up, we now have a principled way to integrate users, items, contents and profiles into the same model which supports structure specification, parameter estimation, rating prediction and item ranking. For more technical details, we refer to our AusDM'07 paper.

In the next part of the post, we will focus on how MRFs can be used to discover latent spaces and how to integrate with the local structures.

Next posts Part 2: discovering latent spaces | Part 3: embedding ordinal matrix factorisation

Tuesday 12 June 2012

Ranking by estimating a cone

Geometry is a fascinating world, and in fact many machine learning algorithms can be expressed in the geometrical language: hyper-planes as in SVM, space partitioning as in k-d trees, decision-trees and other non-parametric Bayesian methods, manifold learning, information geometry, to name a few.

Here we will be concerned about a particular geometrical object known as cone (or to be more cultural, have a look at the so-called conical Asian hat). A cone has a special geometrical property that any non-negative combination of vectors belong to it will also belong to it. Mathematically it says for a cone \( C \), if \( x,y \in C \) then \( \alpha x + \beta y \in C \) for any \( \alpha,\beta > 0 \). But doesn't seem to be interesting.

What interests us more is the cone allows us to define a more general concept of inequality. Now we can freely compare vectors instead of points: A vector is said to be greater than another if the difference belongs to a cone. And thus, we can write
\[ x {\succeq}_C y \]
to denote that vector \( x \) is larger than or equal to vector \( y \) with respect to cone \( C \). By definition, we have
\[ x-y \in C \].

It turns out, this concept of generalized inequality can be exploited in our problem of learning to rank (LTR) and collaborative ranking. Let's see how.

Recall that in LTR, one must order objects according to their preferences or perceived utility and relevance. A standard way is to estimate a ranking functional which takes a pair of (query,object) and returns a real score. As we have already mentioned, this is not the only way. We can take a pair of objects and return their ordering.

Now suppose we are given two objects and their true ordering (as always in training data). For convenience we will assume that each (query,object) pair is represented by a vector. How to come up with this vector is an interesting problem on it own, and this is pretty much domain-specific. For example, in Web search engines, the vector can contain elements of relevance measures from different criteria such as TF-IDF, Okapi BM25, title matching and quality measures such as domain authority, structural designs, number of in-links, PageRank, timeliness, or so.

What we are interested here is the fact that we have an ordering between the two vectors. By ordering, we assume that there exists an inequality between the them. And now, the generalized inequality with a cone will come in. What is missing is the cone itself: We don't know it in advance, and thus it must be estimated. And we shall assume that there will be only one cone for the problem at hand, although for some problems more than one cone may be needed.

First, we need a parametric way to represent the cone. Recall that any non-negative combination of two vectors in a cone will stay in a cone. This can be generalised easily: any non-negative combination of any number of vectors in a cone will also stay in a cone. This suggests a way: a cone can be represented by a several basis vectors and all other vectors can be generated from this basis set. This cone is polyhedral in the sense that it is both a polyhedron and a cone.

Let us denote by \( u_1,u_2,..,u_K \) the \( K \) basis vectors. An ordering can be represented as
\[ x-y = \sum_{k=1}^K w_k^{(xy)} u_k \]
where \( w_k^{(xy)} > 0 \) are the coefficients for the pair \( (x,y) \). Thus we are left with two unknown sets of parameters to estimate: the shared basis vectors and the pairwise coefficients. However, we will not cover the details of the estimation for now but refer to our recent work (in progress) here.

Now let's assume that we have estimated the basis vectors, what can we do for prediction? It seems since we do not know which order is best, we need to try both and compare the error rate of the two directions. But this is quite inconvenient because of the non-negative constraints. Fortunately, we found an easy way: just do unconstrained regression in one direction and check for the sign of the sum of the coefficients. If the sign is positive, then the direction is correct; otherwise, the other direction is more accurate ordering.

Saturday 9 June 2012

Collaborative ranking II: The estimation

Designing the rank functional

Let us for now concentrate on the the point-wise rank functional which takes as input the pair (user,item) and returns a real score. Note that we do not mean point-wise loss functions, which will be discussed later in this post.

As inspired by the SVD approach in standard (real-valued) rating prediction (especially after the Neflix challenge), it is natural to go for a function of the form

\[ f(u,i) = W_u'V_i \]

where \( W_u \) is the feature vector of user \( u \), \( V_i \) is the feature vector of item \( i \). Indeed, this is not the only way because the function \( f \) is a measure of similarity in the feature space. So probably kernels are applicable here (this particular inner product is really a linear kernel).

Note that the features don't have to be discovered if they are available, e.g., user profiles and item content descriptions. Or the features can be mixed: some are latent (a.k.a. random effects), and some are given (a.k.a. fixed effects). Our forthcoming paper in AAAI'12 offers even more interesting fixed features: the popular items (and their associated ratings for this particular user) themselves can also be highly informative features.

What is more interesting is that the user features and item features don't have to stay in the same space. What we need to do is to build a bilinear functional form

\[ f(u,i) = W_u' A V_i \]

where \( A \) where is the association matrix. Since \( A \) may be huge, we may choose to factorise it into two smaller matrices for robust estimation, i.e., \( A = B'C \) where the ranks of \( B \) and \( C \) are small compared to the size of features. That is effectively equivalent to applying linear transforms to the user features and item features so that the projected features belong to the same space: \( W_u \rightarrow BW_u \) and \( V_i \rightarrow CV_i \).

The rank 1 problem

One of the annoying properties of the point-wise rank functional with the linear factorisation as discussed above is that some some datasets, the number of hidden features can be as little as 1 or 2. More precisely, the rank of the two matrices \( W, V \) can be 1 or 2. This is hugely different from standard SVD in rating prediction, where the effective dimensionality is often in the range of 50-500, depending on how much data we have. In fact, we don't even need any fancy learning, just ranking items according to their popularity will do a good job. See this paper for more discussion.

It has been suggested that we should prepare test data that contains only items which are not popular and are controversial (i.e., the rating variance/entropy is high among users). With more diversity, more features will be needed.

The true reasons behind this phenomenon are still unexplored. My guess is that in standard rating prediction, since it is important to get each rating right, we need to map the user's profile into a high dimensional space of item ratings. On the other hand, in ranking with point-wise rank functional, we just need to map into a single real line. Thus the number of bits needed for rating prediction is larger than what needed for ranking. My conjecture is that the number of bits needed grows with the degree of the polynomial which is best describing the rank functional.

The loss functions

In order to estimate the rank functional, we need to minimise some loss function. Ideally the loss function should be the error metric used in evaluation. Typically rank metrics discount the items deep in the list. Well-known examples include: NDCG at top T, MRR, MAP, and more recently the Yahoo! ERR. Unfortunately, these metrics are non-smooth, and thus it is difficult to optimise directly. A reasonable method is to approximate the rank by some smooth functions, for example

\[ \bar{r}_i = \sum_{j \ne i} P(f(i) > f(j)) \]

where \( P(f(i) > f(j)) \) is the probability that item \( i \) is ranked higher than item \( j \). One useful probability form is the sigmoid function

\[ P(f(i) > f(j)) = \frac{1}{1+e^{-\gamma(f(i) - f(j))}} \]

where \( \gamma > 0 \) is some scale parameter. The main problem with this approximate loss is that it can be non-convex and flat when the pairwise probability is close to 0 or 1 making optimisation not a straightforward exercise.

Another strategy is to apply some surrogate loss functions. It is desirable that the loss function takes the whole list into account (the listwise loss). For example, the Luce's model (subsequently rediscovered by Plackett) reads:
\[ P(f(1) > f(2)) > ... > f(M)) = \prod_{i=1}^M \frac{e^{f(i)}}{\sum_{j=i}^M e^{f(j)}} \]
However, there has been work proving that even pointwise losses and pairwise losses are good upper bounds of some popular rank metrics.

Taking ties into account

Up to this point, we haven't addressed the mismatch between the ratings as in training data and the ranking as in the test output. In order words, the labels are in the form of ordinal variables, while the the end goal is to produce a rank.

In theory it is easy to convert the ordinal values into ranks. However, since the number of ordinal levels is usually small (e.g., as in 5-star scale), there will be many tied ranks. Standard sorting typically ignores ties and thus gives you a rank where ties are broken arbitrarily depending on how you index your items. Thus much information is lost during this conversion.

A better way is to treat ties directly. Early methods in the statistical literature include the Rao-Kupper method in 1967 and Davidson method in 1970 for extending the standard pairwise Bradley-Terry model. Last year we made a contribution (published in SDM'11) to this area by treating the ties entirely in the list, without breaking them into pairs. The idea is that, ranking with ties can be seen as an instance of the problem class of simultaneous partitioning a set into subsets and ordering the subsets. It turns out the space of such combinatorial operation is extremely large -- it is known in combinatorics that the space size is the Fubini number or ordered Bell number. However, we don't have to evaluate such a combinatorial space when constructing the loss function. A nice result is that there exists at least a case where the loss function can be computed in linear time.

[Part 1] [Part 3]

Sunday 3 June 2012

Collaborative ranking I: Problem setting

As I have mentioned in the previous post, and there are two important aspects in modelling data of collaborative filtering (CF): one is the ordinal nature of the ratings, and another one is the end goal of CF in producing a ranked list (of new items for a particular user, or of new users for a particular item) rather than a set of ratings. This post is about the latter aspect. Indeed, I plan a series of posts, so stay tuned.

For clarity, we shall focus exclusively on producing a ranked item list for a particular user. We leave to another post the orthogonal problem of producing a ranked user list for a particular item.

There are really two steps involved in producing an item list: (i) identify a candidate list, and (ii) rank them. Let us look at each of them.

Identifying a candidate set

This seems trivial but it is not. One might think that we can use all the remaining items, but it is not very effective. First, each user is really interested in only some items. Second, including all the good items is not necessarily satisfactory to the user: They often want a mix of items of different tastes, even though some of they may not be of high quality. And third, how can you measure the quality of the candidate set? What if all the candidate items are of high quality but none of them will be actually in the future list?

There are a number of ways to produce a candidate list. The most intuitively way is to look for other items which similar users have seen. More specifically, we first identify top K similar users and aggregate all the items seen by them, minus those already seen by this particular user. This poses two questions: (a) what is the similarity measure, and (b) what is the best K? Adequate answers to each question deserves a research paper itself. So we will not pursue here.

There are of course other system criteria -- one might look to maximise the purchase rate, or the total revenue per user, the reduction of inventory, the higher degree of diversity, or so.

Producing the ranked list

Given the candidate set, a straightforward way to produce the ranked list is to first predict the rating for each item, and then sort them. For this, we do not need any specialized ranking algorithm, but it is less exciting: One might argue that since our objective is ranking, we should learn a ranking function in the first place.

What is a rank function? The simplest rank function might take a pair of (user,item) and return a real score. These scores will be used to sort items. For this the ranking will be easy -- the cost is usually Nlog(N) for standard sort algorithms. The drawback of this method is that it implicitly assumes that all items are independent, whereas indeed they are not, because after all, they are liked by the same user (and not just one user). Nevertheless, this method is by far very popular, dating back to the Luce's axiom of choices in the 1950s, to the recent rise of learning-to-rank in information retrieval and label ranking in machine learning.

There is another way: a rank function can take a triple (user,item1,item2) and return the ordering between item1 and item2. Ranking among all items will be based on voting, e.g., number of times a particular item is ranked higher than the others. The cost of this method will be N(N-1)/2. One drawback of this method is that it does not enforce the transitivity, i.e., if item1 is ranked higher than item2, and item2 higher than item3, then item1 should be ranked higher than item3. It also does not take higher-order interaction among items into account. Despite of these shortcomings, this method is widely practised, and one of the earliest citations is Bradley-Terry's work.

This leaves the most informative method: Think of the rank function is a way to produce a permutation whose utility for the user is maximum. In other word, we must solve a set-wise problem rather than the point-wise or pair-wise problems. However, there is a big problem here: the computational complexity is bounded by N!, which is too expensive for even moderate N. Thus, we would not address this method for the moment.

Updated references:

Probabilistic Models over Ordered Partitions with Applications in Document Ranking and Collaborative Filtering, T. Truyen, D. Phung, and S. Venkatesh, in Proc. of SIAM Int. Conf. on Data Mining (SDM11), April, Arizona, USA, 2011.
Learning from Ordered Sets and Applications in Collaborative Ranking, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 4th Asian Conference on Machine Learning (ACML2012), Singapore, Nov 2012.
Preference Relation-based Markov Random Fields in Recommender Systems, Shaowu Liu, Gang Li,Truyen Tran, Jiang Yuan, ACML'15, November 20-22, 2015, Hong Kong.

[Part 2]

Thursday 31 May 2012

Ratings in collaborative filtering should be treated ordinally

Collaborative filtering (CF), like its umbrella, recommender systems, is hot right now. Commercial CF systems have been implemented by major players like Amazon, Google News, TiVo, and Netflix. Within the academic world, CF research is now regularly delivered in popular venues such as information retrieval (or information filtering), information science, knowledge management, social media, data mining, machine learning, AI, statistics, human-computer interaction, drug-discovery, social sciences, marketing, and of course, the ACM RecSys annual conferences.

The basic idea behind CF is that, since people are interested in common items, and thus there must exist some shared structures, e.g., some people are really similar in their tastes or some items are really similar in their characteristics. The sharing will eventually help to predict how much an user will like an item he or she hasn't seen before.

Depending on your problem representation, you will end up with different algorithms.

One particularly appealing representation is matrix (e.g., see here). Here we are given an incomplete user-item matrix, where observed entries are often ratings by users on items they have seen. The tasks are either to predict the ratings of remaining entries, or to produce a list of unseen items for each user (or other way around, a list of new users for an item).

Under this representation, the shared structure can be captured by the low-rank decomposition, that is, the user-item matrix is really a product of an user-feature matrix and an item-feature matrix, where features are latent aspect of each user and each item. The number of latent features is often small compared to the original matrix sizes. In effect, we project both users and items onto the same latent space, and the preference is measured by some similarity between an user and an item in that space.

Another representation is bipartite graph, where users and items are nodes in the graph, and a weighted link between an user and an item indicates the level of interest. The problem is now to predict the missing links and their weights.

However, these two representations are somewhat complementary since they do not overlap entirely. These are evidenced in recent successful methods which combine both the approaches: here, here and here.

Surprisingly, both camps did not even consider the fact that (i) the ratings are indeed ordinal, (ii) we more or less interested in ranking items rather than rating them.

The second problem deserves separate posts, and here we shall focus on the first one.

Ordinal variables are those categorical variables whose relative orders between categories is the most important information, such as those in standard survey options: {very bad, bad, OK, good and very good}. You may be familiar with number of stars given to a movie, but these stars shouldn't be treated as a (real or integer) number. This is because, first, people are different in their use of rating values -- for some a 4 star movie means very excellent, but for other, it is may be just OK. Second, you can't be sure that these stars are spaced equally -- people are really bad at making consistent quantitative judgement.

On the other extreme, ordinal variables may be treated as standard categorical variable, where each time we pick only one category, assuming that the utility of that category is the highest among all categories. However, this treatment effectively ignores the fact that ordinal values are ordered. Under the categorical treatment, all mistakes are equally bad, but in a proper ordinal treatment, a mistake that is far from the true category should be considered as worse than those near by in the ordinal scale.

The absence of adequate ordinal treatment in CF might suggest some degree of ignorance in machine learning research: we sometimes do not pay attention to existing research in statistics. As for ordinal data, the statistical community has been addressed it for many decades under the umbrella of ordinal regression, and a hugely influential paper was written in 1980 by McCullagh. One may argue that the natures of the collaborative filtering and ordinal regression are different. In fact, they do not: The setting of collaborative filtering is identical to that of the so-called "multi-raters" problem, and the modelling of the raters itself has been long known as the random effect models.

Only very recently, in CF there has been some work addressing the ordinal nature of the data: here, here, here, here and here. The last two are ours, one was published in UAI'09 and the other one will appear in AAAI this year. The UAI'09 paper exploits the ordinal property in the feature construction but it doesn't really model ordinal ratings. The AAAI'12 paper addresses this shortcoming and the main contribution is the deviation from the standard McCullagh's model by adopting the sequential decision approach of Mare (1980).

To be more concrete, the McCullagh's model assumes that there exists a continuous utility that the user perceives when seeing an item. The ordinal level is determined by examining the interval to which the utility belongs. This property, when combining with the underlying assumption of the utility distribution, gives rise to several names: grouped continuous model, proportional odds models, cumulative logit model or ordered probit models. This model will work well if there exists the true interpretation of utility interval, such as income ranges, severity levels, etc.

On the other hand, the sequential decision approach assumes the decision is made progressively starting from the lowest level. If the utility is beyond the level's threshold, the next level will be considered. Otherwise, the current level is selected. This approach is attractive in the cases where stagewise progresses are made, for example, the education level, the health recovery status, etc.

The question is then, which approach is the best for CF? There are no easy answers because we do not know the true decision making behind the rating processes. For now we could only make some guess and verify on real datasets. That that is what we reveal in our forthcoming paper: The sequential approach is as good as, if not better than, the McCullagh's approach, for a number of large-scale datasets.

Updated References

Ordinal Boltzmann Machines for Collaborative Filtering. Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In Proc. of 25th Conference on Uncertainty in Artificial Intelligence, June, 2009, Montreal, Canada.
A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 25-th Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, July 2012.
Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 4th Asian Conference on Machine Learning (ACML2012), Singapore, Nov 2012.
Ordinal random fields for recommender systems, Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang, ACML'14, Nha Trang, Vietnam, Nov 2014.
Preference Relation-based Markov Random Fields in Recommender Systems, Shaowu Liu, Gang Li,Truyen Tran, Jiang Yuan, ACML'15, November 20-22, 2015, Hong Kong.
Collaborative filtering via sparse Markov random fields, Truyen Tran, Dinh Phung, Svetha Venkatesh, Information Science, doi:10.1016/j.ins.2016.06.027.

Monday 21 May 2012

Machine learning at its turning point: Non-convexity

Machine learning had enjoyed the last decade of specifying and optimising convex regularised risks. It seems that we have reached the point when we need to move on, accepting the fact that global optimum may not be optimal after all.

To me, learning based on convex functions is desirable as long as we know what we are looking for. In many cases, however, we do not, and we are more interested in discovery. And one of the most effective way is to loosely specify latent variables which, hopefully captures the generative process of the data. Whenever we introduce latent variables, it is likely that we hit a non-convex objective function. In fact, the goal now is not to find a optimal configuration given the known, well-behaved search space, but to search for reasonably effective configurations in a largely unknown space. It is actually evidenced in most powerful natural living systems, such as human brains: They are very effective and time-tested, but they are hardly optimal in any sense.

This move from convex to non-convex is not linear. In fact, non-convex machine learning systems have been around since the beginning of computer-based modelling. The so-called Latent Variable Modelling has been around for decades in statistical sciences, and the standard, non-Bayesian way to optimise the incomplete data likelihood is usually the EM algorithm. One of the famous variant of the EM is the Baum-Welch algorithm for training HMMs. Another example is neural networks: As soon as they are getting interesting by having one ore more hidden layers, they are inherently non-convex.

The recent rise of deep architectures clearly illustrates this point further: One of the most basic building block, the Restricted Boltzmann Machines, has its representation power due to the hidden layer.

Now, given the objective function, if we can compute it at all, is non-convex, then how can we optimise it? There are two strategies: One is to approximate it by a surrogate convex function and iterate until converged and another one is to optimise the function directly. The EM algorithm is actually the first type, and it has been extremely popular. But is it the best way?

Hardly.

It is a nice way because we can have some sort of guarantee for its convergence to local optimum. But after all, it is still a local climbing method. Being clever at using a convex surrogate does not guarantee that it can stay away from bad optima.

It turns out the second strategy is equally good, although it is much less popular in the probabilistic world. Today we have at hand very efficient implementations of Conjugate Gradients and Limited-Memory Quasi-Newton methods for large-scale problems. Although these were perhaps intended for convex functions, they can be applicable for non-convex problems as long as we accept local optima as a reasonable choice.

Outside the probabilistic world, these techniques are hardly anything exciting because neural network people have been using them for years. The exciting parts are (i) how can we exploit the problem structures to obtain better algorithms, and (ii) whether we need to search for global optima at all?

I intend to leave this two questions unanswered, but the answer for the second question may be: No. And the rest of this post is about the first strategy.

In machine learning, there is a well-known method called Concave-Convex Procedure (CCCP) which was introduced a decade ago to deal with the non-convex situations. The idea is based on the observation any non-convex function can be decomposed into a convex part and a concave part. The interesting point is that, although the original authors made some attempt to relate it to the existing line of work in the global optimisation community: the Difference-of-Convex function Algorithm (DCA), machine learning people almost never cite the DCA, which is the earlier (around 1994-1998) and its body of knowledge is much more comprehensive.

I discovered about in 2005 that the log-likelihood of log-linear models with hidden variables is actually in the D.C. form. That fact is interesting but hardly a new discovery. What is more interesting is that when we apply CCCP/DCA to solve that problem, we actually obtain an update that is identical to that when we apply the EM.