## Sunday, 24 June 2012

### Markov random fields for recommender systems III: Embedding ordinal matrix factorisation

In the previous posts I have described how Markov random fields (MRFs) can be used to represent the local structures and latent spaces in recommendation. The advantage of MRFs is in its disciplined interpretation and inference. However, the employment of log-linear modelling in MRFs does not enable a simple treatment of ordinal ratings, which we have claimed to be of great importance.

The goal of this post is to develop a MRF-based model that achieves both the representational power of the probabilistic graphical models and and the ease of modelling ordinal rating by matrix factorisation. In short, our new model combines a MRF and an ordinal matrix factorisation (OMF) method. For simplicity, we shall focus only on the MRF with local structures. This is because the latent spaces have been captured by the OMF. Of course, nothing will prevent us from modelling the latent spaces by the MRF itself because the two approaches (OMF and MRF) are very different, one is linear and another is non-linear.

The OMF

Recall that the OMF aims to model an ordinal distribution $$Q(r|u,i)$$. The standard way is to assume that there exists a latent utility function $$f(u,i)$$ that captures how much value the user $$u$$ judges the item $$i$$ . For simplicity we assume that the utility has the linear, parametric form $$f(u,i) = a_u + b_i + W_u' H_i$$, although nonlinear and nonparametric options can be possible. The assumption by McCullagh is that the rating is chosen based on the interval to which the utility belongs:
$r_{ui} = l \mbox{ if } f(u,i) \in (\theta_{l-1},\theta_{l}]$
for $$l < L$$ and
$r_{ui} = L \mbox{ if } f(u,i) > \theta_{L-1}$
where $$L$$ is the number of ordinal levels. Here usually we fix  $$\theta_{0} = -\infty$$. Other assumptions are also possible but this is by far the most popular. The probability of receiving a rating is therefore
$Q(r=l|u,i) = \int_{\theta_{l-1}}^{\theta_l}P(f(u,i)| \theta) = F(\theta_l) - F(\theta_{l-1})$
where $$F(\theta_l)$$ is the cumulative distribution evaluated at $$\theta_l$$. The thresholds can be prameterised to depend on user and item but care must be taken to ensure that they are monotonic in $$l$$, e.g., $$\theta_{ui,l} = \theta_{ui,l-1} + e^{\eta_{u,l} + \gamma_{i,l}}$$.

Depending on the choice of the distribution over $$f(u,i)$$ we may end up with the logistic form of $$F(\theta_l)$$ or its probit alternatives. See our recent AAAI'12 paper for a comparison.

The hybrid model of MRF and OMF

Now we wish to build a MRF model by multiplying the point-wise OMF and the item-item interactions:
$P(\mathbf{r}|u) \propto \Psi_u(\mathbf{r}) \prod_{i \in I(u)} Q(r_{ui}|u,i)$
where $$\Psi_u(\mathbf{r}) > 0$$ is the potential function capturing the interaction among items and $$\mathbf{r}$$ is the set of items rated by user $$u$$. In essence this model promotes the agreement between the latent aspects discovered by the OMF and the local structures in the data. The usual form of the potential function is factorised into pairwise potentials
$\Psi_u(\mathbf{r}) = \exp \left(\sum_i\sum_{j>i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right)$
where $$\beta_{ijm}$$ is parameter and $$g_m(r_i,r_j)$$ is feature function. For example, the feature function can be just indicators
$g_{lk}(r_i,r_j) = \delta(r_i,l) \delta(r_j,k)$
as in standard treatment of categorical variables; or bilinear association
$g(r_i,r_j) = (r_i - \bar{r}_i)( r_j - \bar{r}_j)$
as in usual Gaussian assumption where $$\bar{r}_i$$ is the mean rating for item $$i$$; or the metric cost
$g(r_i,r_j) = |r_i - r_j| ^ p$
for $$p > 0$$.

Model estimation

Like most MRFs we cannot estimate the parameters easily using the standard maximum likelihood approach. Currently, there are two most effective alternatives: The pseudo-likelihood of Besag and the stochastic gradient using truncated MCMC. The pseudo-likelihood leads to exact computation of the loss function and its gradient with respect to parameters, and thus may be preferred by some practitioners. The MCMC-based methods may, on the other hand, lead to better estimation given enough time.

Let us present here the case for pseudo-likelihood. We need to estimate the local conditional distribution
$P(r_i | \mathbf{r}_{\neg i}) \propto Q(r_{ui}|u,i) \exp \left(\sum_{j\in I(u), j \ne i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right)$
Maximising the log of product of all local conditional distributions with respect to parameters will give us the result.

Rating prediction

Predicting the new rating is easy, we just need to compute $$P(r_j | \mathbf{r})$$ in the same way as we have done in the pseudo-likelihood and search for the most probable rating (for metrics such as MAE), or its expectation (for metrics such as RMSE).

Joining user-based models and item-based models

As we have always argued, each item should has its own life, much in the same way as the users. Thus we can build item-based model with user-user interactions. Since user and item are complementary entities, the two model types can be fused into one. The idea is quite simple: We just need to multiply all models together and remove duplicates (because the OMF components will appear twice):
$P(\mathbf{R}) \propto \left[\prod_u\Psi_u(\mathbf{r}_u)\right] \left[\prod_i\Phi_i(\mathbf{r}_i)\right] \prod_{u,i} Q(r_{ui}|u,i)$
where $$\mathbf{R}$$ is the collection of all seen ratings.

Connection to our AAAI'12 paper

In our AAAI'12 paper we suggest the following form of the utility function
$f(u,i) = a_u + b_i + W_u' H_i + \sum_{j \in I(u)} w_{ij} g(r_j)$
This appears to be quite similar to our local (log) probability in the pseudo-likelihood method. Of course the fine details will be different but in essence the information captured by both the approaches may be similar. Another subtle difference is that in the MRF treatment, the pairwise interactions are symmetric (i.e., $$w_{ij} = w_{ji}$$), while in this model, it is asymmetric $$w_{ij} \ne w_{ji}$$.

Previous postsPart 1: Modelling local dependencyPart 2: Discovering latent spaces.