Saturday 17 December 2016

Markov random fields: Parameter learning I


In this post, I will review some practical learning methods that are applicable to large-scale Markov random fields (MRFs), e.g., those with millions of parameters. This has some overlap with this post, but explains the ideas in more detail.

Here I use the term "Markov random fields" to refer to a class of multivariate models whose probability (density) distributions involve a normalisation constant \( Z \), known as the partition function:
\[  P(\mathbf{x};\theta) = \frac{1}{Z(\theta)} \Psi(\mathbf{x})  \]
where \( \theta \) is the model parameter.

This class also includes conditional random fields (CRFs), relational Markov networks (RMNs) and, to some extent, Markov logic networks. The main difference in CRFs and other conditional models is that the partition function is input-dependent, that is, we need to write \( Z(\mathbf{y},\theta) \) where \( \mathbf{y} \) denotes the input, i.e., we need to compute a partition function for every data point. A similar situation arises when the graph of the MRF is data-dependent, e.g., in our previous application of MRFs to collaborative filtering, we build an MRF for each user to exploit the sparsity of the data. In a non-conditional MRF, on the other hand, we need only one partition function.
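To make the role of the partition function concrete, here is a minimal sketch (illustrative names only, on a made-up 3-node binary chain, not tied to any particular library) that computes \( Z(\theta) \) by exhaustive enumeration of all configurations; at any realistic scale this is exactly the computation we want to avoid.

```python
import itertools
import numpy as np

def log_potential(x, theta):
    """Hypothetical unnormalised log-potential log Psi(x) for a 3-node binary chain."""
    node_theta, edge_theta = theta                     # biases and pairwise couplings
    score = sum(node_theta[i] * x[i] for i in range(3))
    score += edge_theta[0] * x[0] * x[1] + edge_theta[1] * x[1] * x[2]
    return score

def partition_function(theta):
    # Z(theta) = sum of Psi(x) over all 2^3 binary configurations
    return sum(np.exp(log_potential(x, theta))
               for x in itertools.product([0, 1], repeat=3))

theta = (np.array([0.1, -0.2, 0.3]), np.array([0.5, -0.5]))
Z = partition_function(theta)
# P(x; theta) = exp(log_potential(x, theta)) / Z for any configuration x
print(Z)
```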

The estimation of \( Z(\theta) \) is generally intractable except for tree-structured MRFs, and indeed, if we know how to estimate \( Z(\theta) \), we are likely to be able to infer most other quantities (e.g., expectations of features). For example, Wainwright has demonstrated that by minimising an upper bound of the log-partition function, we can arrive at a quite stable message-passing algorithm from which marginals can be estimated efficiently.

However, even approximate estimation of \( Z(\theta) \) can be very expensive, especially in the learning setting, because it depends on the model parameters being estimated, i.e., for every parameter update we would need to recompute \( Z(\theta) \). The trick is to avoid computing it explicitly.

Maximum likelihood

Denote by \(G = (V,E)\) the underlying graph, where \( V \) is the set of nodes and \( E \) is the set of edges. For simplicity, we will assume a log-linear parameterisation for edges and nodes, that is, the model potential function has the following energy-based form
\[
 \Psi(\mathbf{x}) = \exp( -  F(\mathbf{x}) )
\]
where the energy \( F(\mathbf{x}) \) is decomposed into parts
 \[
F(\mathbf{x}) = -\left( \sum_{i \in V} \sum_k\theta_{ik} f_{ik}(x_i)  + \sum_{ij \in E} \sum_m\theta_{ijm} g_{ijm}(x_i,x_j)\right)
\]
where \(f_{ik}(x_i) \) and \( g_{ijm}(x_i,x_j) \) are feature functions.

Note that this model is somewhat overparameterised, but this form will be very helpful when we have to deal with rich input features in the conditional setting (i.e., conditional random fields).
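As a concrete illustration, the following sketch evaluates the energy \( F(\mathbf{x}) \) of a small binary chain, assuming indicator features (i.e., \( f_{ik}(x_i) = 1 \) iff \( x_i = k \), and similarly for the edge features); the 3-node graph and the variable names are illustrative only.

```python
import numpy as np

def energy(x, node_theta, edge_theta, edges):
    """F(x) = -( sum_i theta_{i,x_i} + sum_{(i,j) in E} theta_{ij,(x_i,x_j)} )
    for indicator features over discrete states."""
    value = -sum(node_theta[i, x[i]] for i in range(len(x)))
    value -= sum(edge_theta[a, x[i], x[j]] for a, (i, j) in enumerate(edges))
    return value

# A 3-node chain with binary states; random parameters for illustration
edges = [(0, 1), (1, 2)]
node_theta = np.random.randn(3, 2)               # theta_{ik}
edge_theta = np.random.randn(len(edges), 2, 2)   # theta_{ijm}
x = [0, 1, 1]
print(energy(x, node_theta, edge_theta, edges))  # Psi(x) = exp(-F(x))
```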

For now, let us assume that the training data is fully observed. We will cover partially observed variables later, e.g., for the case of restricted Boltzmann machines (RBMs).

Maximum likelihood estimation starts with the data log-likelihood
\[
L(\theta) = \frac{1}{D}\sum_{l=1}^D \log P(\mathbf{x}^{(l)};\theta)
\]
where \( D \) is the number of observations. This criterion is statistically powerful, as it leads to a consistent and asymptotically efficient estimate.

Substituting \( \log P(\mathbf{x};\theta) = -F(\mathbf{x}) - \log Z(\theta) \), this can be rewritten as
\[
L(\theta) =  -  \frac{1}{D}\sum_{l=1}^D F(\mathbf{x}^{(l)}) - \log Z(\theta)
\]

Note that the partition function does not depend on the observations in this setting, while it does depend on the input in the conditional setting.

The first part of the right-hand side is straightforward to compute, but the partition function is still there. A nice property of the log-partition function is that it is convex in \( \theta \), which makes the log-likelihood concave. Thus, if we can find a way to maximise the log-likelihood, any local maximum we reach is guaranteed to be the global maximum.

In a typical optimisation, we would need the gradient of the log-likelihood. For example, for a pairwise parameter:
\[
 \frac{\partial L}{\partial \theta_{ijm}} = \frac{1}{D}\sum_{l=1}^D g_{ijm}(x_i^{(l)},x_j^{(l)}) -
\sum_{x_i,x_j} P(x_i,x_j;\theta)\, g_{ijm}(x_i,x_j)
\]
In the literature, especially that associated with Boltzmann machines, the computation of the first part of the right-hand side is often referred to as the clamped phase, and the second part as the free phase. This is because in the former, the variables are clamped to their observed values, while in the latter, they are free to vary according to the model distribution.

Another view is that the first part is the expectation of the feature function with respect to the empirical distribution, while the second part is the expectation with respect to the model distribution. The maximum likelihood solution is obtained by solving the gradient equation, i.e., by matching the two expectations (or moments); hence the term moment matching is sometimes used.
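To see the moment-matching view in code, here is a self-contained toy sketch (illustrative names; exact inference is done by brute-force enumeration, which is only viable for such tiny models): the gradient with respect to \( \theta_{ijm} \) is the empirical pairwise frequency minus the exact model pairwise marginal.

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2)]
n, K = 3, 2                                            # 3 binary nodes
node_theta = 0.1 * np.random.randn(n, K)               # theta_{ik}
edge_theta = 0.1 * np.random.randn(len(edges), K, K)   # theta_{ijm}

def energy(x):
    value = -sum(node_theta[i, x[i]] for i in range(n))
    return value - sum(edge_theta[a, x[i], x[j]] for a, (i, j) in enumerate(edges))

# Exact model distribution by enumerating all K^n configurations
states = list(itertools.product(range(K), repeat=n))
p = np.array([np.exp(-energy(x)) for x in states])
p /= p.sum()

data = [(0, 1, 1), (1, 1, 0), (0, 0, 0)]               # toy observations

# dL/dtheta_{ijm} with indicator features g_{ijm}(x_i,x_j) = [(x_i,x_j) = m]:
# empirical pairwise frequencies (clamped phase) minus model marginals (free phase)
grad_edge = np.zeros_like(edge_theta)
for a, (i, j) in enumerate(edges):
    for x in data:
        grad_edge[a, x[i], x[j]] += 1.0 / len(data)
    for x, px in zip(states, p):
        grad_edge[a, x[i], x[j]] -= px
print(grad_edge)  # all entries are zero at the maximum likelihood solution
```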

We now have three options for learning:
1. Approximate objective function
2. Approximate Z
3. Approximate gradients

As for optimisation, we may not need a full evaluation of the log-likelihood, and thus the partition function itself is not needed. The idea here is to work with the gradient only and hope that learning still progresses as expected.
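A minimal sketch of this gradient-only strategy is given below; `sample_model` is a hypothetical stand-in for whatever approximate sampler or inference routine is available (e.g., a few Gibbs sweeps), and note that a single update never touches \( Z(\theta) \).

```python
import numpy as np

def gradient_step(theta, data_feats, sample_model, lr=0.1):
    """One ascent step on L(theta) using only a gradient estimate.

    data_feats:   (D, P) array of feature vectors of the D observations
    sample_model: callable theta -> (S, P) array of feature vectors of S
                  (approximate) samples from P(x; theta)
    """
    empirical = data_feats.mean(axis=0)        # clamped phase
    model = sample_model(theta).mean(axis=0)   # free phase, approximated
    return theta + lr * (empirical - model)    # log Z(theta) is never evaluated
```

Methods such as (persistent) contrastive divergence, covered later, can be viewed as particular ways of producing this approximate free-phase term.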

In the subsequent posts, we will cover some of the following methods:
  • Pseudo-likelihood
  • Structured pseudo methods
  • Approximate gradients with truncated MCMCs, aka (Persistent) Contrastive Divergence
  • Piecewise methods
  • Tree-based methods
  • Herding
  • Label relaxation
  • Structured mean-fields
  • Approximate perceptron
  • Wainwright moment matching
  • Conditional versus non-conditional
  • Mean-field approximation as Recurrent Neural Networks.
