Machine learns as data speak: Collaborative ranking II: The estimation

Designing the rank functional

Let us for now concentrate on the the point-wise rank functional which takes as input the pair (user,item) and returns a real score. Note that we do not mean point-wise loss functions, which will be discussed later in this post.

As inspired by the SVD approach in standard (real-valued) rating prediction (especially after the Neflix challenge), it is natural to go for a function of the form

\[ f(u,i) = W_u'V_i \]

where \( W_u \) is the feature vector of user \( u \), \( V_i \) is the feature vector of item \( i \). Indeed, this is not the only way because the function \( f \) is a measure of similarity in the feature space. So probably kernels are applicable here (this particular inner product is really a linear kernel).

Note that the features don't have to be discovered if they are available, e.g., user profiles and item content descriptions. Or the features can be mixed: some are latent (a.k.a. random effects), and some are given (a.k.a. fixed effects). Our forthcoming paper in AAAI'12 offers even more interesting fixed features: the popular items (and their associated ratings for this particular user) themselves can also be highly informative features.

What is more interesting is that the user features and item features don't have to stay in the same space. What we need to do is to build a bilinear functional form

\[ f(u,i) = W_u' A V_i \]

where \( A \) where is the association matrix. Since \( A \) may be huge, we may choose to factorise it into two smaller matrices for robust estimation, i.e., \( A = B'C \) where the ranks of \( B \) and \( C \) are small compared to the size of features. That is effectively equivalent to applying linear transforms to the user features and item features so that the projected features belong to the same space: \( W_u \rightarrow BW_u \) and \( V_i \rightarrow CV_i \).

The rank 1 problem

One of the annoying properties of the point-wise rank functional with the linear factorisation as discussed above is that some some datasets, the number of hidden features can be as little as 1 or 2. More precisely, the rank of the two matrices \( W, V \) can be 1 or 2. This is hugely different from standard SVD in rating prediction, where the effective dimensionality is often in the range of 50-500, depending on how much data we have. In fact, we don't even need any fancy learning, just ranking items according to their popularity will do a good job. See this paper for more discussion.

It has been suggested that we should prepare test data that contains only items which are not popular and are controversial (i.e., the rating variance/entropy is high among users). With more diversity, more features will be needed.

The true reasons behind this phenomenon are still unexplored. My guess is that in standard rating prediction, since it is important to get each rating right, we need to map the user's profile into a high dimensional space of item ratings. On the other hand, in ranking with point-wise rank functional, we just need to map into a single real line. Thus the number of bits needed for rating prediction is larger than what needed for ranking. My conjecture is that the number of bits needed grows with the degree of the polynomial which is best describing the rank functional.

The loss functions

In order to estimate the rank functional, we need to minimise some loss function. Ideally the loss function should be the error metric used in evaluation. Typically rank metrics discount the items deep in the list. Well-known examples include: NDCG at top T, MRR, MAP, and more recently the Yahoo! ERR. Unfortunately, these metrics are non-smooth, and thus it is difficult to optimise directly. A reasonable method is to approximate the rank by some smooth functions, for example

\[ \bar{r}_i = \sum_{j \ne i} P(f(i) > f(j)) \]

where \( P(f(i) > f(j)) \) is the probability that item \( i \) is ranked higher than item \( j \). One useful probability form is the sigmoid function

\[ P(f(i) > f(j)) = \frac{1}{1+e^{-\gamma(f(i) - f(j))}} \]

where \( \gamma > 0 \) is some scale parameter. The main problem with this approximate loss is that it can be non-convex and flat when the pairwise probability is close to 0 or 1 making optimisation not a straightforward exercise.

Another strategy is to apply some surrogate loss functions. It is desirable that the loss function takes the whole list into account (the listwise loss). For example, the Luce's model (subsequently rediscovered by Plackett) reads:
\[ P(f(1) > f(2)) > ... > f(M)) = \prod_{i=1}^M \frac{e^{f(i)}}{\sum_{j=i}^M e^{f(j)}} \]
However, there has been work proving that even pointwise losses and pairwise losses are good upper bounds of some popular rank metrics.

Taking ties into account

Up to this point, we haven't addressed the mismatch between the ratings as in training data and the ranking as in the test output. In order words, the labels are in the form of ordinal variables, while the the end goal is to produce a rank.

In theory it is easy to convert the ordinal values into ranks. However, since the number of ordinal levels is usually small (e.g., as in 5-star scale), there will be many tied ranks. Standard sorting typically ignores ties and thus gives you a rank where ties are broken arbitrarily depending on how you index your items. Thus much information is lost during this conversion.

A better way is to treat ties directly. Early methods in the statistical literature include the Rao-Kupper method in 1967 and Davidson method in 1970 for extending the standard pairwise Bradley-Terry model. Last year we made a contribution (published in SDM'11) to this area by treating the ties entirely in the list, without breaking them into pairs. The idea is that, ranking with ties can be seen as an instance of the problem class of simultaneous partitioning a set into subsets and ordering the subsets. It turns out the space of such combinatorial operation is extremely large -- it is known in combinatorics that the space size is the Fubini number or ordered Bell number. However, we don't have to evaluate such a combinatorial space when constructing the loss function. A nice result is that there exists at least a case where the loss function can be computed in linear time.

[Part 1] [Part 3]

Machine learns as data speak

Saturday 9 June 2012

Collaborative ranking II: The estimation

No comments:

Post a Comment