Machine learns as data speak: 2014

Tuesday 17 June 2014

Mixed-type data analysis II: Pairwise models

Analysing mixed-type data is hard. A popular way is to consider one pair of types at a time, e.g., continuous and binary. A simple method is to utilise the standard product rule \( P(A,B) = P(A)P(B|A) \), where \(A\) can be continuous and \(B\) can be binary. In this particular case, \(P(A)\) can be a Gaussian distribution and \(P(B|A)\) a logistic model, where \(A\) plays some role in the location, intercept or scale parameters of \(P(B|A)\).
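To make the factorisation concrete, here is a minimal sketch in Python. The numbers (`IQ_MEAN`, `IQ_STD`, `W`, `BIAS`) and the function names are made up for illustration only; they are not estimated from any data.

```python
import math

# Hypothetical parameters, chosen only for illustration.
IQ_MEAN, IQ_STD = 100.0, 15.0   # P(A): Gaussian over the continuous variable
W, BIAS = 0.05, -7.0            # P(B|A): weight and intercept of the logistic model


def p_a(a):
    """Gaussian density P(A = a)."""
    z = (a - IQ_MEAN) / IQ_STD
    return math.exp(-0.5 * z * z) / (IQ_STD * math.sqrt(2.0 * math.pi))


def p_b_given_a(b, a):
    """P(B = b | A = a) for binary b, with A entering through a linear term."""
    p1 = 1.0 / (1.0 + math.exp(-(W * a + BIAS)))
    return p1 if b == 1 else 1.0 - p1


def joint(a, b):
    """P(A = a, B = b) = P(A = a) * P(B = b | A = a)."""
    return p_a(a) * p_b_given_a(b, a)
```

Summing `joint(a, 0)` and `joint(a, 1)` gives back `p_a(a)`, which is a quick sanity check that the two pieces fit together.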

For example, you wonder about the relation between IQ (\(A\)) and having a top job (\(B\)). The distribution of IQ \(P(A)\) can be easily estimated from the population, but the conditional distribution of having a top job given your IQ (and nothing else) \(P(B|A)\) requires a bit of thinking. So you may want to consider IQ as a feature and estimate its weight. You may find some positive weight there, but it may not be as strong as you might wish. You may then question such a linear relationship and start modelling a polynomial equation, or chopping up the IQ into ranges (oops, this creates another issue, really). For example, it could be the case that after a certain IQ threshold, job success does not really depend on IQ anymore (possibly not every job is like that, especially in theoretical physics and maths).

To compute the conditional distribution of IQ given job success, we need Bayes' rule. Call the result \(Q(A|B)\): \(Q(A|B) = P(A)P(B|A) / P(B)\), where \(P(B) = \int P(a)P(B|a)\,da\) is a normalising factor.
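Since the normalising factor is an integral over IQ, one simple way to get \(Q(A|B)\) is to evaluate everything on a grid. This is only a sketch: the parameters, the grid range and the Riemann-sum approximation are my assumptions, not fitted values.

```python
import math

# Hypothetical parameters, for illustration only.
IQ_MEAN, IQ_STD = 100.0, 15.0
W, BIAS = 0.05, -7.0


def gaussian_pdf(a):
    """P(A = a): Gaussian density over IQ."""
    z = (a - IQ_MEAN) / IQ_STD
    return math.exp(-0.5 * z * z) / (IQ_STD * math.sqrt(2.0 * math.pi))


def p_top_job(a):
    """P(B = 1 | A = a): logistic function of IQ."""
    return 1.0 / (1.0 + math.exp(-(W * a + BIAS)))


STEP = 0.1
grid = [40.0 + STEP * i for i in range(1201)]  # IQ values from 40 to 160

# Numerator of Bayes' rule, P(a) * P(B = 1 | a), at each grid point.
numerator = [gaussian_pdf(a) * p_top_job(a) for a in grid]

# Normalising factor P(B = 1), approximated by a Riemann sum over the grid.
p_b = sum(numerator) * STEP

# Q(A | B = 1) on the grid, and its mean.
q = [n / p_b for n in numerator]
q_mean = sum(a * qa for a, qa in zip(grid, q)) * STEP
```

With a positive weight, `q_mean` comes out above `IQ_MEAN`: conditioning on a top job shifts the IQ distribution upwards, as you would expect.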

Alternatively, you can use the same rule, but differently: \(P(A,B) = P(B)P(A|B)\). Again \(P(B)\), the proportion of working people with top jobs, can be estimated from some national statistics. The IQ distribution among successful people, \(P(A|B)\), is then modelled directly. You may find that \(P(A|B)\) is not even a Gaussian but rather skewed to the right. I don't know for sure, just a guess.

It is interesting to compare \(Q(A|B)\) and \(P(A|B)\) obtained from the two methods. In a perfect world, they should be the same, but in practice they may not be. This is because the two routes make different assumptions in the model specification, and the estimation is unreliable. This is one problem.

Another problem with this approach is that it is limited to pairs and cannot be easily generalised to more than two types. Moreover, the number of pairwise models is quadratic in the number of variables. And finally, it does not offer an easy way for further analysis using existing tools (such as visualisation in 2D or 3D).
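To see how quickly the pairs pile up, here is a tiny sketch that counts one model per unordered pair of variables. The function name is mine, and counting unordered pairs is an assumption (if you model both directions of each pair, double the numbers).

```python
from itertools import combinations


def count_pairwise_models(n_variables):
    """Number of unordered pairs among n variables, i.e. n * (n - 1) / 2."""
    return sum(1 for _ in combinations(range(n_variables), 2))


counts = {n: count_pairwise_models(n) for n in (2, 10, 100)}
# counts == {2: 1, 10: 45, 100: 4950}
```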

A better way is to imagine there exist some hidden variables that govern the generation of these types. Given these hidden variables, the types are conditionally independent, and thus we don't really have to model the interaction among types. We do, however, have to model the interaction between each type and the hidden variables. This is the idea behind our recent attempts: mixed-variate restricted Boltzmann machines and Thurstonian Boltzmann machines. These are the subjects of subsequent posts.
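As a toy illustration of the hidden-variable idea (this is not a Boltzmann machine, just a one-dimensional Gaussian hidden variable with made-up coefficients), the sketch below generates a continuous and a binary variable that never interact directly, only through the hidden variable `h`.

```python
import math
import random


def sample_pair(rng):
    """Draw (a, b) that are independent given a hidden variable h."""
    h = rng.gauss(0.0, 1.0)                       # hidden variable
    a = 100.0 + 10.0 * h + rng.gauss(0.0, 10.0)   # continuous type depends on h only
    p = 1.0 / (1.0 + math.exp(-(2.0 * h - 2.0)))  # binary type depends on h only
    b = 1 if rng.random() < p else 0
    return a, b


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_pair(rng) for _ in range(20000)]

a_given_1 = [a for a, b in samples if b == 1]
a_given_0 = [a for a, b in samples if b == 0]
mean_a_given_1 = sum(a_given_1) / len(a_given_1)
mean_a_given_0 = sum(a_given_0) / len(a_given_0)
```

Even though `a` and `b` are generated independently once `h` is fixed, `mean_a_given_1` ends up higher than `mean_a_given_0`: the dependence between the two types comes entirely from the shared hidden variable.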