Finding the Posterior: Generative Models

Motivations and Optimisation

No, not that posterior, this one: \(p(v|u)\). The probability of causes given outcomes. Finding such a probability, or even estimating it, lies at the crux of many learning algorithms. And one approach to finding such a posterior is via the use of generative models.

We can motivate this by first considering the fact that learning recognition densities de novo for many dynamic processes is often very difficult. This is generally because the underlying generative processes are not only almost always highly non-linear, but also because they often involve a non-invertible mixing of causes to create outcomes.

Many traditional machine learning methods seek to obtain this recognition density (or some point estimate of it) through the use of feedforward networks (MLPs, CNNs, etc.). Such feedforward networks are now understood to work as very flexible function approximators, seeking a particular \(y = f(x)\) for which a loss quantity is minimised. Such loss quantities can themselves sometimes be motivated by assumptions about the underlying recognition density. Unfortunately, training such models to generalise well without over- or under-fitting is a difficult task, and often requires large amounts of data. Furthermore, training such models requires knowledge of the true causes \(v\) of the dataset, and therefore this class of techniques falls under supervised learning.

An alternative approach to obtaining the recognition density is to instead focus on modelling the underlying generative process itself, and in so doing make it easier to find the true underlying recognition density. What we seek to do, then, is to construct a generative model, \(p(u|v;\theta)\), and subsequently find the posterior density \(p(v|u;\theta)\) implied by such a model.

For now, let us ignore how one finds this generative model, and instead focus on how to obtain the posterior (recognition) density from it. So how does one do this last step exactly? The most obvious avenue would of course be to exploit Bayes' theorem. This has just one problem, however: Bayes' theorem requires us to calculate the marginal likelihood, \(p(u)\), which involves calculating an intractable integral over all possible causes \(v\).
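Written out, Bayes' theorem makes the difficulty explicit: the numerator is easy to evaluate under a generative model, but the normalising integral in the denominator is not:

$$ p(v|u;\theta) = \frac{p(u|v;\theta)\,p(v;\theta)}{p(u;\theta)}, \qquad p(u;\theta) = \int p(u|v;\theta)\,p(v;\theta)\,dv $$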

Fortunately, we can finesse this problem by instead focussing our efforts on modelling an approximate posterior density, \(q(v|u;\phi)\), and seeking to make this approximate density as close to the real posterior density as possible. We do this by minimising a quantity called the Kullback–Leibler (KL) divergence, which measures the degree to which one probability distribution differs from another: \(KL(q(v|u;\phi)\,||\,p(v|u,m;\theta))\).
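As a minimal sketch of the KL divergence itself (a toy discrete example with made-up numbers, not anything specific to the model above):

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as lists of probabilities.

    Terms with q_i = 0 contribute nothing (by the convention 0 log 0 = 0).
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.6, 0.3, 0.1]
p = [0.5, 0.25, 0.25]
print(kl_divergence(q, p))   # positive: the distributions differ
print(kl_divergence(p, p))   # 0.0: a distribution compared with itself
```

Note that the divergence is not symmetric: \(KL(q||p) \neq KL(p||q)\) in general, which is why the direction chosen above matters.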

This alone does not solve our problem, since the KL divergence still requires us to know the posterior density. But this problem can be solved when we return to consider how we find a generative model that best describes or approximates the true underlying generative process. We do this by maximising the "model evidence", i.e. the marginal likelihood of the data given the model, \(p(u|m)\), which, you guessed it, also involves an intractable integral over causes \(v\).

It appears that we have now found ourselves at two suspiciously similar-looking culs-de-sac. Fortunately for us, as is often true, when life closes two doors, it opens another entirely. We can combine these two intractabilities to make one tractability, in a process akin to mathematical alchemy. To do so, we adapt both our generative model and the approximate recognition density by seeking to maximise the same quantity. A quantity which, if you spend long enough looking at the literature, pops up quite often, and often with very different names (negative free energy, the evidence lower bound (ELBO), etc.).

Explicitly, this quantity is as follows:

$$ F(u) = \log p(u|m) - KL(q(v|u;\phi)\,||\,p(v|u,m;\theta)) $$

This quantity can often be reformulated in ways that are tractable. Here, I have made the dependence of the model evidence on the model explicit, as well as making it explicit that \(p(v|u,m;\theta)\) is the recognition density implied by our generative model \(m\).
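One common tractable rearrangement (the form typically optimised in practice, e.g. in variational autoencoders) writes \(F\) as an expected log-likelihood under \(q\) minus a KL divergence to the prior:

$$ F(u) = \mathbb{E}_{q(v|u;\phi)}\left[\log p(u|v;\theta)\right] - KL(q(v|u;\phi)\,||\,p(v|m;\theta)) $$

Neither term requires the true posterior: the first needs only samples from \(q\), and the second is often available in closed form.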

By seeking to maximise this quantity, we do two things. First, we ensure that our generative model is one that maximises its model evidence. Second, we ensure that our approximate recognition density is one that resembles the true posterior density. This works because the KL divergence is by definition non-negative, and so the quantity \(F\) always acts as a lower bound on the model evidence.
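The lower-bound property can be checked numerically on a toy model (a made-up example with a binary cause and a binary outcome, where the sum over causes is trivially tractable):

```python
import math

# Hypothetical toy generative model: binary cause v, binary outcome u.
prior = {0: 0.7, 1: 0.3}                      # p(v)
likelihood = {0: {0: 0.9, 1: 0.1},            # p(u|v=0)
              1: {0: 0.2, 1: 0.8}}            # p(u|v=1)

def log_evidence(u):
    """log p(u|m), via the (here trivially tractable) sum over causes."""
    return math.log(sum(likelihood[v][u] * prior[v] for v in prior))

def free_energy(u, q):
    """F(u) = E_q[log p(u, v)] - E_q[log q(v)] for a discrete q over v."""
    return sum(q[v] * (math.log(likelihood[v][u] * prior[v]) - math.log(q[v]))
               for v in q if q[v] > 0)

u = 1
# An arbitrary approximate posterior strictly underestimates the evidence...
assert free_energy(u, {0: 0.5, 1: 0.5}) < log_evidence(u)
# ...while the true posterior attains the bound exactly (KL term vanishes).
post = {v: likelihood[v][u] * prior[v] / math.exp(log_evidence(u)) for v in prior}
assert abs(free_energy(u, post) - log_evidence(u)) < 1e-9
```

Raising \(F\) towards \(\log p(u|m)\) therefore squeezes the KL term towards zero, which is exactly why one optimisation target serves both goals.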

The exact process of optimising both sets of (mutually dependent) parameters \(\theta\) and \(\phi\) with respect to such a quantity is best left to another post entirely.

Disclaimer: The above is my current understanding of the motivation behind, and process of, optimising generative models. As such, it is subject to change as my understanding of the topic develops. It may therefore also have mistakes present. Feel free to email me, or comment with any corrections!