Figure: a belief distribution is a Bayesian interpretation of probability.

During my PhD studies at Uppsala University, I was very interested in understanding the fundamental theory behind various approximate Gaussian process inference methods. Many papers and benchmarks compared experimental accuracy and computational complexity, but the question I had was how to show theoretically that one approximation method is better than another, in the sense that the better approximation gives a result closer to the true Gaussian process posterior distribution. The closeness of an approximate distribution to the true posterior can be measured by the Kullback-Leibler (KL) divergence.

In those discussions, my supervisor Dave suggested that I read the paper A General Framework for Updating Belief Distributions (Bissiri et al.). I found this paper very inspiring:

  1. The paper gives a general belief updating framework (referred to as “the framework” in the following) for scenarios where the true model does not exist or is difficult to evaluate.
  2. The framework helped me understand the fundamental mechanism of combining belief distributions in machine learning (e.g., sequential or distributed belief updating).
  3. The framework can even take non-stochastic data as information to update the belief distribution (e.g., domain knowledge saying that the target variable is close to a certain value).

A general belief updating framework

Let’s consider a statistical inference problem where we want to obtain a belief distribution $\widehat{q}(\theta)$ given data $X$ and a prior $p(\theta)$. The general belief updating framework formulates this as an optimization problem:

\[\widehat{q}(\theta) = \underset{q(\theta)}{\arg\min}\ \int_{\mathcal{\Theta}} l(X ; \theta) q(\theta) \: d\theta + D(q(\theta), p(\theta)),\]

where $l(X ; \theta)$ is a data-dependent loss function and $D(q(\theta), p(\theta))$ is the KL divergence between the belief distribution and the prior.

The above objective function is doing two things:

  1. Fitting the target $\theta$ to the data $X$ by minimizing the expected loss $\int_{\mathcal{\Theta}} l(X ; \theta) q(\theta) \: d\theta$;
  2. Minimizing the KL divergence to the prior, which can be seen as a regularization term.
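
Although the framework is posed as an optimization over all distributions $q(\theta)$, the minimizer of this objective has a closed form, sometimes called the Gibbs posterior (this follows from a standard variational argument and is the central result of Bissiri et al.):

\[\widehat{q}(\theta) = \frac{\exp\{-l(X ; \theta)\}\, p(\theta)}{\int_{\mathcal{\Theta}} \exp\{-l(X ; \theta)\}\, p(\theta) \: d\theta}.\]

The two cases below can be read as instances of this general solution for particular choices of the loss function.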

Case 1: M-closed

When we know the true data-generating model $p( X | \theta )$ and it is not too difficult or computationally expensive to evaluate, we are in what is referred to as the M-closed case. In this case, the obvious choice is to use the negative log-likelihood $-\ln p(X | \theta)$ as the data-dependent loss function. Plugging the negative log-likelihood into the framework, we have

\[\begin{equation} \begin{split} \widehat{q}(\theta) &= \underset{q(\theta)}{\arg\min}\ \int_{\mathcal{\Theta}} -\ln p(X | \theta) q(\theta) \: d\theta + D(q(\theta), p(\theta)) \\ &= \underset{q(\theta)}{\arg\min}\ \int_{\mathcal{\Theta}} -\ln p(X | \theta) q(\theta) \: d\theta + \int_{\mathcal{\Theta}} \ln\frac{q(\theta)}{p(\theta)} q(\theta) \: d\theta \\ &= \underset{q(\theta)}{\arg\min}\ \int_{\mathcal{\Theta}} \ln\frac{q(\theta)}{p(X | \theta)p(\theta)} q(\theta) \: d\theta \\ &= \underset{q(\theta)}{\arg\min}\ D(q(\theta), p(X | \theta) p(\theta)) \\ &\propto p(X | \theta) p(\theta), \end{split} \end{equation}\]

which recovers the posterior distribution of $\theta$ given the data $X$ (Bayes’ theorem).

An interesting thought experiment: what happens if we don’t regularize the data-dependent loss with the KL divergence to the prior? In that case, the updated belief distribution shrinks to a Dirac delta at the minimizer of the loss, which for the negative log-likelihood loss is the maximum likelihood estimate (MLE) of $\theta$. In other words, Bayesian inference is a regularized version of the MLE solution.
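
To see both behaviours numerically, here is a minimal sketch (my own illustration, not code from the paper): on a grid of $\theta$ values, the belief update $q(\theta) \propto \exp\{-l(X ; \theta)\}\, p(\theta)$ with a negative log-likelihood loss and a conjugate Gaussian prior reproduces the closed-form Bayes posterior, while dropping the KL term collapses the solution to the MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                          # assumed known observation noise
X = rng.normal(2.0, sigma, size=50)  # data with true theta = 2.0

theta = np.linspace(-5.0, 5.0, 2001)   # discretized parameter space
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2)        # N(0, 1) prior, unnormalized

# Negative log-likelihood loss, up to a theta-independent constant
loss = 0.5 * ((X[:, None] - theta[None, :]) ** 2).sum(axis=0) / sigma**2

# Gibbs posterior on the grid: q ∝ exp(-loss) * prior
q = np.exp(-(loss - loss.min())) * prior  # subtract min for stability
q /= q.sum() * dtheta                     # normalize on the grid

# Conjugate closed form for a N(0, 1) prior and Gaussian likelihood
n = len(X)
post_var = 1.0 / (1.0 + n / sigma**2)
post_mean = post_var * X.sum() / sigma**2

print("grid posterior mean:", (theta * q).sum() * dtheta)
print("closed-form mean:   ", post_mean)
print("MLE (no KL term):   ", X.mean())
```

The grid mean matches the conjugate posterior mean, and both are shrunk from the MLE toward the prior mean 0, which is exactly the regularization effect discussed above.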

Case 2: M-open

However, often we don’t have the true data-generating model, or the true model is too expensive to evaluate on a large data set. This is the M-open case: $p( X | \theta )$ is only a proxy for the unknown true model, and an appropriate loss function is still the negative log-likelihood under the proxy model. Furthermore, the paper argues that the general belief updating framework resolves the fundamental issue of Bayesian inference in the M-open case: we do not make assumptions about the true model, but only define a data-dependent loss function.

Scalable belief updating

For Gaussian process regression, evaluating $-\ln p( X | \theta )$ has a computational complexity of $\mathcal{O}(N^3)$, due to the inversion (or factorization) of the $N \times N$ covariance matrix. This becomes a problem when we have a large amount of data or limited computational power (e.g., machine learning for IoT).
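
As a concrete illustration (standard Gaussian process algebra, not code from the paper), here is a minimal sketch of evaluating the negative log marginal likelihood of a zero-mean GP with an assumed RBF kernel; the Cholesky factorization of the $N \times N$ kernel matrix is the $\mathcal{O}(N^3)$ bottleneck.

```python
import numpy as np

def gp_neg_log_likelihood(X, y, lengthscale, variance, noise):
    """-ln p(y | X, theta) for a zero-mean GP with an RBF kernel."""
    # N x N kernel matrix: O(N^2) memory
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = variance * np.exp(-0.5 * d2 / lengthscale**2)
    K += noise * np.eye(len(X))
    L = np.linalg.cholesky(K)  # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                       # data-fit term
            + np.log(np.diag(L)).sum()            # 0.5 * log det K
            + 0.5 * len(X) * np.log(2 * np.pi))   # constant

X = np.random.default_rng(0).normal(size=(200, 1))
y = np.sin(X[:, 0])
print(gp_neg_log_likelihood(X, y, lengthscale=1.0, variance=1.0, noise=0.1))
```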

In this case, we can use the composite likelihood $\prod_{n=1}^{N} p( X_n | \theta )$ as our proxy model, where $X_n$ is a smaller block of the large data set. This enables two ways of scalable belief updating:

  1. Sequential belief updating: we start with the prior and the first block of data, compute the first updated belief distribution, and then use that updated belief as the new prior to combine with the second block of data, and so on. This is not a new idea; see the book Bayesian filtering and smoothing by Särkkä for more details.

  2. Distributed belief updating: we split the data into $N$ blocks and assign each to an agent; the agents update their local belief distributions in parallel, and the local beliefs are then combined in a network centre. This is not a new idea either; see the paper A Bayesian committee machine by Tresp. What is interesting is that previous papers on the Bayesian committee machine seem to have had a hard time figuring out the optimal weights for combining the local beliefs. Using the general belief updating framework, it is easy to see that the prior should be down-weighted based on the number of agents to prevent the conservativeness issue, where the combined belief distribution degenerates to the prior as the number of agents goes to infinity; see the sketch after this list.
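
To make the prior down-weighting concrete, here is a minimal sketch (my own illustration, assuming a conjugate Gaussian model so everything stays in closed form): each of $M$ agents raises the prior to the power $1/M$ before its local update, so that the product of the local beliefs counts the prior exactly once and recovers the full-data posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, M = 1.0, 5
blocks = [rng.normal(2.0, sigma, size=20) for _ in range(M)]

prior_mean, prior_prec = 0.0, 1.0  # N(0, 1) prior on theta

# Local updates: each agent uses the down-weighted prior p(theta)^(1/M),
# i.e. a Gaussian with precision prior_prec / M
local = []
for Xk in blocks:
    prec_k = prior_prec / M + len(Xk) / sigma**2
    mean_k = (prior_prec / M * prior_mean + Xk.sum() / sigma**2) / prec_k
    local.append((mean_k, prec_k))

# Combine in the network centre: product of Gaussians
# (precisions add, precision-weighted means add)
prec = sum(p for _, p in local)
mean = sum(m * p for m, p in local) / prec

# Reference: posterior computed from all data at once
X = np.concatenate(blocks)
prec_full = prior_prec + len(X) / sigma**2
mean_full = (prior_prec * prior_mean + X.sum() / sigma**2) / prec_full

print(mean, mean_full)  # identical up to floating point
print(prec, prec_full)
```

If each agent used the full prior instead, the product of local beliefs would contain $p(\theta)^M$, which is exactly the conservativeness issue: as $M$ grows, the combined belief is pulled ever harder toward the prior.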

Using non-statistical information

Sometimes we get a piece of information from a domain expert, such as “$\theta$ is close to 0”. The paper shows how we can define a loss function based on this information (for example, $w\theta^2$ with $w > 0$) and plug it into the general belief updating framework. This is very interesting to me, as we often get this kind of information when applying machine learning in the business world.
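
For a concrete instance (standard Gaussian algebra; the Gaussian prior here is my assumption for tractability), suppose the prior is $\mathcal{N}(\theta; \mu_0, \sigma_0^2)$. Plugging the loss $w\theta^2$ into the framework gives

\[\widehat{q}(\theta) \propto \exp\{-w\theta^2\}\, \mathcal{N}(\theta; \mu_0, \sigma_0^2) = \mathcal{N}\!\left(\theta;\ \frac{\mu_0 / \sigma_0^2}{1/\sigma_0^2 + 2w},\ \left(\frac{1}{\sigma_0^2} + 2w\right)^{-1}\right),\]

so the expert information shrinks the belief mean toward $0$ and increases its precision by $2w$; the weight $w$ controls how much we trust the expert.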

Conclusion

All models are wrong, but some are useful. From the belief updating point of view, a model is just a loss function connecting the target and the data. We can regularize that loss with a prior belief; otherwise, the belief updating falls back to maximum likelihood. The general belief updating framework presented in this paper has changed my understanding of statistical inference, or rather, it has updated my beliefs :)

References

[1] Bissiri, Pier Giovanni, Chris C. Holmes, and Stephen G. Walker. “A general framework for updating belief distributions.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78.5 (2016): 1103-1130.

[2] Bernardo, José M., and Adrian F. M. Smith. Bayesian Theory. Vol. 405. John Wiley & Sons, 2009.

[3] Tresp, Volker. “A Bayesian committee machine.” Neural computation 12.11 (2000): 2719-2741.

[4] Quinonero-Candela, Joaquin, and Carl Edward Rasmussen. “A unifying view of sparse approximate Gaussian process regression.” The Journal of Machine Learning Research 6 (2005): 1939-1959.