Or: parameter uncertainty isn’t as important as data and model uncertainty when it comes to forecasting.
Prediction
Generalised Bayesian inference
Divergence
Author
Yann McLatchie
Published
April 11, 2025
Note
This is the dynamic accompaniment to my paper with Edwin Fong, David T. Frazier, and Jeremias Knoblauch (McLatchie et al. 2024). It includes two things which the paper didn’t: a dynamic plot which I used as an excuse to learn some Observable JS, and a triangle inequality argument.
In a recent paper, we looked at so-called power posteriors. These are posteriors where the likelihood is raised to some power, the temperature \(\tau\), to remedy some of the failings of standard Bayesian inference. For a prior \(\pi\), statistical model \(f_\theta\), and data \(x_{1:n}\), they take the form
\[
\pi_n^{(\tau)}(\theta\mid x_{1:n}) \;\propto\; \pi(\theta)\, f_\theta(x_{1:n})^{\tau}.
\]
Most of the literature interests itself in the fundamental question “how should I choose \(\tau\)?” Indeed, it has been chosen in the past to regulate various forms of calibration (Syring and Martin 2019)1, to match prior expected information gain (Holmes and Walker 2017), for differential privacy (Wang, Fienberg, and Smola 2015), for a non-conventional type of robustness (Miller and Dunson 2019), and for a host of other properties and downstream tasks. In this paper, we instead look to choose \(\tau\) as a function of the posterior predictive distribution it induces:2
\[
p_n^{(\tau)}(\cdot\mid x_{1:n}) \;=\; \int f_\theta(\cdot\mid x_{1:n})\,\mathrm{d}\pi_n^{(\tau)}(\theta\mid x_{1:n}).
\]
Our paper asks the question “is it possible to choose \(\tau\) to optimise for predictive performance?” and ultimately provides the answer: “no, not really.” Concretely, we show that as \(n\) increases, \(p_n^{(\tau)}(\cdot\mid x_{1:n})\) and the plug-in predictive \(f_{\hat \theta_n}(\cdot\mid x_{1:n})\) become uniformly close over \(\tau\). As a result, even for moderate sample sizes, varying \(\tau\) does not meaningfully improve predictive performance.
Lemma 1 in the paper makes this precise: as long as the posterior distribution \(\pi_n^{(\tau)}\) concentrates around the population-optimal \(\theta^\star\) at some rate \(\varepsilon_n\downarrow0\) with \(n\varepsilon_n^2/M_{\varepsilon_n}\uparrow\infty\),3 and \(f_\theta\) satisfies a weak differentiability condition, then for \(\tau\) taken on a positive and compact interval \([\underline{\tau},\overline{\tau}]\) the power posterior predictive and the plug-in predictive merge uniformly:
\[
\sup_{\tau\in[\underline{\tau},\,\overline{\tau}]} d_{\mathrm{TV}}\bigl\{p_n^{(\tau)}(\cdot\mid x_{1:n}),\; f_{\hat\theta_n}(\cdot\mid x_{1:n})\bigr\} \;\to\; 0
\]
rapidly as \(n\) grows.
For \(\mathbb{P}\) denoting the true data-generating measure, we would ideally have liked to choose \(\tau\) to minimise \(d_{\mathrm{TV}}\{\mathbb{P},\,p_n^{(\tau)}(\cdot\mid x_{1:n})\}\). Naturally, this is infeasible in practice since we never know \(\mathbb{P}\). But even if it were possible, we can use Lemma 1 above to demonstrate why the improvements afforded by tuning \(\tau\) vanish rapidly with \(n\). By the triangle inequality, we can bound this ideal target by
\[
d_{\mathrm{TV}}\{\mathbb{P},\,p_n^{(\tau)}(\cdot\mid x_{1:n})\} \;\leq\; d_{\mathrm{TV}}\bigl\{p_n^{(\tau)}(\cdot\mid x_{1:n}),\, f_{\hat\theta_n}(\cdot\mid x_{1:n})\bigr\} \;+\; d_{\mathrm{TV}}\bigl\{\mathbb{P},\, f_{\hat\theta_n}(\cdot\mid x_{1:n})\bigr\}.
\]
Only the first term on the right-hand side of this display depends on \(\tau\), and Lemma 1 shows that it decays to zero extremely rapidly for \(\tau\) in any positive and compact interval, while the second term is independent of \(\tau\). Thus \(\tau\) influences the bound only through a term that quickly becomes negligible. And since model misspecification only affects the magnitude of the second term, the so-called “misspecification error”, our findings apply equally to well-specified and misspecified settings.
Consequently, choosing the temperature to optimise predictive performance is an ill-defined problem. Our results constitute formal evidence of the common folklore that, in terms of predictive accuracy, parameter uncertainty is of second-order importance relative to data and model uncertainty. In particular, predictive distributions obtained with different temperatures, and thus different assessments of posterior uncertainty, merge in line with posterior concentration. This is not without precedent: the same reasoning is often employed informally to justify using posteriors from approximate and simulation-based inference to form forecast distributions, since even though they produce slightly different measures of posterior uncertainty, their predictives quickly become indistinguishable.
We can in fact demonstrate this empirically. Consider a normal location model: for \(n = 10\) we sample \(x_{1:10}\sim N(0,1)\), yielding an empirical mean of, say, \(\bar x = -0.15\). We compute the posterior and posterior predictive distributions under a standard Gaussian prior \(\pi(\theta) = N(\theta;\,0,1)\) for a given value of \(\tau\). Below, you can vary that value of \(\tau\).
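In this conjugate setting everything is available in closed form; with \(\mu_0 = 0\) and \(\sigma_0^2 = \sigma^2 = 1\) as above, the tempered posterior and its predictive are (this is exactly what the plotting code below computes)
\[
\pi_n^{(\tau)}(\theta\mid x_{1:n}) = N\bigl(\theta;\,\mu_n,\,\sigma_n^2\bigr),\qquad
\sigma_n^2 = \Bigl(\tfrac{n\tau}{\sigma^2} + \tfrac{1}{\sigma_0^2}\Bigr)^{-1},\qquad
\mu_n = \sigma_n^2\Bigl(\tfrac{\mu_0}{\sigma_0^2} + \tfrac{n\tau\,\bar x}{\sigma^2}\Bigr),
\]
with \(p_n^{(\tau)}(x\mid x_{1:n}) = N(x;\,\mu_n,\,\sigma^2 + \sigma_n^2)\).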
```js
viewof tau = Inputs.range([0.01, 5], {
  label: "Temperature (tau)",
  value: 0.1,
  step: 0.01
})

// Gaussian pdf
function normalPDF(x, mean, sigma2) {
  const coeff = 1 / Math.sqrt(sigma2 * 2 * Math.PI);
  const exponent = -((x - mean) ** 2) / (2 * sigma2);
  return coeff * Math.exp(exponent);
}

// return an array of points with posterior and predictive densities
data = () => {
  // constants of the experiment
  const mu_0 = 0;      // prior mean
  const sigma_0 = 1;   // prior standard deviation
  const sigma = 1;     // likelihood standard deviation
  const x_bar = -0.15; // empirical mean of the sample
  const n = 10;        // sample size

  const points = [];
  for (let x = -3; x <= 3; x += 0.01) {
    // conjugate update with the tempered likelihood
    const sigma_n2 = 1 / ((n * tau) / sigma ** 2 + 1 / sigma_0 ** 2);
    const mu_n = sigma_n2 * (mu_0 / sigma_0 ** 2 + (n * tau * x_bar) / sigma ** 2);
    points.push({
      x: x,
      theta: x,
      "posterior density": normalPDF(x, mu_n, sigma_n2),
      "predictive density": normalPDF(x, mu_n, sigma_n2 + sigma ** 2)
    });
  }
  return points;
}

// render plot
Plot.plot({
  subtitle: "Posterior distribution",
  aspectRatio: 1,
  height: 200,
  marks: [Plot.line(data(), { x: "theta", y: "posterior density" })]
})
```
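The cell above only draws the posterior; as a small companion sketch (not part of the original notebook, but reusing the `data()` helper and `tau` slider defined above), the corresponding posterior predictive can be rendered in the same way:

```js
// Companion cell (sketch): plot the tempered posterior predictive
// computed by data() above, on the same grid of x values.
Plot.plot({
  subtitle: "Posterior predictive distribution",
  aspectRatio: 1,
  height: 200,
  marks: [Plot.line(data(), { x: "x", y: "predictive density" })]
})
```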
While for very (very) small values of \(\tau\), the specific choice of \(\tau\) affects the form of the posterior predictive, there exists an infinite range of \(\tau\) (in this case \(\tau \gtrsim 0.5\)) inducing functionally equivalent posterior predictive distributions.
Important
The uncertainty in the posterior predictive distribution, \(\int f_\theta(\cdot\mid x_{1:n})\,\mathrm{d}\pi_n^{(\tau)}(\theta\mid x_{1:n})\), is dominated by the uncertainty coming from the data and the model choice, \(f_\theta(\cdot\mid x_{1:n})\), rather than by the uncertainty in the posterior \(\pi_n^{(\tau)}\), and thus by the choice of \(\tau\).
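In the normal location example above this is plain arithmetic: the predictive variance is \(\sigma_n^2 + \sigma^2\), and the only \(\tau\)-dependent part, \(\sigma_n^2 = 1/(10\tau + 1)\), is already small next to the data variance \(\sigma^2 = 1\). For instance,
\[
\tau = 0.5:\ \sigma_n^2 + \sigma^2 = \tfrac{1}{6} + 1 \approx 1.17,
\qquad
\tau = 5:\ \sigma_n^2 + \sigma^2 = \tfrac{1}{51} + 1 \approx 1.02,
\]
so a ten-fold change in \(\tau\) moves the predictive standard deviation only from about \(1.08\) to \(1.01\).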
The rest of our paper discusses how this result can be shown in expectation over data samples, and under different distances. I’ve also given a talk on the paper which is freely available on YouTube.
References
Holmes, C. C., and S. G. Walker. 2017. “Assigning a Value to a Power Likelihood in a General Bayesian Model.” Biometrika 104 (2): 497–503. https://doi.org/10.1093/biomet/asx010.
McLatchie, Yann, Edwin Fong, David T. Frazier, and Jeremias Knoblauch. 2024. “Predictive Performance of Power Posteriors.” arXiv. http://arxiv.org/abs/2408.08806.
Miller, Jeffrey W., and David B. Dunson. 2019. “Robust Bayesian Inference via Coarsening.” Journal of the American Statistical Association 114 (527): 1113–25. https://doi.org/10.1080/01621459.2018.1469995.
Syring, Nicholas, and Ryan Martin. 2019. “Calibrating General Posterior Credible Regions.” Biometrika 106 (2): 479–86. https://doi.org/10.1093/biomet/asy054.
Wang, Yu-Xiang, Stephen Fienberg, and Alex Smola. 2015. “Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo.” In Proceedings of the 32nd International Conference on Machine Learning, edited by Francis Bach and David Blei, 37:2493–2502. Proceedings of Machine Learning Research. Lille, France: PMLR. https://proceedings.mlr.press/v37/wangg15.html.
Footnotes
Ryan Martin spoke about this in a recent talk at the post-Bayes seminar series I help organise; check it out!
We condition on \(x_{1:n}\) just to emphasise that our results make no assumption that the data are iid.
In this post I’ll only consider a fixed \(\tau\), but our results can be extended to sequences of \(\tau_n\downarrow0\) so long as \(n\tau_n\varepsilon_n^2\uparrow\infty\) and thus still allow for posterior concentration.
Citation
BibTeX citation:
@online{mclatchie2025,
author = {Yann McLatchie},
title = {Predictive Performance of Power Posteriors},
date = {2025-04-11},
url = {https://yannmclatchie.github.io/blog/posts/power-posterior},
langid = {en}
}