Or: parameter uncertainty isn’t as important as data and model uncertainty when it comes to forecasting.
Prediction
Generalised Bayesian inference
Divergence
Author
Yann McLatchie
Published
April 11, 2025
Note
This is the dynamic accompaniment to my paper with Edwin Fong, David T. Frazier, and Jeremias Knoblauch (McLatchie et al. 2024). It includes two things which the paper didn’t: a dynamic plot which I used as an excuse to learn some Observable JS, and a triangle inequality argument.
In a recent paper, we looked at so-called power posteriors. These are posteriors where the likelihood is raised to some power, the temperature \(\tau\), to remedy some of the failings of standard Bayesian inference. For a prior \(\pi\), statistical model \(f_\theta\), and data \(x_{1:n}\), they take the form
\[
\pi_n^{(\tau)}(\theta\mid x_{1:n}) \;\propto\; \pi(\theta)\, f_\theta(x_{1:n})^{\tau}.
\]
Most of the literature interests itself in the fundamental question “how should I choose \(\tau\)?” Indeed, it has been chosen in the past to regulate various forms of calibration (Syring and Martin 2019)1, to match prior expected information gain (Holmes and Walker 2017), for differential privacy (Wang, Fienberg, and Smola 2015), for a non-conventional type of robustness (Miller and Dunson 2019), and for a host of other properties and downstream tasks. In this paper, we instead look to choose \(\tau\) as a function of the posterior predictive distribution it induces:2
\[
p_n^{(\tau)}(\cdot\mid x_{1:n}) \;=\; \int f_\theta(\cdot\mid x_{1:n})\,\mathrm{d}\pi_n^{(\tau)}(\theta\mid x_{1:n}).
\]
Our paper asks the question “is it possible to choose \(\tau\) to optimise for predictive performance?” and ultimately provides the answer: “no, not really.” Concretely, we show that as \(n\) increases, \(p_n^{(\tau)}(\cdot\mid x_{1:n})\) and the plug-in predictive \(f_{\hat \theta_n}(\cdot\mid x_{1:n})\) become uniformly close over \(\tau\). As a result, even for moderate sample sizes, varying \(\tau\) does not meaningfully improve predictive performance.
Lemma 1 in the paper makes this precise: as long as the posterior distribution \(\pi_n^{(\tau)}\) concentrates around the population-optimal \(\theta^\star\) at some rate \(\varepsilon_n\downarrow0\) with \(n\varepsilon_n^2/M_{\varepsilon_n}\uparrow\infty\),3 and \(f_\theta\) satisfies a weak differentiability condition, then for \(\tau\) taken on a positive and compact interval \([\underline{\tau},\overline{\tau}]\) the power posterior predictive and the plug-in predictive merge uniformly:
\[
\sup_{\tau\in[\underline{\tau},\,\overline{\tau}]} d_{\mathrm{TV}}\bigl\{p_n^{(\tau)}(\cdot\mid x_{1:n}),\; f_{\hat\theta_n}(\cdot\mid x_{1:n})\bigr\} \;\to\; 0
\]
rapidly as \(n\) grows.
For \(\mathbb{P}\) denoting the true data-generating measure, we would ideally have liked to choose \(\tau\) to minimise \(d_{\mathrm{TV}}\{\mathbb{P},\,p_n^{(\tau)}(\cdot\mid x_{1:n})\}\). Naturally, this is infeasible in practice since we never know \(\mathbb{P}\). But even if it were possible, we can use Lemma 1 above to demonstrate why the improvements afforded by tuning \(\tau\) vanish rapidly with \(n\). By the triangle inequality, we can bound this ideal target by
\[
d_{\mathrm{TV}}\{\mathbb{P},\,p_n^{(\tau)}(\cdot\mid x_{1:n})\} \;\leq\; d_{\mathrm{TV}}\bigl\{p_n^{(\tau)}(\cdot\mid x_{1:n}),\, f_{\hat\theta_n}(\cdot\mid x_{1:n})\bigr\} \;+\; d_{\mathrm{TV}}\bigl\{\mathbb{P},\, f_{\hat\theta_n}(\cdot\mid x_{1:n})\bigr\}.
\]
Only the first term on the right-hand side of this display depends on \(\tau\), and Lemma 1 shows that it decays to zero extremely rapidly for \(\tau\) in any positive and compact interval, while the second term is independent of \(\tau\). Thus \(\tau\) influences the bound only through a term that quickly becomes negligible. And since model misspecification only affects the magnitude of the second term, the so-called “misspecification error”, our findings apply equally to well-specified and misspecified settings.
Consequently, choosing the temperature to optimise predictive performance is an ill-defined problem. Our results constitute formal evidence of the common folklore that, in terms of predictive accuracy, parameter uncertainty is of second-order importance relative to data and model uncertainty. In particular, predictive distributions obtained with different temperatures, and thus different assessments of posterior uncertainty, merge in line with posterior concentration. This is not without precedent: the same reasoning is often employed informally to justify using posteriors from approximate and simulation-based inference to form forecast distributions, since even though they produce slightly different measures of posterior uncertainty, their predictives quickly become indistinguishable.
We can in fact demonstrate this empirically. Consider a normal location model: for \(n = 10\) we sample \(x_{1:10}\sim N(0,1)\), yielding an empirical mean of, say, \(\bar x = -0.15\). We compute the posterior and posterior predictive distributions under a standard Gaussian prior \(\pi(\theta) = N(\theta;\,0,1)\) for a given value of \(\tau\). Below, you can vary that value of \(\tau\).
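In this conjugate setting everything is available in closed form; with \(\mu_0 = 0\) and \(\sigma_0^2 = \sigma^2 = 1\) as above, the tempered posterior and its predictive are (this is exactly what the plotting code below computes)
\[
\pi_n^{(\tau)}(\theta\mid x_{1:n}) = N\bigl(\theta;\,\mu_n,\,\sigma_n^2\bigr),\qquad
\sigma_n^2 = \Bigl(\tfrac{n\tau}{\sigma^2} + \tfrac{1}{\sigma_0^2}\Bigr)^{-1},\qquad
\mu_n = \sigma_n^2\Bigl(\tfrac{\mu_0}{\sigma_0^2} + \tfrac{n\tau\,\bar x}{\sigma^2}\Bigr),
\]
with \(p_n^{(\tau)}(x\mid x_{1:n}) = N(x;\,\mu_n,\,\sigma^2 + \sigma_n^2)\).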
```js
viewof tau = Inputs.range([0.01, 5], {
  label: "Temperature (tau)",
  value: 0.1,
  step: 0.01
})

// Gaussian pdf
function normalPDF(x, mean, sigma2) {
  const coeff = 1 / Math.sqrt(sigma2 * 2 * Math.PI);
  const exponent = -((x - mean) ** 2) / (2 * sigma2);
  return coeff * Math.exp(exponent);
}

// return an array of points with posterior and predictive densities
data = () => {
  // constants of the experiment
  const mu_0 = 0;      // prior mean
  const sigma_0 = 1;   // prior standard deviation
  const sigma = 1;     // likelihood standard deviation
  const x_bar = -0.15; // empirical mean of the sample
  const n = 10;        // sample size

  const points = [];
  for (let x = -3; x <= 3; x += 0.01) {
    // conjugate update with the tempered likelihood
    const sigma_n2 = 1 / ((n * tau) / sigma ** 2 + 1 / sigma_0 ** 2);
    const mu_n = sigma_n2 * (mu_0 / sigma_0 ** 2 + (n * tau * x_bar) / sigma ** 2);
    points.push({
      x: x,
      theta: x,
      "posterior density": normalPDF(x, mu_n, sigma_n2),
      "predictive density": normalPDF(x, mu_n, sigma_n2 + sigma ** 2)
    });
  }
  return points;
}

// render plot
Plot.plot({
  subtitle: "Posterior distribution",
  aspectRatio: 1,
  height: 200,
  marks: [Plot.line(data(), { x: "theta", y: "posterior density" })]
})
```
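The cell above only draws the posterior; as a small companion sketch (not part of the original notebook, but reusing the `data()` helper and `tau` slider defined above), the corresponding posterior predictive can be rendered in the same way:

```js
// Companion cell (sketch): plot the tempered posterior predictive
// computed by data() above, on the same grid of x values.
Plot.plot({
  subtitle: "Posterior predictive distribution",
  aspectRatio: 1,
  height: 200,
  marks: [Plot.line(data(), { x: "x", y: "predictive density" })]
})
```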
While for very (very) small values of \(\tau\), the specific choice of \(\tau\) affects the form of the posterior predictive, there exists an infinite range of \(\tau\) (in this case \(\tau \gtrsim 0.5\)) inducing functionally equivalent posterior predictive distributions.
Important
The uncertainty in the posterior predictive distribution, \(\int f_\theta(\cdot\mid x_{1:n})\,\mathrm{d}\pi_n^{(\tau)}(\theta\mid x_{1:n})\), is dominated by the uncertainty coming from the data and the model choice, \(f_\theta(\cdot\mid x_{1:n})\), rather than by the uncertainty in the posterior \(\pi_n^{(\tau)}\), and thus by the choice of \(\tau\).
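In the normal location example above this is plain arithmetic: the predictive variance is \(\sigma_n^2 + \sigma^2\), and the only \(\tau\)-dependent part, \(\sigma_n^2 = 1/(10\tau + 1)\), is already small next to the data variance \(\sigma^2 = 1\). For instance,
\[
\tau = 0.5:\ \sigma_n^2 + \sigma^2 = \tfrac{1}{6} + 1 \approx 1.17,
\qquad
\tau = 5:\ \sigma_n^2 + \sigma^2 = \tfrac{1}{51} + 1 \approx 1.02,
\]
so a ten-fold change in \(\tau\) moves the predictive standard deviation only from about \(1.08\) to \(1.01\).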
The rest of our paper discusses how this result can be shown in expectation over data samples, and under different distances. I’ve also given a talk on the paper which is freely available on YouTube.
References
Holmes, C. C., and S. G. Walker. 2017. “Assigning a Value to a Power Likelihood in a General Bayesian Model.” Biometrika 104 (2): 497–503. https://doi.org/10.1093/biomet/asx010.
McLatchie, Yann, Edwin Fong, David T. Frazier, and Jeremias Knoblauch. 2024. “Predictive Performance of Power Posteriors.” arXiv. http://arxiv.org/abs/2408.08806.
Miller, Jeffrey W., and David B. Dunson. 2019. “Robust Bayesian Inference via Coarsening.” Journal of the American Statistical Association 114 (527): 1113–25. https://doi.org/10.1080/01621459.2018.1469995.
Syring, Nicholas, and Ryan Martin. 2019. “Calibrating General Posterior Credible Regions.” Biometrika 106 (2): 479–86. https://doi.org/10.1093/biomet/asy054.
Wang, Yu-Xiang, Stephen Fienberg, and Alex Smola. 2015. “Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo.” In Proceedings of the 32nd International Conference on Machine Learning, edited by Francis Bach and David Blei, 37:2493–2502. Proceedings of Machine Learning Research. Lille, France: PMLR. https://proceedings.mlr.press/v37/wangg15.html.
Footnotes
Ryan Martin spoke about this in a recent talk at the post-Bayes seminar series I help organise; check it out!
We condition on \(x_{1:n}\) just to emphasise that our results make no assumption that the data are iid.
In this post I’ll only consider a fixed \(\tau\), but our results can be extended to sequences of \(\tau_n\downarrow0\) so long as \(n\tau_n\varepsilon_n^2\uparrow\infty\) and thus still allow for posterior concentration.
Citation
BibTeX citation:
@online{mclatchie2025,
author = {Yann McLatchie},
title = {Predictive Performance of Power Posteriors},
date = {2025-04-11},
url = {https://yannmclatchie.github.io/blog/posts/power-posterior},
langid = {en}
}