OK so full disclosure: I’m sort of addicted to PAC Bayes at the moment. This post won’t do a good job of advertising the totality of its elegance. If that’s what you’re after, read the phenomenal introduction by Alquier (2024). Instead, I’m going to talk a bit about how the prior is discussed in PAC Bayes, and how different that is from the usual notion of an a priori belief.
PAC Bayes from zero to one: a bound in probability
Suppose we have a data sample \(\mathcal{S} = \{X_1,\ldots,X_n\}\). Going forward, I’m going to denote probability and expectation with respect to the data-generating measure as \(\mathbb{P}_\mathcal{S}\) and \(\mathbb{E}_\mathcal{S}\) respectively. For a loss parameterised by \(\theta\), which we denote \(\ell_\theta(X)\), now define an empirical risk computed on these data \(r(\theta) = n^{-1}\sum_{i=1}^n\ell_\theta(X_i)\), and an oracle risk satisfying \(R(\theta) = \mathbb{E}_\mathcal{S}\{r(\theta)\}\). Naturally, this oracle risk is unavailable to us but the empirical risk is computable from data.
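To make the distinction concrete, here's a minimal sketch (the squared loss, the Gaussian data model, and all names are illustrative, not from any reference): under this loss the oracle risk is available in closed form, and the computable empirical risk concentrates around it.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, theta = 1.0, 100_000, 0.5
X = rng.normal(mu, 1.0, size=n)            # the sample S

r_emp = np.mean((X - theta) ** 2)          # empirical risk r(theta)
R_oracle = (mu - theta) ** 2 + 1.0         # oracle risk R(theta) = E_S[r(theta)], known here

assert abs(r_emp - R_oracle) < 0.05        # r(theta) concentrates around R(theta)
```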
We now take \(\pi\) a fixed “prior”1, \(\lambda\) a “learning rate”2, and \(C>0\) a constant which depends on the risk. Assume that our risk satisfies a Hoeffding inequality
\[ \mathbb{E}_{\theta\sim\pi}\mathbb{E}_\mathcal{S}\left[ e^{\lambda\{R(\theta) - r(\theta)\}}\right] \leq e^{\frac{\lambda^2C^2}{n}}. \tag{1}\]
This is satisfied immediately for bounded risks, and Alquier (2024) discusses how it has been extended far beyond that quite restrictive class of risks. The other result we need is the ubiquitous Donsker-Varadhan Lemma: for some fixed reference (or prior) measure \(\pi \in\mathcal{P}(\Theta)\) and any measurable function \(h:\,\Theta\to\mathbb{R}\),
\[ \log \mathbb{E}_{\theta\sim\pi}\left\{ e^{h(\theta)} \right\} = \sup_{\rho\in\mathcal{P}(\Theta)}\left[\mathbb{E}_{\theta\sim\rho}\{h(\theta)\} - d_{\mathrm{KL}}(\rho;\,\pi)\right]. \tag{2}\]
Something becomes quickly apparent when you dive into PAC Bayes literature3: the Donsker-Varadhan Lemma makes the world go round.
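Since the lemma does so much work, it's worth checking numerically. A small sketch on a finite \(\Theta\) (the uniform reference measure and random \(h\) are illustrative): the left-hand side of Equation 2 equals the objective evaluated at the Gibbs measure, and any other \(\rho\) does worse.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50
pi = np.full(K, 1.0 / K)                 # reference measure: uniform on K atoms
h = rng.normal(size=K)                   # an arbitrary function h on Theta

lhs = np.log(np.sum(pi * np.exp(h)))     # log E_pi[e^h]

# The supremum is attained by the Gibbs measure rho* ∝ e^h pi:
w = pi * np.exp(h)
rho_star = w / w.sum()
kl = np.sum(rho_star * np.log(rho_star / pi))
assert np.isclose(np.sum(rho_star * h) - kl, lhs)

# ...and any other rho gives a smaller value of the objective:
rho = rng.dirichlet(np.ones(K))
assert np.sum(rho * h) - np.sum(rho * np.log(rho / pi)) <= lhs + 1e-12
```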
First, it is known that Bayesian inference can be recast as an optimisation problem (Knoblauch, Jewson, and Damoulas 2019):
\[ \pi_h = \operatorname{argmax}_{\rho\in\mathcal{P}(\Theta)}\left[\mathbb{E}_{\theta\sim\rho}\{h(\theta)\} - d_{\mathrm{KL}}(\rho;\,\pi)\right] \]
This result is immediate from the Donsker-Varadhan Lemma: the supremum is attained by the Gibbs measure \(\pi_h(\theta)\propto e^{h(\theta)}\pi(\theta)\), whence choosing \(h(\theta) = \log f_\theta(x_{1:n})\) returns the Bayes posterior and \(h(\theta) = -\tau \,\mathsf{L}_n(\theta;\,x_{1:n})\) returns the Gibbs posterior under a loss \(\mathsf{L}_n\).
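As a quick check of the Bayes-posterior case (the Beta-Bernoulli model and the grid are mine, purely for illustration): taking \(h\) to be the Bernoulli log-likelihood, the Gibbs measure recovers the conjugate Beta posterior.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=20)                    # Bernoulli(0.7) data
s, n = int(x.sum()), x.size

theta = np.linspace(1e-3, 1 - 1e-3, 2000)            # grid over Theta = (0, 1)
dt = theta[1] - theta[0]
prior = np.ones_like(theta)                          # uniform prior, i.e. Beta(1, 1)
h = s * np.log(theta) + (n - s) * np.log1p(-theta)   # h = log f_theta(x_{1:n})

w = prior * np.exp(h - h.max())                      # Gibbs measure ∝ e^h * prior
gibbs = w / (w.sum() * dt)                           # normalise as a density on the grid

# Conjugacy gives the Bayes posterior in closed form: Beta(1 + s, 1 + n - s).
logB = lgamma(1 + s) + lgamma(1 + n - s) - lgamma(2 + n)
bayes = np.exp(s * np.log(theta) + (n - s) * np.log1p(-theta) - logB)

assert np.max(np.abs(gibbs - bayes)) < 5e-2          # grid Gibbs ≈ analytic posterior
```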
Second, if we have a full posterior \(\pi_h\) and define the variational approximation under the variational family \(\mathcal{F} \subset \mathcal{P}(\Theta)\) as the solution to
\[ \tilde \pi_h = \operatorname{argmin}_{\rho\in\mathcal{F}} d_{\mathrm{KL}}(\rho;\,\pi_h), \]
then by Equation 2 we can rewrite this as
\[ \tilde \pi_h = \operatorname{argmax}_{\rho\in\mathcal{F}}\left[\mathbb{E}_{\theta\sim\rho}\{h(\theta)\} - d_{\mathrm{KL}}(\rho;\,\pi)\right]. \]
That is, the variational posterior (i.e. the data-conditional measure closest to the full \(\pi_h\) in a restricted space) solves the same optimisation problem but under a restricted domain. And this second representation is then the objective we prefer to work with and which has been analysed, e.g., by Alquier, Ridgway, and Chopin (2015) and Alquier and Ridgway (2020).
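As a sanity check of this equivalence (the grid, prior, family, and \(h\) are all illustrative): searching a small family of discretised Gaussians, the member maximising \(\mathbb{E}_{\theta\sim\rho}\{h(\theta)\} - d_{\mathrm{KL}}(\rho;\,\pi)\) is exactly the member minimising \(d_{\mathrm{KL}}(\rho;\,\pi_h)\), and the restricted supremum never exceeds the unrestricted one.

```python
import numpy as np

theta = np.linspace(-5.0, 5.0, 1001)
pi = np.exp(-0.5 * theta ** 2)
pi /= pi.sum()                                   # discretised N(0, 1) prior
h = -2.0 * (theta - 1.0) ** 2                    # an arbitrary function h

w = pi * np.exp(h)
logZ = np.log(w.sum())                           # log E_pi[e^h]
post = w / w.sum()                               # full Gibbs posterior pi_h

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def member(m, s):                                # discretised Gaussian family F
    rho = np.exp(-0.5 * ((theta - m) / s) ** 2)
    return rho / rho.sum()

params = [(m, s) for m in np.linspace(-2, 2, 41) for s in np.linspace(0.1, 2.0, 20)]
elbos = [np.sum(member(m, s) * h) - kl(member(m, s), pi) for m, s in params]
kls = [kl(member(m, s), post) for m, s in params]

# The two objectives pick out the same family member, since
# E_rho[h] - KL(rho; pi) = log E_pi[e^h] - KL(rho; pi_h) identically.
assert params[int(np.argmax(elbos))] == params[int(np.argmin(kls))]
assert max(elbos) <= logZ + 1e-9                 # restricted sup <= unrestricted sup
```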
Well, from Equation 1 and Equation 2 alone, we can produce a bound in-probability: that is, for any measure \(\rho\in\mathcal{P}(\Theta)\),4 a fixed prior \(\pi\), \(\lambda > 0\),5 and \(\delta\in(0,1)\), then with probability at least \(1 - \delta\) over the sample,
\[ \mathbb{E}_{\theta\sim\rho}\{R(\theta)\} \leq \mathbb{E}_{\theta\sim\rho}\{r(\theta)\} + \frac{d_{\mathrm{KL}}(\rho;\,\pi)}{\lambda} + \frac{\lambda C^2}{n} + \frac{\log(1/ \delta)}{\lambda}. \tag{3}\]
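To see what such a certificate looks like numerically, here's a sketch with a bounded \(\{0,1\}\) loss, for which the Hoeffding condition holds with \(C^2 = 1/8\); the data model, loss, grid, and the choice \(\lambda = \sqrt{n}\) are all illustrative, and \(\rho\) is taken to be a Gibbs posterior.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(3)
n = 500
X = rng.normal(0.0, 1.0, size=n)

theta = np.linspace(-3.0, 3.0, 601)
# Bounded loss l_theta(x) = 1{|x - theta| > 1}, so R(theta) = 1 - P(theta-1 < X < theta+1).
r = np.mean(np.abs(X[:, None] - theta[None, :]) > 1.0, axis=0)   # empirical risk
R = 1.0 - (np.vectorize(Phi)(theta + 1.0) - np.vectorize(Phi)(theta - 1.0))

pi = np.full(theta.size, 1.0 / theta.size)     # fixed flat prior on the grid
lam, delta, C2 = sqrt(n), 0.05, 1.0 / 8.0      # Hoeffding holds with C^2 = 1/8 here
w = pi * np.exp(-lam * r)
rho = w / w.sum()                              # Gibbs posterior as our rho
kl = np.sum(rho * np.log(rho / pi))

rhs = np.sum(rho * r) + kl / lam + lam * C2 / n + np.log(1.0 / delta) / lam
assert np.sum(rho * R) <= rhs                  # the certificate of Equation 3 holds
```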
Localised priors and cheating the devil
Now, in order to achieve this bound, we needed the prior to be independent of the data: these bounds do not admit a data-dependent prior. The tightness of the bound, however, does depend on \(\pi\): broadly speaking, if the prior allocates more of its mass to “good” (low-risk) regions of \(\Theta\), then the bound is tighter.
In order to point the prior in this direction, Catoni (2007) proposes a so-called localised prior: for \(\beta > 0\)
\[ \pi_{\beta R}(\theta) \propto e^{-\beta R(\theta)}\pi(\theta). \]
Well, this isn’t data-dependent: the oracle risk \(R(\theta)\) doesn’t depend on the sample. Naturally, though, we can’t compute it: the oracle risk would make a perfectly valid prior, but we don’t have access to it. So instead Catoni (2007) shows that with high probability,
\[ d_{\mathrm{KL}}(\rho;\,\pi_{\beta R}) \leq d_{\mathrm{KL}}(\rho;\,\pi_{\xi r}) \tag{4}\]
for \(\pi_{\xi r}(\theta) \propto e^{-\xi r(\theta)}\pi(\theta)\) and \(\xi = \beta / \{\lambda + g(\lambda/n)\lambda^2/n\}\) with \(g(\cdot)\) Bernstein’s function (Alquier 2024, Lemma 3.8). This gives us a procedure for choosing a prior which tightens the bound, which doesn’t violate the theory, and which we can compute. We’ll see this come up again later.
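A sketch of the computable half of this procedure (the grid, base prior, data model, and the numeric value of \(\xi\) are all illustrative; in particular I plug in a number for \(\xi\) rather than evaluating Bernstein's function):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = rng.normal(1.0, 1.0, size=n)

theta = np.linspace(-2.0, 4.0, 301)
r = np.mean((X[:, None] - theta[None, :]) ** 2, axis=0)   # empirical risk, squared loss

pi = np.full(theta.size, 1.0 / theta.size)   # flat base prior on the grid
xi = 5.0                                     # illustrative value; the theory sets
                                             # xi = beta / {lambda + g(lambda/n) lambda^2 / n}
w = pi * np.exp(-xi * (r - r.min()))         # pi_{xi r} ∝ e^{-xi r(theta)} pi(theta)
pi_loc = w / w.sum()

# The localised prior moves mass toward low-risk regions of Theta:
assert np.sum(pi_loc * r) < np.sum(pi * r)
```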
Plugging \(\pi_{\xi r}\) into the bound of Equation 3 yields, with probability at least \(1 - \delta\) over the sample (Alquier 2024, Lemma 3.8)
\[ \mathbb{E}_{\theta\sim\rho}\{R(\theta)\} \leq \frac{(1 - \xi)\mathbb{E}_{\theta\sim\rho}\{r(\theta)\} + d_{\mathrm{KL}}(\rho;\,\pi_{\xi r}) + (1 + \xi) \log(2/\delta)}{(1 - \xi)\lambda + (1 + \xi)g(\lambda/n)\lambda^2/n}, \tag{5}\]
which I’m told is usually tighter than Equation 3, even if this isn’t immediately obvious from the bound.
Simpler bounds in expectation
These bounds are, as you may have noticed, quite busy. Taking the bound in expectation over the data instead usually leads to something much more interpretable, and also suggests a new approach to prior choice. From the same starting point as Equation 3, for a data-dependent \(\hat\rho\)6 we can also achieve a bound of the form
\[ \mathbb{E}_\mathcal{S}\mathbb{E}_{\theta\sim \hat \rho}\{R(\theta)\} \leq \mathbb{E}_\mathcal{S}\left[ \mathbb{E}_{\theta\sim \hat \rho}\{r(\theta)\} + \frac{d_{\mathrm{KL}}(\hat \rho;\,\pi)}{\lambda} + \frac{\lambda C^2}{n} \right]. \tag{6}\]
This bound is much cleaner, and at this point we might consider directly optimising for \(\pi\): \[\begin{align*} \mathbb{E}_\mathcal{S}\mathbb{E}_{\theta\sim \hat \rho}\{R(\theta)\} &\leq \inf_{\pi\in\mathcal{P}(\Theta)} \mathbb{E}_\mathcal{S}\left[ \mathbb{E}_{\theta\sim \hat \rho}\{r(\theta)\} + \frac{d_{\mathrm{KL}}(\hat \rho;\,\pi)}{\lambda} + \frac{\lambda C^2}{n} \right] \\ &\leq \mathbb{E}_\mathcal{S} \mathbb{E}_{\theta\sim \hat \rho}\{r(\theta)\} + \inf_{\pi\in\mathcal{P}(\Theta)} \mathbb{E}_\mathcal{S}\left\{ \frac{d_{\mathrm{KL}}(\hat \rho;\,\pi)}{\lambda} + \frac{\lambda C^2}{n} \right\}. \end{align*}\] At this point it’s clear that the infimum would be attained by \(\pi = \hat\rho\). But this isn’t a valid prior: it is data-dependent, and thus forbidden by our theory.
Mutual information bounds and a return to the local
Instead, consider writing
\[ \mathbb{E}_\mathcal{S} \{d_{\mathrm{KL}}(\hat \rho;\,\pi)\} = \underbrace{\mathbb{E}_\mathcal{S} \{d_{\mathrm{KL}}(\hat \rho;\,\mathbb{E}_\mathcal{S} \hat\rho)\}}_{\mathcal{I}(\theta,\mathcal{S})} + d_{\mathrm{KL}}(\mathbb{E}_\mathcal{S} \hat\rho;\,\pi) \tag{7}\]
where \(\mathcal{I}(\theta,\mathcal{S})\) is the mutual information between \(\hat\rho\)7 and \(\mathcal{S}\). Well, minimising this instead tells us that we should choose \(\pi = \mathbb{E}_\mathcal{S} \hat\rho\) to set the second term equal to zero. This choice leads to a so-called “mutual information bound”:
\[ \mathbb{E}_\mathcal{S}\mathbb{E}_{\theta\sim \hat \rho}\{R(\theta)\} \leq \mathbb{E}_\mathcal{S}\left[ \mathbb{E}_{\theta\sim \hat \rho}\{r(\theta)\} + \frac{\mathcal{I}(\theta,\mathcal{S})}{\lambda} + \frac{\lambda C^2}{n} \right], \]
whence, choosing \(\lambda = \sqrt{n\,\mathcal{I}(\theta,\mathcal{S})/C^2}\) to balance the last two terms yields
\[ \mathbb{E}_\mathcal{S}\mathbb{E}_{\theta\sim \hat \rho}\{R(\theta)\} \leq \mathbb{E}_\mathcal{S}\left[ \mathbb{E}_{\theta\sim \hat \rho}\{r(\theta)\} \right] + 2C\sqrt{\frac{\mathcal{I}(\theta,\mathcal{S})}{n}}. \tag{8}\]
Well, since \(\mathcal{I}(\theta,\mathcal{S}) \leq \mathbb{E}_\mathcal{S}\{d_{\mathrm{KL}}(\hat \rho;\,\pi)\}\) by Equation 7, Equation 8 is tighter than Equation 6. And that’s exactly what we wanted from the prior!8
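This decomposition and the resulting comparison can be checked by Monte Carlo (the data model, the Gibbs form of \(\hat\rho\), and all constants here are illustrative): simulating many datasets, the two sides of Equation 7 agree, and the mutual information term is never larger than \(\mathbb{E}_\mathcal{S}\{d_{\mathrm{KL}}(\hat\rho;\,\pi)\}\).

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.linspace(-3.0, 3.0, 201)
pi = np.full(theta.size, 1.0 / theta.size)   # fixed prior on the grid
n, lam, n_rep = 50, 10.0, 400                # sample size, learning rate, no. datasets

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rhos = []
for _ in range(n_rep):
    X = rng.normal(0.5, 1.0, size=n)                          # a fresh dataset S
    r = np.mean((X[:, None] - theta[None, :]) ** 2, axis=0)   # empirical risk
    w = pi * np.exp(-lam * (r - r.min()))
    rhos.append(w / w.sum())                                  # data-dependent rho_hat

rho_bar = np.mean(rhos, axis=0)                    # Monte Carlo estimate of E_S rho_hat
mi = np.mean([kl(p, rho_bar) for p in rhos])       # mutual information term
ekl = np.mean([kl(p, pi) for p in rhos])           # E_S KL(rho_hat; pi)

assert np.isclose(mi + kl(rho_bar, pi), ekl)       # the decomposition of Equation 7
assert mi <= ekl                                   # hence the tighter bound
```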
But this was under the choice of \(\pi = \mathbb{E}_\mathcal{S} \hat\rho\), which we again can’t compute in practice. Instead, Catoni (2007) proposes to approximate this with \(\mathbb{E}_\mathcal{S} \hat\rho \approx \pi_{\beta R}\), and then continue as before from Equation 4.
References
Footnotes
What is important here is that \(\pi\) not depend on data. Whether or not it is really a “prior” belief distribution is sort of irrelevant as far as the maths is concerned.↩︎
If you read my previous post on power posteriors, this quantity is in fact exactly \(\lambda = \tau^{-1}\) where \(\tau\) is the temperature in the power posterior. This connection will become clearer in a moment.↩︎
And beyond, of course. This result is not limited to PAC Bayes, and has been used widely in general. I’m just more aware of its uses here.↩︎
Which is absolutely continuous with respect to the prior \(\pi\).↩︎
We will later take these things as a function of \(n\), usually also to tighten this bound and usually at rate \(\lambda \asymp \sqrt{n}\).↩︎
We really only care about data-dependent \(\hat \rho\) anyway, so while the theory holds for any measure, in this section we’re only going to talk about the data-dependent ones.↩︎
Confusingly, this is written in terms of \(\theta\) but refers to its distribution: \(\theta\sim\hat\rho\).↩︎
At least that’s what we wanted from it in this PAC Bayesian setting.↩︎
Citation
@online{mclatchie2025,
author = {Yann McLatchie},
title = {Prior Choice in {PAC} {Bayes}},
date = {2025-04-25},
url = {https://yannmclatchie.github.io/blog/posts/pac-priors},
langid = {en}
}