$\mathbb{P}$robably Approximately Wrong

An infrequent blog, by Nicola Branchini

Why do sampling and estimating the normalizing constant avoid each other?

We want to get samples from $p(\mathbf{x})$, exactly or approximately. Except in some special cases (inverse transform sampling, for instance), it generally does not matter at all whether we know the normalizing constant of $p$, i.e. $Z_{p} = \int \widetilde{p}(\mathbf{x}) \, d\mathbf{x}$, where $\widetilde{p}(\mathbf{x})$ is the unnormalized density (see e.g. this X-validated response by Xi’an). For example, in a Bayesian statistics context $\widetilde{p}(\mathbf{x})$ would be the product of likelihood and prior, and $Z_{p}$ would be the model evidence. The most generic method (or rather, class of methods) to obtain samples from $p$ is Markov chain Monte Carlo (MCMC), which by design avoids the need for $Z_{p}$. Importantly, even if we knew it (say some oracle gave it to us), it would not help us at all, with the methods I am aware of, in speeding up MCMC or improving it otherwise. Another way is to use sampling importance resampling (SIR), which is based on self-normalized importance sampling (IS) plus (crucially) resampling. With this method too, since the IS weights need to be normalized for resampling, $Z_{p}$ cancels out.
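To make the cancellation in SIR concrete, here is a minimal sketch. The Gaussian proposal, the toy target (an unnormalized $N(1, 1)$) and the names `sir`, `log_p_tilde_a` and `log_p_tilde_b` are all assumptions made for illustration. The same target is scaled by two different constants, and the normalized weights come out the same.

```python
import numpy as np

def sir(log_p_tilde, n_proposals, n_samples, rng, scale=3.0):
    """Sampling importance resampling with a N(0, scale^2) proposal (an assumed choice)."""
    x = rng.normal(0.0, scale, size=n_proposals)
    # Log-density of the normalized Gaussian proposal q.
    log_q = -0.5 * (x / scale) ** 2 - np.log(scale * np.sqrt(2.0 * np.pi))
    log_w = log_p_tilde(x) - log_q
    # Self-normalization: a constant factor in p_tilde shifts log_w by a
    # constant, which cancels when the weights are divided by their sum.
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Resample proposal draws in proportion to their normalized weights.
    idx = rng.choice(n_proposals, size=n_samples, replace=True, p=w)
    return x[idx], w

# Hypothetical target: unnormalized N(1, 1), scaled by two different constants.
def log_p_tilde_a(x):
    return -0.5 * (x - 1.0) ** 2

def log_p_tilde_b(x):
    return np.log(1000.0) - 0.5 * (x - 1.0) ** 2

samples_a, w_a = sir(log_p_tilde_a, 20_000, 5_000, np.random.default_rng(0))
samples_b, w_b = sir(log_p_tilde_b, 20_000, 5_000, np.random.default_rng(0))

print(np.allclose(w_a, w_b))  # do the normalized weights agree?
print(samples_a.mean())       # should be close to the target mean, 1.0
```

Multiplying $\widetilde{p}$ by 1000 only shifts the log-weights by a constant, so nothing downstream of the normalization can tell the two targets apart.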

Other times, all we are interested in is approximating $Z_{p}$, viewed as nothing more than a numerical integration task, for example for model comparison in Bayesian statistics. The way to do it with a randomized algorithm is Monte Carlo (MC), whose generalization (suited for this task) is IS. Suppose, again, that an oracle gives us samples from $p$. That’s great: $p$ is provably the optimal density to sample from, the one that minimizes the variance of the IS estimator (and hence the MC variance). But again, it does not help us at all. We have no use for those samples, because $Z_{p}$ is not an expectation w.r.t. $p$, so we cannot use plain MC, and IS by design requires that we know the normalizing constant of the proposal. If the samples come from $p$, then the proposal is $p$, and we don’t know its normalizing constant by definition (it is literally what we are trying to estimate).
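As a small illustration of why IS needs a normalized proposal, the sketch below estimates $Z_{p}$ through the identity $Z_{p} = \mathbb{E}_{q}[\widetilde{p}(\mathbf{x}) / q(\mathbf{x})]$. The toy target (an unnormalized standard Gaussian, so the true value is $\sqrt{2\pi}$), the proposal $q = N(0, 2^2)$ and the names `p_tilde` and `q_pdf` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
scale = 2.0  # proposal q = N(0, scale^2), an assumed choice

def p_tilde(x):
    # Unnormalized standard Gaussian; the true Z_p is sqrt(2 * pi).
    return np.exp(-0.5 * x ** 2)

def q_pdf(x):
    # Fully normalized proposal density: this is what IS needs to evaluate.
    return np.exp(-0.5 * (x / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

# Draw from q and average the importance weights p_tilde / q.
x = rng.normal(0.0, scale, size=n)
z_hat = np.mean(p_tilde(x) / q_pdf(x))

print(z_hat, np.sqrt(2.0 * np.pi))  # estimate next to the true value
```

The estimator only works because `q_pdf` can be evaluated, constant included. With $q = p$ every weight would equal $Z_{p}$ exactly, but writing down `q_pdf` would then require the very number we are after.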

Knowing the constant does not help us in sampling. Sampling does not help us in estimating the constant.

The statement would be more accurate with caveats added (e.g. “by itself, does not help ...”).

In my view, there is something a bit unintuitive about all of this.
So what is going on here? Is it just trivial that the two tasks should be uninformative about each other?

Cited as:

  title   = “Why do sampling and estimating the normalizing constant avoid each other?”,
  author  = Branchini, Nicola,
  journal = https://www.branchini.fun,
  year    = 2023,
