Laplace's approximation

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.^[1]^[2] The approximation is justified by the Bernstein–von Mises theorem, which states that, under regularity conditions, the error of the approximation tends to 0 as the number of data points tends to infinity.^[3]^[4]

For example, consider a regression or classification model with data set $\{x_{n},y_{n}\}_{n=1,\ldots ,N}$ comprising inputs $x$ and outputs $y$ with (unknown) parameter vector $\theta$ of length $D$. The likelihood is denoted $p({\bf {y}}|{\bf {x}},\theta )$ and the parameter prior $p(\theta )$. Suppose one wants to approximate the joint density of outputs and parameters $p({\bf {y}},\theta |{\bf {x}})$. Bayes' formula reads:

$$p({\bf {y}},\theta |{\bf {x}})\;=\;p({\bf {y}}|{\bf {x}},\theta )p(\theta |{\bf {x}})\;=\;p({\bf {y}}|{\bf {x}})p(\theta |{\bf {y}},{\bf {x}})\;\simeq \;{\tilde {q}}(\theta )\;=\;Zq(\theta ).$$

The joint is equal to the product of the likelihood and the prior (taken to be independent of the inputs, so that $p(\theta |{\bf {x}})=p(\theta )$) and, by Bayes' rule, also equal to the product of the marginal likelihood $p({\bf {y}}|{\bf {x}})$ and the posterior $p(\theta |{\bf {y}},{\bf {x}})$. Seen as a function of $\theta$, the joint is an un-normalised density.

In Laplace's approximation, we approximate the joint by an un-normalised Gaussian ${\tilde {q}}(\theta )=Zq(\theta )$, where $q$ denotes an approximate density, ${\tilde {q}}$ an un-normalised density and $Z$ the normalisation constant of ${\tilde {q}}$ (independent of $\theta$). Since the marginal likelihood $p({\bf {y}}|{\bf {x}})$ does not depend on the parameter $\theta$ and the posterior $p(\theta |{\bf {y}},{\bf {x}})$ normalises over $\theta$, we can immediately identify them with $Z$ and $q(\theta )$ of our approximation, respectively.

Laplace's approximation is

$$p({\bf {y}},\theta |{\bf {x}})\;\simeq \;p({\bf {y}},{\hat {\theta }}|{\bf {x}})\exp {\big (}-{\tfrac {1}{2}}(\theta -{\hat {\theta }})^{\top }S^{-1}(\theta -{\hat {\theta }}){\big )}\;=\;{\tilde {q}}(\theta ),$$

where we have defined

$${\begin{aligned}{\hat {\theta }}&\;=\;\operatorname {argmax} _{\theta }\log p({\bf {y}},\theta |{\bf {x}}),\\S^{-1}&\;=\;-\left.\nabla _{\theta }\nabla _{\theta }\log p({\bf {y}},\theta |{\bf {x}})\right|_{\theta ={\hat {\theta }}},\end{aligned}}$$

with ${\hat {\theta }}$ the location of a mode of the joint target density, also known as the maximum a posteriori (MAP) point, and $S^{-1}$ the $D\times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta ={\hat {\theta }}$. Thus, the Gaussian approximation matches the value and the log-curvature of the un-normalised target density at the mode. The value of ${\hat {\theta }}$ is usually found using a gradient-based method.
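
The form of this approximation can be motivated by a second-order Taylor expansion of the log joint around the mode. The following is a brief sketch, assuming the log joint is twice differentiable at ${\hat {\theta }}$ and that the mode lies in the interior of the parameter space. Because the gradient vanishes at a mode, the linear term drops out:

$${\begin{aligned}\log p({\bf {y}},\theta |{\bf {x}})&\;\simeq \;\log p({\bf {y}},{\hat {\theta }}|{\bf {x}})+{\tfrac {1}{2}}(\theta -{\hat {\theta }})^{\top }\left.\nabla _{\theta }\nabla _{\theta }\log p({\bf {y}},\theta |{\bf {x}})\right|_{\theta ={\hat {\theta }}}(\theta -{\hat {\theta }})\\&\;=\;\log p({\bf {y}},{\hat {\theta }}|{\bf {x}})-{\tfrac {1}{2}}(\theta -{\hat {\theta }})^{\top }S^{-1}(\theta -{\hat {\theta }}).\end{aligned}}$$

Exponentiating both sides recovers ${\tilde {q}}(\theta )$, and the standard Gaussian integral then gives its normalisation constant:

$$Z\;=\;\int {\tilde {q}}(\theta )\,d\theta \;=\;p({\bf {y}},{\hat {\theta }}|{\bf {x}})\,(2\pi )^{D/2}\,|S|^{1/2}.$$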

In summary, we have

$${\begin{aligned}q(\theta )&\;=\;{\cal {N}}(\theta |\mu ={\hat {\theta }},\Sigma =S),\\\log Z&\;=\;\log p({\bf {y}},{\hat {\theta }}|{\bf {x}})+{\tfrac {1}{2}}\log |S|+{\tfrac {D}{2}}\log(2\pi ),\end{aligned}}$$

for the approximate posterior over $\theta$ and the approximate log marginal likelihood respectively.
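
As an illustration of these two formulas, the following Python sketch applies them to a one-parameter model ($D=1$) chosen only because its exact marginal likelihood is available for comparison: $k$ successes in $n$ Bernoulli trials with a Beta$(a,b)$ prior on the success probability $\theta$. The model, the function names and the numerical values are assumptions made for this example and are not part of the method itself. The mode is found with Newton's method, one possible gradient-based choice; the sketch assumes that the mode lies strictly inside $(0,1)$ and that the iterates stay there.

```python
import math


def log_joint(theta, k, n, a, b):
    """log p(y, theta): Bernoulli likelihood times Beta(a, b) prior."""
    log_beta_ab = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return ((k + a - 1) * math.log(theta)
            + (n - k + b - 1) * math.log(1 - theta)
            - log_beta_ab)


def grad_log_joint(theta, k, n, a, b):
    """First derivative of the log joint with respect to theta."""
    return (k + a - 1) / theta - (n - k + b - 1) / (1 - theta)


def neg_hessian(theta, k, n, a, b):
    """Second derivative of the negative log joint (a scalar here, D = 1)."""
    return (k + a - 1) / theta ** 2 + (n - k + b - 1) / (1 - theta) ** 2


def laplace(k, n, a, b, theta0=0.5, tol=1e-12, max_iter=100):
    """Return (theta_hat, S, log_Z) of the Laplace approximation."""
    theta = theta0
    for _ in range(max_iter):
        # Newton step towards the maximum of the log joint.
        step = grad_log_joint(theta, k, n, a, b) / neg_hessian(theta, k, n, a, b)
        theta += step
        if abs(step) < tol:
            break
    S = 1.0 / neg_hessian(theta, k, n, a, b)  # covariance = inverse precision
    D = 1
    log_Z = (log_joint(theta, k, n, a, b)
             + 0.5 * math.log(S)
             + 0.5 * D * math.log(2 * math.pi))
    return theta, S, log_Z


def exact_log_evidence(k, n, a, b):
    """Exact log marginal likelihood: log B(k + a, n - k + b) - log B(a, b)."""
    def log_beta(p, q):
        return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return log_beta(k + a, n - k + b) - log_beta(a, b)


if __name__ == "__main__":
    theta_hat, S, log_Z = laplace(k=7, n=10, a=2.0, b=2.0)
    print("mode:", theta_hat, "variance:", S)
    print("Laplace log Z:", log_Z)
    print("exact log evidence:", exact_log_evidence(7, 10, 2.0, 2.0))
```

With only ten observations the approximate and exact log marginal likelihoods already agree closely in this example, although they are not identical; the gap is the approximation error that, under regularity conditions, shrinks as the number of data points grows.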

The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used and was pioneered in the context of neural networks by David MacKay,^[5] and for Gaussian processes by Williams and Barber.^[6]
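
Both weaknesses can be seen in a small numerical sketch. The example below is hypothetical and uses an assumed Beta-Bernoulli model (one success in ten trials with a Beta$(2,2)$ prior), whose exact posterior is a skewed Beta$(3,11)$ distribution on $(0,1)$. The Gaussian fitted at the mode is centred on the mode rather than the posterior mean, and, being symmetric, places some probability mass on negative values of $\theta$ that the true posterior excludes.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Assumed data and prior for illustration: 1 success in 10 trials, Beta(2, 2) prior.
k, n, a, b = 1, 10, 2.0, 2.0
alpha, beta = k + a, n - k + b  # the exact posterior is Beta(alpha, beta)

# Laplace approximation: mode and curvature of the log joint at the mode.
mode = (alpha - 1) / (alpha + beta - 2)
precision = (alpha - 1) / mode ** 2 + (beta - 1) / (1 - mode) ** 2
sd = math.sqrt(1.0 / precision)

# Quantities of the exact posterior for comparison.
exact_mean = alpha / (alpha + beta)

# Mass the Gaussian assigns to theta < 0, where the exact posterior has none.
mass_below_zero = normal_cdf(0.0, mode, sd)

print("Laplace mean (the mode):", mode)
print("exact posterior mean:", exact_mean)
print("Gaussian mass on theta < 0:", mass_below_zero)
```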

[1]

[2]

[3]

[4]

[5]

[6]
