Inferring posterior forms
Inferring posterior forms in a Gaussian distribution can get really tricky with so many terms. There is a simple trick however that can save a lot of time. This post is in presenting this trick to save lot of calculations.
This post specifically addresses the following topics:
- Estimating parameters of a Gaussian for the posterior in Bayesian Inference
- Finding Maximum likelihood parameters for a multivariate Gaussian
Let \({\bf X}\) be a random vector that follows multivariate Gaussian distribution i.e,
\[{\bf X} \sim \mathcal{N}({\bf \mu}, {\bf \Sigma})\]where \({\bf \mu} = (\mu_1, \dots, \mu_D)^T\) is the mean vector and \({\bf \Sigma}\) is the covariance matrix. If \({\bf x}\) is instantiation of the above random variable,
\[\mathcal{N}({\bf x} \mid {\bf \mu}, {\bf \Sigma}) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|{\bf\Sigma}|^{1/2}} \exp \left\{-\frac{1}{2}({\bf x}-{\bf \mu})^T\Sigma^{-1}({\bf x}-\mu)\right\}\]Expanding the exponent term gives;
\[-\frac{1}{2}({\bf x}-{\bf \mu})^T\Sigma^{-1}({\bf x}-{\mathbf{\mu}}) = -\frac{1}{2}{\bf x}^T{\bf \Sigma}^{-1}{\bf x} + {\bf x}^T{\bf \Sigma}^{-1}{\bf \mu} + constant\]where \(\text{constant}\) are terms that do not depend on \(\bf x\). All we care about in the exponent term is coefficient of quadratic term and the linear term.
Bayesian Inference for the Gaussian
Let us confine to univariate Gaussian distribution and divide the derivation into three cases.
case 1: \(\sigma^2\) is known
We shall find the parameter \(\mu\) given \(N\) independent observations \({\bf X} = \{x_1, x_2, \dots, x_N\}\). Then, the likelihood becomes;
\[\begin{align} p({\bf X} \mid \mu) & = \prod_{i=1}^{N}p(x_n \mid \mu) \\ & = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp \left\{ -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} \end{align}\]Choosing the prior such that the posterior forms conjugate pair,
\[p(\mu) = \mathcal{N}(\mu \mid \mu_o, \sigma_o^2)\]Now, the posterior is given by;
\[\begin{align} p(\mu \mid {\bf X}) & \propto p(\mu) p({\bf X} \mid \mu) \\ & \propto \frac{1}{(2\pi\sigma_o^2)^{1/2}} \exp \left\{-\frac{1}{2\sigma_o^2}(\mu - \mu_o)^2 \right\} \frac{1}{(2\pi\sigma^2)^{N/2}} \exp \left\{ -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} \end{align}\]Looking at the sum in the exponent of the above terms, it is clear that it is of quadratic from confirming that the posterior is a Gaussian. Let the posterior gaussian be represented by \(\mathcal{N}(\mu \mid \mu_N, \sigma_N^2)\).To find the parameters of the Gaussian, we look for the quadratic and linear terms in the exponent for \(\mu\). Writing the exponent,
\[\begin{align} & = -\frac{1}{2\sigma_o^2}(\mu - \mu_o)^2 -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 \\ & = -\frac{1}{2\sigma_o^2} (\mu^2 + \mu_o^2 - 2\mu\mu_o) - \frac{1}{2\sigma^2} \sum(x_n^2 + \mu^2 -2\mu x_n) \\ & = -\frac{1}{2}\left\{\left(\frac{1}{\sigma_o^2} + \frac{N}{\sigma^2}\right)\mu^2 - \left(\frac{2\mu_o}{\sigma_o^2} + \frac{2\Sigma x_n}{\sigma^2}\right)\mu \right\}+ constant \end{align}\]Comparing the quadratic coefficient gives,
\[\frac{1}{\sigma_N^2} = \frac{\mu_o}{\sigma_o^2} + \frac{N}{\sigma^2}\]Next comparing the linear term gives,
\[\sigma_N^{-2}\mu_N = \frac{\mu_o}{\sigma_o^2} + \frac{\Sigma x_n}{\sigma^2}\]And hence, \(\mu_N\) becomes,
\[\mu_N = \frac{\sigma^2\mu_o + \sigma_o^2\Sigma x_n}{\sigma^2 + N \sigma_o^2}\]case 2: \(\mu\) is known
In this case, we will find the parameter \(\lambda \equiv \sigma^{-2}\) given \(N\) independent observations \({\bf X} = \{x_1, x_2, \dots, x_N\}\). Then, the likelihood becomes;
\[\begin{align} p({\bf X} \mid \lambda) & = \prod_{i=1}^{N}p(x_n \mid \lambda) \\ & = \frac{\lambda}{(2\pi)^{N/2}} \exp \left\{ -\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} \\ & \propto \lambda^{N/2}exp \left\{\frac{-\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} \end{align}\]To form the conjuagate pair, choose the prior such that it is proportional to \(\lambda^a exp(\alpha \lambda)\) which is a gamma distribution.
\[Gamma(\lambda \mid a, b) = \frac{1}{\Gamma (a)}b^a \lambda^{a-1}exp(-b\lambda)\]Choosing above gamma distribution with parameters \(a_o, b_o\) as the prior, i.e,
\[p(\lambda) = \frac{1}{\Gamma (a_o)}{b_o}^{a_o} \lambda^{a_o-1}exp(-b_o\lambda)\]gives posterior the form;
\[\begin{align} p(\lambda \mid {\bf X}) & \propto p({\bf X} \mid \lambda) p(\lambda) \\ & \propto \lambda^{N/2} \exp \left\{ -\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} \lambda^{a_o-1} \exp (-b_o \lambda) \\ & \propto \lambda^{N/2+a_o-1} \exp \left\{-b_o \lambda -\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} \end{align}\]which is again a Gamma distribution, say \(Gamma(\lambda \mid a_N, b_N)\) and comparing to the standard Gamma distribution,
\[\begin{align} a_N & = a_o + \frac{N}{2} \\ b_N & = b_o + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 \end{align}\]case 3: When both \(\mu\) and \(\sigma^2\) are unkown
Again using \(\lambda \equiv \sigma^{-2}\) for \(N\) independent observations \({\bf X} = \{x_1, x_2, \dots, x_N\}\) likelihood is given by;
\[\begin{align} p({\bf X} \mid \mu, \lambda) & = \prod_{n=1}^{N}p(x_n \mid \mu, \lambda) \\ & = \left[\left(\frac{\lambda}{2\pi}\right)^{1/2}\exp \left\{-\frac{\lambda\mu^2}{2}\right\}\right]^N \exp\left\{\lambda \mu \sum_{n=1}{N}x_n -\frac{\lambda}{2}\sum_{n=1}{N}x_n^2\right\} \end{align}\]Now the prior is choosen similar to the way we did before;
\[p(\mu, \lambda) \propto \left[\left(\lambda\right)^{1/2}\exp \left\{-\frac{\lambda\mu^2}{2}\right\}\right]^\beta \exp\left\{\lambda \mu c -d \lambda \right\}\]where \(c, d, \beta \) are constants. Now the posterior is given by;
\[\begin{align} p(\mu, \lambda \mid {\bf X}) & \propto p({\bf X} \mid \mu, \lambda)p(\mu, \lambda) \\ & \propto \left[\left(\frac{\lambda}{2\pi}\right)^{1/2}\exp \left\{-\frac{\lambda\mu^2}{2}\right\}\right]^N \exp\left\{\lambda \mu \sum_{n=1}{N}x_n -\frac{\lambda}{2}\sum_{n=1}{N}x_n^2\right\} \left[\left(\lambda\right)^{1/2}\exp \left\{-\frac{\lambda\mu^2}{2}\right\}\right]^\beta \exp\left\{\lambda \mu c -d \lambda \right\} \\ & \propto \exp \left\{-\frac{\lambda}{2}(N+\beta)\left(\mu^2-\frac{2\mu\left(\sum x_n + c\right)}{N+\beta} + \left(\frac{\sum x_n + c}{N+\beta}\right)^2\right) \right\} \lambda^{\frac{N+\beta}{2}} \exp \left\{-\lambda\left(d-\frac{\sum x_n^2}{2} - \frac{\left(\sum x_n + c\right)^2}{2(N+\beta)}\right)\right\} \end{align}\]which is of the form of product of Gaussian and Gamma i.e, \(\mathcal{N}(\mu \mid \mu_N, \lambda_N) Gamma(\lambda \mid a_N, b_N)\) and the parameters are given by;
\[\begin{align} \mu_N & = \frac{\sum x_n + c}{N + \beta}\\ \lambda_N & = \lambda (N + \beta)\\ a_N & = 1 + \frac{N+\beta}{2}\\ b_N & = d - \frac{\sum x_n^2}{2} - \frac{\left(\sum x_n + c\right)^2}{2(N+\beta)}\\ \end{align}\]Maximum likelihood parameters for a multivariate Gaussian
Let \({\bf X} = \{x_1, x_2, \dots, x_N\}\) be set of independent observations. The likelihood function is given by:
\[\begin{align} p({\bf X} \mid {\bf \mu}, {\bf \Sigma}) & = \prod_{n=1}^{N} \mathcal{N}(x_n \mid {\bf \mu}, {\bf \Sigma}) \\ & = \frac{1}{(2\pi)^{D/2}}\frac{1}{|{\bf \Sigma}|^{1/2}} \exp \left\{\sum_{n=1}^{N}-\frac{1}{2}({\bf x_n - \mu)}^T {\Sigma}^{-1}({\bf x_n - \mu)}\right\} \end{align}\]We shall estimate parameters of the above density model by first taking log likelihood of the above;
\[\text{ln} p ({\bf X} \mid {\bf \mu, \Sigma}) = -\frac{ND}{2} ln(2\pi) -\frac{N}{2} ln|\Sigma| -\frac{1}{2}\sum_{n=1}^{N}-\frac{1}{2}({\bf x_n - \mu)}^T {\Sigma}^{-1}({\bf x_n - \mu)}\]Estimating the parameter \(\bf \mu\)
Differentiate above equation wrt \({\bf \mu}\) and setting to zero;
\[\frac{\partial \text{ln} p ({\bf X} \mid {\bf \mu, \Sigma})}{\partial \mu} = \sum_{n=1}^{N}\Sigma^{-1} ({\bf x_n - \mu)}\]equating the above to zero;
\[{\bf \mu_{ML}} = \frac{\sum_{n=1}^{N}x_n}{N}\]Estimating the parameter \(\bf \Sigma\)
\[\frac{\partial}{\partial {\bf \Sigma}^{-1}} \text{ln} p ({\bf X} \mid {\bf \mu, \Sigma}) = \frac{N}{2}{\bf \Sigma}^T -\frac{1}{2}\sum_{n=1}^{N}\frac{\partial}{\partial {\bf \Sigma}^{-1}} \left\{ {\bf (x_n - \mu)^T\Sigma^{-1}(x_n - \mu) } \right\}\]equating the above to zero; we get;
\[{\bf \Sigma_{ML}} = \frac{1}{N}\sum_{n=1}^{N}{\bf (x_n - \mu_{ML})(x_n - \mu_{ML}})^T\]