This is a collection of formulas and definitions useful for posts on active inference, machine learning and perhaps other things.
Kullback-Leibler divergence
Kullback-Leibler divergence measures the dissimilarity between two distributions, here \(q(x)\) and \(p(x)\):
$$D_{KL}\left[q(x) \mid \mid p(x) \right] = \sum_x \log \frac{q(x) }{p(x)} q(x)=- \sum_x \log \frac{p(x)}{q(x)} q(x)$$
Any sum or integral whose terms (or integrand) are multiplied by a probability distribution, here \(q(x)\), can be interpreted as an expected value (expectation). The expectation below is the average of the expression inside the brackets when \(x\) is sampled from \(q(x)\), i.e., an average of the terms weighted by the probability distribution \(q(x)\). This means that:
$$D_{KL}\left[q(x) \mid \mid p(x) \right] = \mathbb{E}_{q(x)}\left[\log\frac{q(x) }{p(x)}\right] = \mathbb{E}_{q(x)}\left[\log q(x) - \log p(x) \right]$$
If \(x\) is discrete, then we take the sum over all values of \(x\).
If \(x\) is continuous, the sum becomes an integral, which we can estimate using Monte Carlo integration (see below).
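For the discrete case, the sum can be written out directly. The following is a minimal sketch, not a reference implementation: the function name `kl_divergence` and the example vectors are made up for illustration, the natural logarithm is assumed (so the result is in nats), and \(p(x)\) is assumed to be non-zero wherever \(q(x)\) is non-zero.

```python
import math

def kl_divergence(q, p):
    """D_KL[q || p] for two categorical distributions given as lists of probabilities."""
    # Terms with q(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Two illustrative distributions over the same two outcomes.
q = [0.5, 0.5]
p = [0.9, 0.1]

kl_qp = kl_divergence(q, p)  # divergence of q from p
kl_pq = kl_divergence(p, q)  # arguments swapped: generally a different value
kl_qq = kl_divergence(q, q)  # a distribution compared with itself gives zero
```

Comparing `kl_qp` with `kl_pq` shows that the divergence is not symmetric in its arguments, which is why it is called a divergence rather than a distance.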
Monte Carlo integration
Integrals involving a probability distribution \(q(x)\) can be approximated with the average of a limited number of “samples”:
$$\int_x f(x)q(x)dx := \mathbb E_{q(x)} \left[f(x)\right] \approx \frac{1}{S}\sum_{j=1}^S f(x^{(j)})$$
with \(x^{(j)}\) sampled from \(q(x)\). This approximation technique is called Monte Carlo integration.
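As an illustration, the sketch below uses Monte Carlo integration to estimate the KL divergence between two continuous distributions. The choice of distributions is an assumption made for the example: \(q(x)\) is a normal distribution with mean 0 and \(p(x)\) a normal distribution with mean 1, both with unit variance, for which the closed-form divergence is 0.5. The helper names and the sample count are likewise arbitrary.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def monte_carlo_expectation(f, sampler, num_samples):
    """Approximate E_q[f(x)] by averaging f over samples drawn from q."""
    return sum(f(sampler()) for _ in range(num_samples)) / num_samples

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_q():
    # Draw x from q(x), here a standard normal (assumed for this example).
    return rng.gauss(0.0, 1.0)

def log_ratio(x):
    # The integrand log q(x) - log p(x) from the KL definition.
    return log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 1.0, 1.0)

kl_estimate = monte_carlo_expectation(log_ratio, sample_q, 100_000)
```

The estimate fluctuates around the closed-form value, and the fluctuation shrinks roughly as \(1/\sqrt{S}\) as the number of samples \(S\) grows.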
Bayes’ theorem
$$p(s \mid o) = \frac{p(o, s)}{p(o)} = \frac{p(o \mid s)p(s)}{p(o)} = \frac{p(o \mid s)p(s)}{\sum_s p(o \mid s)p(s)}$$
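A small numerical sketch of the theorem for a discrete state \(s\) with two possible values. The prior and likelihood numbers are invented for illustration, and the function name `posterior_from` is hypothetical.

```python
def posterior_from(prior, likelihood):
    """Return p(s | o) and p(o), given p(s) and p(o | s) for one fixed observation o."""
    joint = [l * p for l, p in zip(likelihood, prior)]  # p(o, s) = p(o | s) p(s)
    evidence = sum(joint)                               # p(o) = sum_s p(o | s) p(s)
    return [j / evidence for j in joint], evidence

prior = [0.5, 0.5]       # p(s), assumed for the example
likelihood = [0.9, 0.2]  # p(o | s) for the observed o, assumed for the example

posterior, evidence = posterior_from(prior, likelihood)
```

The likelihood vector does not need to sum to one, since it is a function of \(s\) for a fixed \(o\); dividing by the evidence is what normalizes the posterior.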
Free energy
Free energy is what we minimize in perceptual inference. It can be formulated in many ways and given many different interpretations.
We assume that all variables below are discrete and that all probability distributions are therefore categorical, i.e., each can be described by a vector of probabilities summing to one. Under this assumption, variational free energy can be written as:
$$\mathcal F(q, o) = D_{KL}\left[q(s) \mid \mid p(s \mid o) \right] - \log p(o) =$$
$$\sum_s \log q(s) q(s) - \sum_s \log p(s \mid o) q(s) - \sum_s \log p(o) q(s)=$$
$$D_{KL}\left[q(s) \mid \mid p(s \mid o) \right] - \log p(o) =$$
$$\sum_s \log q(s) q(s) - \sum_s \log p(o \mid s) q(s) - \sum_s \log p(s) q(s)=$$
$$D_{KL}\left[q(s) \mid \mid p(s)\right] - \mathbb{E}_{q(s)} \left[\log p(o \mid s)\right]=$$
$$-\mathbb H[q(s)]-\mathbb{E}_{q(s)}\left[\log p(o, s) \right]$$
The quantity \(\mathbb H\) in the last expression is called entropy:
$$ \mathbb H[q(s)]= \sum_s q(s) \log \frac{1}{q(s)} = \mathbb E_{q(s)} \log \frac{1}{q(s)} = - \mathbb E_{q(s)} \log q(s)$$
The steps above also use the identities
$$p(o, s) = p(o \mid s)p(s) = p(s \mid o)p(o)$$
and
$$\sum_s \log p(o) q(s) = \log p(o) \sum_s q(s) = \log p(o)$$
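The equalities above can be checked numerically. The sketch below evaluates three of the forms for one small example and is only a sanity check under invented numbers: the prior, likelihood and \(q(s)\) are arbitrary, the helper names `kl` and `expectation` are hypothetical, and natural logarithms are assumed.

```python
import math

prior = [0.5, 0.5]       # p(s), assumed
likelihood = [0.9, 0.2]  # p(o | s) for the observed o, assumed
q = [0.7, 0.3]           # an arbitrary approximate posterior q(s)

joint = [l * p for l, p in zip(likelihood, prior)]  # p(o, s)
evidence = sum(joint)                               # p(o)
posterior = [j / evidence for j in joint]           # p(s | o)

def kl(a, b):
    """D_KL[a || b] for categorical distributions a and b."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def expectation(values, dist):
    """Average of `values` weighted by the distribution `dist`."""
    return sum(v * d for v, d in zip(values, dist))

# D_KL[q(s) || p(s | o)] - log p(o)
f_posterior_form = kl(q, posterior) - math.log(evidence)

# D_KL[q(s) || p(s)] - E_q[log p(o | s)]
f_prior_form = kl(q, prior) - expectation([math.log(l) for l in likelihood], q)

# -H[q(s)] - E_q[log p(o, s)]
entropy = -expectation([math.log(qs) for qs in q], q)
f_entropy_form = -entropy - expectation([math.log(j) for j in joint], q)
```

All three variables hold the same value up to floating-point rounding, whichever \(q(s)\) is chosen (as long as it has no zero entries, which the `math.log(qs)` call in the entropy line requires).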
A few other ways to slice variational free energy:
$$\mathcal{F}(q, o) = D_{KL}\left[q(s) \mid \mid p(s \mid o) \right] - \log p(o) = $$
$$\sum_{s} \log \frac{q(s) }{p(s \mid o)} q(s) - \sum_s \log p(o) q(s) = $$
$$\sum_s \log \frac{q(s)}{p(s \mid o)p(o)} q(s) =$$
$$\sum_s \log \frac{q(s) }{p(s, o)} q(s) =$$
$$D_{KL}\left[q(s) \mid \mid p(s, o) \right]$$
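The last form only needs the joint \(p(s, o)\), which is not normalized over \(s\) for a fixed \(o\). Since the KL term in the first line is never negative, the free energy cannot fall below \(-\log p(o)\), and it reaches that value when \(q(s) = p(s \mid o)\). The sketch below illustrates this with invented numbers; the function name `free_energy` is hypothetical.

```python
import math

prior = [0.5, 0.5]       # p(s), assumed
likelihood = [0.9, 0.2]  # p(o | s) for the observed o, assumed

joint = [l * p for l, p in zip(likelihood, prior)]  # p(s, o)
evidence = sum(joint)                               # p(o)
posterior = [j / evidence for j in joint]           # p(s | o)

def free_energy(q, joint):
    """sum_s q(s) log(q(s) / p(s, o)), i.e. the KL form with the unnormalized joint."""
    return sum(qs * math.log(qs / js) for qs, js in zip(q, joint) if qs > 0)

neg_log_evidence = -math.log(evidence)

f_arbitrary = free_energy([0.7, 0.3], joint)  # some q(s) other than the posterior
f_posterior = free_energy(posterior, joint)   # q(s) set to the exact posterior
```

Here `f_posterior` matches `neg_log_evidence` up to rounding, while `f_arbitrary` is larger by exactly the KL divergence between that \(q(s)\) and the posterior.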
A simplification
We assume that the current state only depends on the previous state and that the observation only depends on the current state. This means that:
$$p(s_{0:T}, o_{0:T}) = p(o_T \mid s_{0:T}, o_{0:T-1}) \times p(s_{0:T}, o_{0:T-1})=$$
$$p(o_T \mid s_T)p(s_T \mid o_{0:T-1}, s_{0:T-1})p(o_{0:T-1}, s_{0:T-1})=$$
$$p(o_T \mid s_T)p(s_T \mid s_{T-1})p(o_{0:T-1}, s_{0:T-1})= \ldots =$$
$$p(s_0) \prod_{t=0}^{T} p(o_t \mid s_t) \prod_{t=1}^{T} p(s_t \mid s_{t-1})$$
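The factorized joint can be evaluated with a short loop. This is a sketch with made-up numbers: two states, two possible observations, and the tables `initial`, `transition` and `emission` are hypothetical names for \(p(s_0)\), \(p(s_t \mid s_{t-1})\) and \(p(o_t \mid s_t)\).

```python
from itertools import product

initial = [0.6, 0.4]       # initial[i] = p(s_0 = i)
transition = [[0.7, 0.3],  # transition[i][j] = p(s_t = j | s_{t-1} = i)
              [0.2, 0.8]]
emission = [[0.9, 0.1],    # emission[i][k] = p(o_t = k | s_t = i)
            [0.3, 0.7]]

def joint_probability(states, observations):
    """p(s_{0:T}, o_{0:T}) under the two independence assumptions above."""
    prob = initial[states[0]]
    for t, (s, o) in enumerate(zip(states, observations)):
        prob *= emission[s][o]  # p(o_t | s_t), for t = 0..T
        if t > 0:
            prob *= transition[states[t - 1]][s]  # p(s_t | s_{t-1}), for t = 1..T
    return prob

example = joint_probability((0, 0, 1), (0, 0, 1))

# Summing over every state and observation sequence of length 3 should give 1.
total = sum(
    joint_probability(s, o)
    for s in product(range(2), repeat=3)
    for o in product(range(2), repeat=3)
)
```

The sum over all sequences is a quick check that the product of factors defines a proper joint distribution.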