The objective of perceptual inference is to find the distribution over useful mental states based on sensory observations of the world. A wet feeling on the face when going outside, for instance, should give a high probability to the mental state rain, but also some probability to the mental state active lawn sprinkler.
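To make this concrete, here is a minimal numerical sketch of the wet-face example in Python (all probabilities are made up for illustration):

```python
import numpy as np

# Made-up numbers: two mental states, rain and sprinkler.
p_s = np.array([0.7, 0.3])           # prior over mental states p(s)
p_o_given_s = np.array([0.9, 0.5])   # likelihood p(o | s) of the observation "wet face"

p_o = np.sum(p_o_given_s * p_s)      # marginal probability of the observation
posterior = p_o_given_s * p_s / p_o  # Bayes' rule: p(s | o)

print(posterior)  # rain gets most of the probability mass, the sprinkler some
```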
The loss function used for perceptual inference is variational free energy (VFE). It can be expressed in many ways, each with a different but complementary interpretation.
Below are a few different ways to express VFE, \(\mathcal F[q; o]\)¹:
$$\mathcal F[q; o] =$$
$$\sum_s q(s) \log q(s) - \sum_s q(s) \log p(s, o) = \ \ \ \ (1a)$$
$$\sum_s q(s) \log q(s) - \sum_s q(s) \log p(s \mid o) - \sum_s q(s) \log p(o) = \ \ \ \ (2a)$$
$$\sum_s q(s) \log q(s) - \sum_s q(s) \log p(o \mid s) - \sum_s q(s) \log p(s) \ \ \ \ (3a)$$
Each of these formulations can be expressed as a KL-divergence with a “residual term” (in \((1b)\) there is no residual, but the second argument \(p(s, o)\) is unnormalized, so strictly it is a generalized KL-divergence):
$$\mathcal F[q; o] =$$
$$D_{KL}\left[q(s) \mid \mid p(s, o) \right] = \ \ \ \ (1b)$$
$$D_{KL}\left[q(s) \mid \mid p(s \mid o) \right] - \log p(o) = \ \ \ \ (2b)$$
$$D_{KL}\left[q(s) \mid \mid p(s)\right] - \mathbb{E}_{q(s)} \left[\log p(o \mid s)\right] \ \ \ \ (3b)$$
And a bonus variant:
$$-\mathbb H[q(s)]-\mathbb{E}_{q(s)}\left[\log p(o, s) \right] \ \ \ \ (1c)$$
The quantity \(\mathbb H\) in the last expression is the entropy:
$$\mathbb H[q(s)] = \sum_s q(s) \log \frac{1}{q(s)} = \mathbb E_{q(s)} \log \frac{1}{q(s)} = -\mathbb E_{q(s)} \log q(s)$$
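As a sanity check, here is a small numerical sketch (again with made-up probabilities) confirming that \((1a)\), \((2b)\), \((3b)\) and \((1c)\) all compute the same number:

```python
import numpy as np

p_s = np.array([0.7, 0.3])           # prior p(s)
p_o_given_s = np.array([0.9, 0.5])   # likelihood p(o | s) for the observed o
p_joint = p_o_given_s * p_s          # p(o, s) for the observed o
p_o = p_joint.sum()                  # p(o)
p_s_given_o = p_joint / p_o          # true posterior p(s | o)

q = np.array([0.6, 0.4])             # an arbitrary approximate posterior q(s)
H_q = -np.sum(q * np.log(q))         # entropy H[q(s)]

F_1a = np.sum(q * np.log(q)) - np.sum(q * np.log(p_joint))
F_2b = np.sum(q * np.log(q / p_s_given_o)) - np.log(p_o)
F_3b = np.sum(q * np.log(q / p_s)) - np.sum(q * np.log(p_o_given_s))
F_1c = -H_q - np.sum(q * np.log(p_joint))

print(F_1a, F_2b, F_3b, F_1c)  # four identical values
```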
In the derivations we have also used
$$p(o, s) = p(o \mid s)p(s) = p(s \mid o)p(o)$$
and
$$\sum_s q(s) \log p(o) = \log p(o) \sum_s q(s) = \log p(o)$$
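As an example, the step from \((1a)\) to \((2b)\) spelled out with these identities:
$$\begin{aligned}
\mathcal F[q; o] &= \sum_s q(s) \log q(s) - \sum_s q(s) \log p(s, o) \\
&= \sum_s q(s) \log q(s) - \sum_s q(s) \log p(s \mid o) - \sum_s q(s) \log p(o) \\
&= \sum_s q(s) \log \frac{q(s)}{p(s \mid o)} - \log p(o) \\
&= D_{KL}\left[q(s) \mid \mid p(s \mid o)\right] - \log p(o)
\end{aligned}$$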
Expression \((2b)\) can be interpreted as the sum of divergence and surprise.
Divergence measures how close the approximate posterior is to the true posterior distribution of states, \(p(s \mid o)\); the smaller the divergence, the more accurate the approximation.
Surprise, \(-\log p(o)\), measures how far the current observation is from an expected observation (or, equivalently, how far the current mental state is from an expected mental state). For a fish, it is surprising to be on land. The more probable the observation is under the generative model, the smaller the surprise. The probability of the observation is given by:
$$p(o) = \sum_s p(o \mid s)p(s)$$
This means that surprise is a function of the expected mental states, \(p(s)\). If the current observation is likely for state \(s\) and that state is an expected state, then \(p(o)\) will also be high and the observation will be unsurprising. (In later posts we will discuss the distinction between expected mental states and desired mental states. As everybody knows, they are not always the same. At least not for humans.)
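With the made-up wet-face numbers from above, the surprise can be computed directly:

```python
import numpy as np

p_s = np.array([0.7, 0.3])           # expected mental states p(s)
p_o_given_s = np.array([0.9, 0.5])   # likelihood p(o | s) of "wet face"

p_o = np.sum(p_o_given_s * p_s)      # p(o) = sum_s p(o | s) p(s)
print(p_o, -np.log(p_o))             # a probable observation gives a small surprise
```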
Expression \((3b)\) can be interpreted as the difference between complexity and accuracy.
Complexity measures how far the approximate posterior is from the prior beliefs, the expected mental states \(p(s)\).
Accuracy, \(\mathbb{E}_{q(s)}[\log p(o \mid s)]\), is high when the current observation is probable under the mental states to which \(q(s)\) assigns belief.
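The tradeoff can be illustrated with the same made-up numbers: staying with the prior has zero complexity but poor accuracy, an overconfident belief is more accurate but costly in complexity, and the true posterior strikes the best balance:

```python
import numpy as np

p_s = np.array([0.7, 0.3])            # prior p(s)
p_o_given_s = np.array([0.9, 0.5])    # likelihood p(o | s)
posterior = p_o_given_s * p_s / np.sum(p_o_given_s * p_s)

def complexity(q):  # D_KL[q(s) || p(s)]
    return np.sum(q * np.log(q / p_s))

def accuracy(q):    # E_q[log p(o | s)]
    return np.sum(q * np.log(p_o_given_s))

for name, q in [("prior", p_s),
                ("overconfident", np.array([0.99, 0.01])),
                ("posterior", posterior)]:
    print(f"{name}: complexity={complexity(q):.3f}, "
          f"accuracy={accuracy(q):.3f}, F={complexity(q) - accuracy(q):.3f}")
```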
The complexity–accuracy formulation is particularly interesting, as it offers a mathematical explanation of the empirically well-supported psychological phenomenon of cognitive dissonance [5]. There is a cognitive cost and discomfort associated with changing one's beliefs, especially if those beliefs are considered part of one's identity. Try, for instance, to talk somebody out of their religion or political sympathies by pointing to observations (or the lack thereof).
Expression \((1c)\) is usually interpreted as the difference between energy and entropy.
If the energy is low, i.e., if the inferred mental state distribution is probable under the generative model, we can afford a high-precision (peaked) \(q(s)\), meaning low entropy. If the energy is high, meaning that the generative model does not assign high probability to the inferred mental state distribution, then this has to be compensated with a high-entropy (flat) \(q(s)\) to keep the free energy low. High entropy means that we are uncertain and therefore must spread the probability mass over many mental states.
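A small sketch with the same made-up numbers: a flat \(q(s)\) buys entropy at the price of energy, a peaked one does the opposite, and free energy is their difference:

```python
import numpy as np

p_s = np.array([0.7, 0.3])
p_o_given_s = np.array([0.9, 0.5])
log_p_joint = np.log(p_o_given_s * p_s)  # log p(o, s) for the observed o

def energy(q):   # -E_q[log p(o, s)]
    return -np.sum(q * log_p_joint)

def entropy(q):  # H[q(s)]
    return -np.sum(q * np.log(q))

for q in (np.array([0.5, 0.5]), np.array([0.95, 0.05])):
    print(f"energy={energy(q):.3f}, entropy={entropy(q):.3f}, "
          f"F={energy(q) - entropy(q):.3f}")
```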
All the formulations are communicating vessels: if you, for instance, minimize the divergence in \((2b)\), which is the only term that can be minimized in perceptual inference (the surprise does not depend on \(q\)), then you also get the optimal tradeoff between complexity and accuracy and between energy and entropy. It can also be shown that \(\mathcal F[q; o] \geq 0\).
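A brute-force sketch of this, with the same made-up numbers: scanning over all \(q(s) = [a, 1 - a]\) shows that the minimum of \(\mathcal F\) sits at the true posterior, where \(\mathcal F\) equals the surprise \(-\log p(o) \geq 0\):

```python
import numpy as np

p_s = np.array([0.7, 0.3])
p_o_given_s = np.array([0.9, 0.5])
p_joint = p_o_given_s * p_s
p_o = p_joint.sum()
posterior = p_joint / p_o

def free_energy(q):  # formulation (1a)
    return np.sum(q * np.log(q)) - np.sum(q * np.log(p_joint))

grid = np.linspace(0.01, 0.99, 99)
F = [free_energy(np.array([a, 1 - a])) for a in grid]
a_star = grid[int(np.argmin(F))]

print(a_star, posterior[0])   # minimizer matches p(s = rain | o), about 0.81
print(min(F), -np.log(p_o))   # minimal F equals the surprise, which is >= 0
```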
- The square brackets indicate that \(\mathcal F\) is a functional, i.e., a function of a function: it takes the distribution \(q\) as its argument. ↩︎