# A "distance" between distributions

Given two probability distributions \(P\) and \(Q\) on some common space \(\Om\) (we will assume things are discrete for
simplicity), one may ask how different they are. A very useful "metric" for this is the *KL divergence*, defined as
\[ \KL(P, Q) = \sum_{x \in \Om} P(x) \log \fr{P(x)}{Q(x)}, \]
and read as the *KL divergence from \(P\) to \(Q\)*. It is a simple exercise on Jensen's inequality that this is always
non-negative, and zero only when \(P = Q\).

The reason this is called a divergence, and **not** a metric, is that it is not *symmetric*: $\KL(P, Q) \neq \KL(Q,
P)$ in general. Wikipedia
states,

> While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem.

For example, let \(P = \Ber(p), Q = \Ber(q)\) be two *Bernoulli* distributions, i.e., if \(X \sim P\), then
\(X = 1\) with probability \(p\) and zero otherwise. Then,
\[ \KL(P, Q) = p \log \fr{p}{q} + (1 - p) \log \fr{1 - p}{1 - q}, \]
and \(\KL(Q, P)\) is the same with the roles of \(q\) and \(p\) reversed. An easy way to check that this is not symmetric is to set \(p = 1/2, q = 1\), so that
\[ \KL(P, Q) = \fr{1}{2} \log \fr{1/2}{1} + \fr{1}{2} \log \fr{1/2}{0} = +\infty, \qquad \KL(Q, P) = 1 \cdot \log \fr{1}{1/2} = \log 2, \]
where we write \(+\infty\) for \(\fr{1}{2} / 0\) (to be precise, KL divergence is defined to be \(+\infty\) if a division by zero occurs, so we are good). However, this asymmetry is not always present. For instance, for two Gaussian distributions with the same variance 1,
\[ \KL(\mathcal{N}(\mu_1, 1), \mathcal{N}(\mu_2, 1)) = \fr{(\mu_1 - \mu_2)^2}{2}, \]
which is symmetric!
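To make the asymmetry concrete, here is a small numerical check of the Bernoulli formula above (a sketch; `kl_bernoulli` is a helper I am defining here, not a library function):

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p), Ber(q)); defined as +infinity if a division by zero occurs."""
    total = 0.0
    for a, b in [(p, q), (1 - p, 1 - q)]:
        if a == 0:
            continue            # the convention 0 * log(0/b) = 0
        if b == 0:
            return math.inf     # division by zero: KL is +infinity
        total += a * math.log(a / b)
    return total

print(kl_bernoulli(0.5, 1.0))   # inf: moderate truth, extreme belief
print(kl_bernoulli(1.0, 0.5))   # log 2 ~ 0.693: swapping the roles gives a finite answer
```

Swapping the arguments flips an infinite divergence into \(\log 2\), exactly as in the \(p = 1/2, q = 1\) example.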

# Asymmetry encodes something

The lack of symmetry in KL divergence is, however, not a flaw. A beautiful interpretation, which also explains this
asymmetry, is via the information-theoretic notion of *surprise*.

Say some experiment produces outcomes in some space \(\Om\) (it is enough to think about \(\Om = \{ H, T \}\), the coin-toss
experiment, but remember that our coins will not always be fair). A collection of outcomes is called an event; we will
denote these by \(A, B, C, \ldots\). Suppose we have some preconceived notions about the coin (say we think the coin is
actually fair), and this grants us a probability distribution \(Q\) over the outcomes (so we think that \(Q(H) = Q(T) = 1/2\)).
Now define the *surprise* of an event \(A\) as
\[ S(A) = -\log Q(A). \]
There exist formal
characterizations
of this definition, and why it makes sense, but allow me to argue heuristically. Firstly, an event that happens
with probability 1 does not surprise you, which matches since \(-\log 1 = 0\). Further, events of smaller probability
surprise us more when they happen, which also checks out. *With these conditions, and additivity under independence, we
get the stated form, but I won't discuss this further*.

Now suppose I have some belief \(Q\), but the *true law of the experiment* is \(P\) (so say I hoped the coin is fair,
but it is actually heads with only probability \(1/3\)). Then how much am I surprised,
on average?
\[ \E_P[-\log Q(X)] = -\sum_{x \in \Om} P(x) \log Q(x), \]
where \(\E_P\) denotes expectation with respect to \(P\). Of course, since surprise is always non-negative, there is some surprise even if I guessed correctly, i.e., if \(Q = P\),
\[ \E_P[-\log P(X)] = -\sum_{x \in \Om} P(x) \log P(x). \]
The interesting quantity, then, is the *excess surprise due to incorrect beliefs*:
\[ \E_P[-\log Q(X)] - \E_P[-\log P(X)] = \sum_{x \in \Om} P(x) \log \fr{P(x)}{Q(x)}. \]
Voilà! This is the KL divergence!
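A quick numerical check of this identity, using the coin from above (the truth \(P = \Ber(1/3)\), the belief \(Q = \Ber(1/2)\); variable names are mine):

```python
import math

P = {"H": 1/3, "T": 2/3}   # true law: heads with probability 1/3
Q = {"H": 1/2, "T": 1/2}   # my belief: a fair coin

# Average surprise under the wrong belief Q, experienced under the true law P
surprise_wrong = sum(P[x] * -math.log(Q[x]) for x in P)
# Unavoidable surprise with the correct belief (Q = P)
surprise_right = sum(P[x] * -math.log(P[x]) for x in P)

# Excess surprise due to incorrect beliefs ...
excess = surprise_wrong - surprise_right
# ... equals the KL divergence, computed directly from its definition
kl = sum(P[x] * math.log(P[x] / Q[x]) for x in P)

print(excess, kl)   # the two numbers agree
```

The excess is strictly positive here, as it must be whenever \(Q \neq P\).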

Further, there is no reason to think that this excess surprise should be symmetric. Indeed, if your beliefs are too extreme (say you think \(q = 1\), i.e., that the coin will certainly land heads), you have a huge (and here, infinite) expected surprise, because you are infinitely surprised as soon as you see tails. However, more moderate beliefs will have less trouble even with extreme results, since some probability is allocated to those events as well. We saw this earlier with \(p = 1/2, q = 1\), and in general, when the belief \(Q\) is much more extreme than the truth \(P\), one should expect
\[ \KL(P, Q) \gg \KL(Q, P). \]
Before going on to the next section, I wish to indicate another fact about KL that might be useful. If \(P^n, Q^n\) are the
(joint) laws of \(n\) i.i.d. draws from \(P\) and \(Q\) respectively, then \(\KL(P^n, Q^n) = n \KL(P, Q)\). We refer to this
by saying that KL *tensorizes*.
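Tensorization is easy to verify numerically by enumerating the product space (a sketch, for two Bernoulli-like laws and a small \(n\); the helper `kl` is mine):

```python
import math
from itertools import product

def kl(P, Q):
    """KL divergence between two discrete distributions given as dicts."""
    return sum(P[x] * math.log(P[x] / Q[x]) for x in P)

P = {0: 0.7, 1: 0.3}
Q = {0: 0.5, 1: 0.5}

# Joint laws of n i.i.d. draws: probabilities multiply across coordinates
n = 4
Pn = {xs: math.prod(P[x] for x in xs) for xs in product(P, repeat=n)}
Qn = {xs: math.prod(Q[x] for x in xs) for xs in product(Q, repeat=n)}

print(kl(Pn, Qn), n * kl(P, Q))   # equal: KL tensorizes
```

The log in the definition turns the product of probabilities into a sum, which is exactly why the factor \(n\) appears.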

# A result in large deviations

Say I toss \(n\) coins with independent outcomes $X_1, X_2, \ldots, X_n \sim
\Ber(q)$ for some \(q\) (here a head is counted as 1 and a tail as 0). \(S_n = X_1 + \ldots + X_n\) is the total number of
heads I see. Think of \(n\) as being very large. By usual laws of large numbers, \(S_n \approx nq\), in the sense that
\(S_n/n \approx q + \eps\) where \(\eps\) is typically \(O(n^{-1/2})\). In fact, the Central Limit
Theorem tells us that \(\sqrt{n}\eps\) is
approximately a Gaussian random variable, whose tails are *very tight* (squared-exponentially so).

In such a setting, the event \(\{ S_n \geq nx \}\) is very unlikely when \(x > q\).
One of the fundamental
results
in large deviations theory identifies *exactly how unlikely* it is:
\[ \lim_{n \to \infty} \fr{1}{n} \log \P(S_n \geq nx) = -\KL(x, q), \]
where we shorten our notation to mean \(\KL(x, q) = \KL(\Ber(x), \Ber(q))\). That is,
\[ \P(S_n \geq nx) = e^{-n \KL(x, q) + o(n)}. \]
To the best of my knowledge, this result is one of the most natural and elementary occurrences of KL divergence in probability.
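One can watch this exponential rate emerge by computing the binomial tail exactly (in log space, to avoid underflow; both helper functions are mine, and I guard \(\lceil nx \rceil\) against floating-point error):

```python
import math

def kl_bernoulli(x, q):
    """KL(Ber(x), Ber(q)) for 0 < x < 1 and 0 < q < 1."""
    return x * math.log(x / q) + (1 - x) * math.log((1 - x) / (1 - q))

def log_tail_rate(n, q, x):
    """(1/n) log P(S_n >= n x) for S_n ~ Binomial(n, q), via log-sum-exp."""
    k0 = math.ceil(n * x - 1e-9)   # guard against n*x landing just above an integer
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log(1 - q)
            for k in range(k0, n + 1)]
    m = max(logs)
    return (m + math.log(sum(math.exp(v - m) for v in logs))) / n

q, x = 0.5, 0.7
for n in (100, 500, 2000):
    print(n, log_tail_rate(n, q, x))   # creeps up towards -KL(x, q)
print(-kl_bernoulli(x, q))
```

The convergence is slow (the correction is of order \(\log n / n\), coming from polynomial prefactors), but the limit is unmistakable.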

However, the form of this result appears to be at odds with the earlier interpretation. Isn't the true distribution \(\Ber(q)\), and doesn't the surprise happen when it has to behave like a \(\Ber(x)\) in order to satisfy \(S_n \geq nx\)? By that logic, the KL involved should be \(\KL(q, x)\) and not \(\KL(x, q)\).

As the next section shows, this intuition is mistaken.

# Surprise is the lack of information

As earlier, let's say you have some belief \(Q\) about some experiment, let's say *by how much some stock goes up
tomorrow*. Let's also assume that you are absolutely correct, the true law is indeed \(Q\).
But, this time, you are the victim of insider trading -- some event \(A\) has already happened (let's say \(A\)
is the event that the CEO resigned), but you were not told about it. Thus, the true distribution after this disaster, \(P\),
is the *conditional distribution* of \(Q\) given \(A\), i.e., \(P(x) = Q(x) / Q(A)\) for all \(x \in A\) (and zero elsewhere),
but you are blissfully unaware. How different is the world now? Via some simple algebra,
\[ \KL(P, Q) = \sum_{x \in A} \fr{Q(x)}{Q(A)} \log \fr{Q(x)/Q(A)}{Q(x)} = \log \fr{1}{Q(A)} = -\log Q(A). \]
That is, the expected excess surprise now, i.e., KL, is exactly the surprise of \(A\) under your original (correct) beliefs.

The result above can be rephrased to yield
\[ \KL(Q_A, Q) = -\log Q(A), \]
where \(Q_A = P\) is the conditional distribution of \(Q\) given \(A\). Observe that this is **not** \(\KL(Q, Q_A)\), which can
be huge since \(Q_A\) is a more extreme belief than \(Q\) (why?).
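The identity is easy to confirm on a toy example (the three-outcome stock distribution below is mine, purely for illustration):

```python
import math

# A toy belief Q about tomorrow's stock move, and the event A that already happened
Q = {"up": 0.5, "flat": 0.3, "down": 0.2}
A = {"flat", "down"}                 # say: "the stock does not go up"

QA = {x: Q[x] / sum(Q[y] for y in A) for x in A}   # Q conditioned on A

# KL(Q_A, Q) collapses to the surprise of A under Q
kl = sum(QA[x] * math.log(QA[x] / Q[x]) for x in QA)
print(kl, -math.log(sum(Q[y] for y in A)))   # both equal -log Q(A)
```

Note the other direction: \(\KL(Q, Q_A)\) would be infinite here, since \(Q\) puts mass on "up", where \(Q_A\) puts none.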

# The conditional distribution

The proof of the result mentioned earlier does not actually go via this route of computing \(\KL(Q_A,Q)\), but instead
resorts to a *tilting trick* -- I will leave that for another day. However, in general, solving problems on large
deviations often involves understanding the conditional law under the rare event \(A\). The expression above illustrates
why this is the case.

A warning: comparing this expression with the theorem earlier seems to suggest that \(Q_A\) is comparable to the law of an i.i.d. sequence of \(\Ber(x)\) draws. One must be careful with this comparison, since the sum of \(n\) i.i.d. copies of \(\Ber(x)\) fluctuates at \(O(\sqrt{n})\), but under the conditioned measure, the fluctuation of \(S_n\) is only \(O(1)\) (this can be derived, at least heuristically, from the large deviation result above).