# A "distance" between distributions

Given two probability distributions \(P\) and \(Q\) on some common space \(\Om\) (we will assume things are discrete for
simplicity), one may ask how different they are. A very useful "metric" for this is the *KL divergence*, defined as
\[ \KL(P, Q) = \sum_{x \in \Om} P(x) \log \fr{P(x)}{Q(x)}, \]
and read as the *KL divergence from \(P\) to \(Q\)*. It is a simple exercise on Jensen's inequality that this is always
non-negative, and zero only when \(P = Q\).

The reason this is called a divergence, and **not** a metric, is that it is not *symmetric*: $\KL(P, Q) \neq \KL(Q,
P)$ in general. Wikipedia
states,

> While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem.

For example, let \(P = \Ber(p), Q = \Ber(q)\) be two *Bernoulli* distributions, i.e., if \(X \sim P\), then
\(X = 1\) with probability \(p\) and zero otherwise. Then,
\[ \KL(P, Q) = p \log \fr{p}{q} + (1 - p) \log \fr{1 - p}{1 - q}, \]
and \(\KL(Q, P)\) is the same with the roles of \(q\) and \(p\) reversed. An easy way to check that this is not symmetric is to set \(p = 1/2, q = 1\), so that
\[ \KL(P, Q) = \fr{1}{2} \log \fr{1/2}{1} + \fr{1}{2} \log \fr{1/2}{0} = +\infty, \qquad \KL(Q, P) = 1 \cdot \log \fr{1}{1/2} = \log 2, \]
where we write \(+\infty\) for \(\fr{1}{2} / 0\) (to be precise, KL divergence is defined to be \(+\infty\) if a division by zero occurs, so we are good). However, this asymmetry is not always present. For instance, for two Gaussian distributions with the same variance 1,
\[ \KL(\mathcal{N}(\mu_1, 1), \mathcal{N}(\mu_2, 1)) = \fr{(\mu_1 - \mu_2)^2}{2}, \]
which is symmetric!
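To make the asymmetry concrete, here is a small numerical check of the Bernoulli formula above (a sketch; `kl_bernoulli` is a helper I am defining here, not a library function):

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p), Ber(q)); defined as +infinity if a division by zero occurs."""
    total = 0.0
    for a, b in [(p, q), (1 - p, 1 - q)]:
        if a == 0:
            continue            # the convention 0 * log(0/b) = 0
        if b == 0:
            return math.inf     # division by zero: KL is +infinity
        total += a * math.log(a / b)
    return total

print(kl_bernoulli(0.5, 1.0))   # inf: moderate truth, extreme belief
print(kl_bernoulli(1.0, 0.5))   # log 2 ~ 0.693: swapping the roles gives a finite answer
```

Swapping the arguments flips an infinite divergence into \(\log 2\), exactly as in the \(p = 1/2, q = 1\) example.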

# Asymmetry encodes something

The lack of symmetry in KL divergence is, however, not a flaw. A beautiful interpretation, which also explains this
asymmetry, is via the information-theoretic notion of *surprise*.

Say some experiment produces outcomes in some space \(\Om\) (it is enough to think about \(\Om = \{ H, T \}\), the coin-toss
experiment, but remember that our coins will not always be fair). A collection of outcomes is called an event; we will
denote these by \(A, B, C, \ldots\). Suppose we have some preconceived notions about the coin (say we think the coin is
actually fair), and this grants us a probability distribution \(Q\) over the outcomes (so we think that \(Q(H) = Q(T) = 1/2\)).
Now define the *surprise* of an event \(A\) as
\[ S(A) = -\log Q(A). \]
There exist formal
characterizations
of this definition, and why it makes sense, but allow me to argue heuristically. Firstly, an event that happens
with probability 1 does not surprise you, which matches since \(-\log 1 = 0\). Further, events of smaller probability
surprise us more when they happen, which also checks out. *With these conditions, and additivity under independence, we
get the stated form, but I won't discuss this further*.

Now suppose I have some belief \(Q\), but the *true law of the experiment* is \(P\) (so say I hoped the coin is fair,
but it is actually heads with only probability \(1/3\)). Then how much am I surprised,
on average?
\[ \E_P[-\log Q(X)] = -\sum_{x \in \Om} P(x) \log Q(x), \]
where \(\E_P\) denotes expectation with respect to \(P\). Of course, since surprise is always non-negative, there is some surprise even if I guessed correctly, i.e., if \(Q = P\),
\[ \E_P[-\log P(X)] = -\sum_{x \in \Om} P(x) \log P(x). \]
The interesting quantity, then, is the *excess surprise due to incorrect beliefs*:
\[ \E_P[-\log Q(X)] - \E_P[-\log P(X)] = \sum_{x \in \Om} P(x) \log \fr{P(x)}{Q(x)}. \]
Voilà! This is the KL divergence!
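A quick numerical check of this identity, using the coin from above (the truth \(P = \Ber(1/3)\), the belief \(Q = \Ber(1/2)\); variable names are mine):

```python
import math

P = {"H": 1/3, "T": 2/3}   # true law: heads with probability 1/3
Q = {"H": 1/2, "T": 1/2}   # my belief: a fair coin

# Average surprise under the wrong belief Q, experienced under the true law P
surprise_wrong = sum(P[x] * -math.log(Q[x]) for x in P)
# Unavoidable surprise with the correct belief (Q = P)
surprise_right = sum(P[x] * -math.log(P[x]) for x in P)

# Excess surprise due to incorrect beliefs ...
excess = surprise_wrong - surprise_right
# ... equals the KL divergence, computed directly from its definition
kl = sum(P[x] * math.log(P[x] / Q[x]) for x in P)

print(excess, kl)   # the two numbers agree
```

The excess is strictly positive here, as it must be whenever \(Q \neq P\).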

Further, there is no reason to think that this excess surprise should be symmetric. Indeed, if your beliefs are too extreme (say you think \(q = 1\), i.e., that the coin will certainly land heads), you have a huge (and here, infinite) expected surprise, because you are infinitely surprised as soon as you see tails. However, more moderate beliefs will have less trouble even with extreme results, since some probability is allocated to those events as well. We saw this earlier with \(p = 1/2, q = 1\), and in general, when the belief \(Q\) is much more extreme than the truth \(P\), one should expect
\[ \KL(P, Q) \gg \KL(Q, P). \]
Before going on to the next section, I wish to indicate another fact about KL that might be useful. If \(P^n, Q^n\) are the
(joint) laws of \(n\) i.i.d. draws from \(P\) and \(Q\) respectively, then \(\KL(P^n, Q^n) = n \KL(P, Q)\). We refer to this
by saying that KL *tensorizes*.
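Tensorization is easy to verify numerically by enumerating the product space (a sketch, for two Bernoulli-like laws and a small \(n\); the helper `kl` is mine):

```python
import math
from itertools import product

def kl(P, Q):
    """KL divergence between two discrete distributions given as dicts."""
    return sum(P[x] * math.log(P[x] / Q[x]) for x in P)

P = {0: 0.7, 1: 0.3}
Q = {0: 0.5, 1: 0.5}

# Joint laws of n i.i.d. draws: probabilities multiply across coordinates
n = 4
Pn = {xs: math.prod(P[x] for x in xs) for xs in product(P, repeat=n)}
Qn = {xs: math.prod(Q[x] for x in xs) for xs in product(Q, repeat=n)}

print(kl(Pn, Qn), n * kl(P, Q))   # equal: KL tensorizes
```

The log in the definition turns the product of probabilities into a sum, which is exactly why the factor \(n\) appears.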

# A result in large deviations

Say I toss \(n\) coins with independent outcomes $X_1, X_2, \ldots, X_n \sim
\Ber(q)$ for some \(q\) (here a head is counted as 1 and a tail as 0). \(S_n = X_1 + \ldots + X_n\) is the total number of
heads I see. Think of \(n\) as being very large. By usual laws of large numbers, \(S_n \approx nq\), in the sense that
\(S_n/n \approx q + \eps\) where \(\eps\) is typically \(O(n^{-1/2})\). In fact, the Central Limit
Theorem tells us that \(\sqrt{n}\eps\) is
approximately a Gaussian random variable, whose tails are *very tight* (squared-exponentially so).

In such a setting, the event \(\{ S_n \geq nx \}\) is very unlikely when \(x > q\).
One of the fundamental
results
in large deviations theory identifies *exactly how unlikely* it is:
\[ \lim_{n \to \infty} \fr{1}{n} \log \P(S_n \geq nx) = -\KL(x, q), \]
where we shorten our notation to mean \(\KL(x, q) = \KL(\Ber(x), \Ber(q))\). That is,
\[ \P(S_n \geq nx) = e^{-n \KL(x, q) + o(n)}. \]
To the best of my knowledge, this result is one of the most natural and elementary occurrences of KL divergence in probability.
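One can watch this exponential rate emerge by computing the binomial tail exactly (in log space, to avoid underflow; both helper functions are mine, and I guard \(\lceil nx \rceil\) against floating-point error):

```python
import math

def kl_bernoulli(x, q):
    """KL(Ber(x), Ber(q)) for 0 < x < 1 and 0 < q < 1."""
    return x * math.log(x / q) + (1 - x) * math.log((1 - x) / (1 - q))

def log_tail_rate(n, q, x):
    """(1/n) log P(S_n >= n x) for S_n ~ Binomial(n, q), via log-sum-exp."""
    k0 = math.ceil(n * x - 1e-9)   # guard against n*x landing just above an integer
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log(1 - q)
            for k in range(k0, n + 1)]
    m = max(logs)
    return (m + math.log(sum(math.exp(v - m) for v in logs))) / n

q, x = 0.5, 0.7
for n in (100, 500, 2000):
    print(n, log_tail_rate(n, q, x))   # creeps up towards -KL(x, q)
print(-kl_bernoulli(x, q))
```

The convergence is slow (the correction is of order \(\log n / n\), coming from polynomial prefactors), but the limit is unmistakable.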

However, the form of this result appears to be at odds with the earlier interpretation. Isn't the true distribution \(\Ber(q)\), and doesn't the surprise happen when it has to behave like a \(\Ber(x)\) in order to satisfy \(S_n \geq nx\)? By that logic, the KL involved should be \(\KL(q, x)\) and not \(\KL(x, q)\).

As the next section shows, this intuition is mistaken.

# Surprise is the lack of information

As earlier, let's say you have some belief \(Q\) about some experiment, let's say *by how much some stock goes up
tomorrow*. Let's also assume that you are absolutely correct, the true law is indeed \(Q\).
But, this time, you are the victim of insider trading -- some event \(A\) has already happened (let's say \(A\)
is the event that the CEO resigned), but you were not told about it. Thus, the true distribution after this disaster, \(P\),
is the *conditional distribution* of \(Q\) given \(A\), i.e., \(P(x) = Q(x) / Q(A)\) for all \(x \in A\) (and zero elsewhere),
but you are blissfully unaware. How different is the world now? Via some simple algebra,
\[ \KL(P, Q) = \sum_{x \in A} \fr{Q(x)}{Q(A)} \log \fr{Q(x)/Q(A)}{Q(x)} = \log \fr{1}{Q(A)} = -\log Q(A). \]
That is, the expected excess surprise now, i.e., KL, is exactly the surprise of \(A\) under your original (correct) beliefs.

The result above can be rephrased to yield
\[ \KL(Q_A, Q) = -\log Q(A), \]
where \(Q_A = P\) is the conditional distribution of \(Q\) given \(A\). Observe that this is **not** \(\KL(Q, Q_A)\), which can
be huge since \(Q_A\) is a more extreme belief than \(Q\) (why?).
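The identity is easy to confirm on a toy example (the three-outcome stock distribution below is mine, purely for illustration):

```python
import math

# A toy belief Q about tomorrow's stock move, and the event A that already happened
Q = {"up": 0.5, "flat": 0.3, "down": 0.2}
A = {"flat", "down"}                 # say: "the stock does not go up"

QA = {x: Q[x] / sum(Q[y] for y in A) for x in A}   # Q conditioned on A

# KL(Q_A, Q) collapses to the surprise of A under Q
kl = sum(QA[x] * math.log(QA[x] / Q[x]) for x in QA)
print(kl, -math.log(sum(Q[y] for y in A)))   # both equal -log Q(A)
```

Note the other direction: \(\KL(Q, Q_A)\) would be infinite here, since \(Q\) puts mass on "up", where \(Q_A\) puts none.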

# The conditional distribution

The proof of the result mentioned earlier does not actually go via this route of computing \(\KL(Q_A,Q)\), but instead
resorts to a *tilting trick* -- I will leave that for another day. However, in general, solving problems on large
deviations often involves understanding the conditional law under the rare event \(A\). The expression above illustrates
why this is the case.

A warning: comparing this expression with the theorem earlier seems to suggest that \(Q_A\) is comparable to the law of an i.i.d. sequence of \(\Ber(x)\) draws. One must be careful with this comparison, since the sum of \(n\) i.i.d. copies of \(\Ber(x)\) fluctuates at \(O(\sqrt{n})\), but under the conditioned measure, the fluctuation of \(S_n\) is only \(O(1)\) (this can be derived, at least heuristically, from the large deviation result above).