# Sampling under constraints

Let's say I ask you to toss 2 (fair) coins so that, in total, they show exactly 1 head. How does one do this? Well, in this case the problem is trivial, since there are exactly two ways this can happen, \((H, T)\) or \((T, H)\), and by fairness, these two outcomes are equally likely. So to sample, you flip one coin, and say that the other coin is the opposite of the first.

Now let's make the problem harder. Suppose we look at \(2n\) coins, and ask that exactly \(n\) of them are heads. How do I sample? There are exactly \(\binom{2n}{n}\) ways to do this, which is asymptotically like \(\fr{4^n}{\sqrt{\pi n}}\) when $n \to \infty$ (see this MathOverflow question). One way to sample from this set is to just sample a subset of size \(n\) uniformly from \(\{1, \ldots, 2n\}\), without replacement, and set them to \(H\). Of course, this is equally hard, until you realize that uniform subsets have a nice "Markov" property:

If you have sampled \(X_1, \ldots, X_k\) for the first \(k\) elements, the next item \(X_{k + 1}\) is just a uniform item from everything that has not been picked thus far.

So you can repeat this step \(n\) times to sample a uniform \(n\)-size subset. Why is this fact true though?

Well, first observe that instead of sampling a uniform *set*, you can sample a uniform *sequence* with no repetitions.
The difference is that you also get an order of the elements you pick. That is, you can first sample the sequence $[4,
1, 5]$ and then forget the order to get the set \(\{1, 4, 5\}\). This is sufficient since every set (of size \(n\)) is
counted exactly \(n!\) times (once for each way of *permuting* them).

Now we have reduced our problem to sampling a uniform sequence of \(n\) elements with no repetitions. How do I pick my first element? Imagine the set of all such sequences, and suppose you are looking at the first element. Why will it be more likely for it to be \(1\) compared to \(2\), or even \(n\)? It won't, because of symmetry, so the first element should be picked uniformly.

Suppose you now have picked the first element \(X_1\), say 20, and are trying to pick the second one \(X_2\). Because of your choice of \(X_1\), you have restricted your possibilities for \(X_2\) to anything but 20. But among them, why would anything be more likely than something else? Again, it won't be, by symmetry, so you should pick the second item uniformly among the ones that are not \(20\) (or in general, not \(X_1\)). The pattern is clear.
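The sequential procedure above can be sketched in a few lines of Python (a minimal sketch; the function name and the use of the `random` module are my choices, not from the text):

```python
import random

def uniform_subset(n, total, rng=random):
    """Sample a uniform subset of size n from {1, ..., total} by
    repeatedly picking a uniform element among those not yet chosen."""
    remaining = list(range(1, total + 1))
    chosen = []
    for _ in range(n):
        # by symmetry, the next element is uniform over what is left
        idx = rng.randrange(len(remaining))
        chosen.append(remaining.pop(idx))
    return set(chosen)

# To sample 2n coins with exactly n heads: set the positions in
# uniform_subset(n, 2 * n) to H and the rest to T.
```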

# Do what you can right now

Another way of sampling the sequence of coin tosses above is via the following strategy, which also works in an "incremental" fashion. Recall that we are interested in sampling the bits, say \(Z_1, Z_2, \ldots, Z_{2n}\), conditional on the event \(A\) that \(Z_1 + \ldots + Z_{2n} = n\) (I switched to bits \(\{0, 1\}\) instead of heads and tails). Again, let's ask: what is the (conditional) probability that \(Z_1 = 1\)? To find this, let's compute

\[
\fr{\P(Z_1 = 1 | A)}{\P(Z_1 = 0 | A)} = \fr{\P(A | Z_1 = 1) \, \P(Z_1 = 1)}{\P(A | Z_1 = 0) \, \P(Z_1 = 0)},
\]

where the equality is via Bayes rule. Since \(\P(Z_1 = 1) = \P(Z_1 = 0) = 1/2\), those factors cancel. The other probabilities are also computable:

\[
\P(A | Z_1 = 1) = \P(Z_2 + \ldots + Z_{2n} = n - 1) = \binom{2n - 1}{n - 1} 2^{-(2n - 1)}
\]

by the binomial formula, and

\[
\P(A | Z_1 = 0) = \P(Z_2 + \ldots + Z_{2n} = n) = \binom{2n - 1}{n} 2^{-(2n - 1)},
\]

which are equal since \(\binom{2n - 1}{n - 1} = \binom{2n - 1}{n}\). Thus, the first bit \(Z_1\) should be sampled by a fair coin!

Now say we have sampled \(m - 1\) coins and have \(Z_1 + \ldots + Z_{m - 1} = p\). How should \(Z_m\) be sampled? Same strategy (although harder computation). The rest should have \(n - p\) ones, so we have

\[
\fr{\P(A | Z_1 + \ldots + Z_{m - 1} = p, Z_m = 1)}{\P(A | Z_1 + \ldots + Z_{m - 1} = p, Z_m = 0)} = \fr{\P(Z_{m + 1} + \ldots + Z_{2n} = n - p - 1)}{\P(Z_{m + 1} + \ldots + Z_{2n} = n - p)} = \fr{\binom{2n - m}{n - p - 1}}{\binom{2n - m}{n - p}} = \fr{n - p}{n - m + p + 1}.
\]

You don't need to read through the calculation above, but as a sanity check, observe that if \(m = 1\) (so this is the first coin) and \(p = 0\) (necessarily), the above ratio is 1, which checks out with the simpler calculation we did earlier.

If \(\P(Z_m = 1 | Z_m + \ldots + Z_{2n} = n - p) = q\), then we have

\[
\fr{q}{1 - q} = \fr{n - p}{n - m + p + 1},
\]

which can be solved to get

\[
q = \fr{n - p}{2n - m + 1},
\]

using the formula that the solution to \(\fr{q}{1 - q} = \fr{a}{b}\) is \(q = \fr{a}{a + b}\).

Observe that if at any point, \(p = n\), then \(q = 0\), i.e., you are not allowed to ever get another 1, which makes sense.
Also, when we are sampling the \(m\)th coin, we have \(2n - m + 1\) coins left to sample (counting this one), and as soon as
this is *tight*, that is, we *need* to sample all ones from now on to reach our goal of \(n\), we must have
\(2n - m + 1 = n - p\) and hence \(q = 1\), forcing us to get a 1.

It might be nontrivial to see why this method is equivalent to sampling a random set, but it is a beautiful conclusion nonetheless. Moreover, this allows us to reveal the elements we pick one by one, in increasing order, which is not possible using the random subset method we described above, at least not in its vanilla version.
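Concretely, the incremental strategy amounts to flipping the \(m\)th coin with bias \(q = (n - p)/(2n - m + 1)\), where \(p\) is the number of ones seen so far. A minimal sketch (names are mine):

```python
import random

def sample_balanced_bits(n, rng=random.random):
    """Sample fair bits Z_1, ..., Z_{2n} conditioned on having exactly
    n ones, revealing them one at a time from left to right."""
    bits = []
    p = 0  # number of ones among the bits sampled so far
    for m in range(1, 2 * n + 1):
        # conditional probability that Z_m = 1, given the history
        q = (n - p) / (2 * n - m + 1)
        bit = 1 if rng() < q else 0
        p += bit
        bits.append(bit)
    return bits
```

Note that the edge cases discussed above are automatic: once \(p = n\) we get \(q = 0\), and once the remaining budget is tight we get \(q = 1\), so the output always has exactly \(n\) ones.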

# A general principle

A general principle underlies the technique above. Suppose I have some distribution on sequences \([X_1, \ldots, X_n]\), and
I want to sample it, *conditionally* on some event \(A\). For now, let us restrict to sequences of bits. The above method
suggests the following general strategy:

- Sample \(X_1\) such that \(X_1 = 1\) with probability \(\P(X_1 = 1 | A)\).
- Given samples for \(X_1 = x_1, \ldots, X_{m - 1} = x_{m - 1}\), sample \(X_m\) independently according to $\P(X_m = 1 | A, X_1 = x_1, \ldots, X_{m - 1} = x_{m - 1})$.

This seems reasonable, but what about a formal proof? Turns out, it is not that difficult to justify our method. You can skip ahead to the next section if you do not need a proof to convince you.

Repeatedly using the definition \(\P(E | F) = \P(E, F) / \P(F)\) (*I use \(\P(E, F)\) to mean \(\P(E \cap F)\)*) for events \(E, F\), we have

\[
\P(X_1 = x_1, \ldots, X_n = x_n | A) = \P(X_1 = x_1 | A) \, \P(X_2 = x_2 | A, X_1 = x_1) \cdots \P(X_n = x_n | A, X_1 = x_1, \ldots, X_{n - 1} = x_{n - 1}).
\]

This is exactly what our scheme achieves.
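For intuition, here is a brute-force rendering of the scheme for i.i.d. fair bits and an arbitrary event \(A\) (exponential-time and purely illustrative; all names are my own, not from the text):

```python
import random
from itertools import product

def sample_given_event(n, event, rng=random.random):
    """Sample n i.i.d. fair bits conditioned on event(bits) being true,
    one bit at a time.  P(X_m = 1 | A, prefix) is computed by counting
    completions, since all 2^n bit sequences are equally likely."""
    prefix = []
    for m in range(n):
        rest = n - m - 1
        ones = sum(event(prefix + [1] + list(t))
                   for t in product([0, 1], repeat=rest))
        zeros = sum(event(prefix + [0] + list(t))
                    for t in product([0, 1], repeat=rest))
        q = ones / (ones + zeros)  # conditional probability of a 1
        prefix.append(1 if rng() < q else 0)
    return prefix
```

With the event `lambda bits: sum(bits) == n` this reproduces the balanced coin-flip sampler of the previous section.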

# Markov chains are simpler

Instead of general sequences \(X_1, \ldots, X_n\), let us restrict to *Markov chains*, which are random processes where the
future is dictated only by the present, and not the past. A well known example is the simple random walk, where we start
at 0, and at each step either add \(+1\) or \(-1\), with equal probability, and repeat (*how is this related to our coin
flip example earlier?*).

Also, assume that the event \(A\) only depends on \(X_n\), the value at the final time. For instance, in the random walk case, say \(A = \{X_n < 0\}\). Then our equations above simplify a lot. At some step \(k\), the probability we need to compute is

\[
\P(X_k = x_k | A, X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1}).
\]

Applying Bayes rule, but only partially, we get

\[
\P(X_k = x_k | A, X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1}) = \fr{\P(A | X_k = x_k, X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1}) \, \P(X_k = x_k | X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1})}{\P(A | X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1})}.
\]

We do not wish to deal with the denominator, so let's choose another value \(x_k'\), and compute the ratio of probabilities

\[
\fr{\P(X_k = x_k | A, X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1})}{\P(X_k = x_k' | A, X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1})} = \fr{\P(A | X_k = x_k, X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1}) \, \P(X_k = x_k | X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1})}{\P(A | X_k = x_k', X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1}) \, \P(X_k = x_k' | X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1})}.
\]

Of course, \(\P(X_k = x_k | X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1}) = \P(X_k = x_k | X_{k - 1} = x_{k - 1})\), since the Markov chain is only dictated by the present. Further, for the same reason, \(\P(A | X_k = x_k, X_1 = x_1, \ldots, X_{k - 1} = x_{k - 1}) = \P(A | X_k = x_k)\), since \(A\) depends only on the final value \(X_n\). This simplifies the ratio to

\[
\fr{\P(A | X_k = x_k) \, \P(X_k = x_k | X_{k - 1} = x_{k - 1})}{\P(A | X_k = x_k') \, \P(X_k = x_k' | X_{k - 1} = x_{k - 1})}.
\]

Observe that in the absence of this conditioning, the ratio is just

\[
\fr{\P(X_k = x_k | X_{k - 1} = x_{k - 1})}{\P(X_k = x_k' | X_{k - 1} = x_{k - 1})}.
\]

Thus, our conditioning just changed the relative weights to also include the probabilities that the eventual point will
satisfy the condition \(A\). This is precisely the *Doob \(h\)-transform*. Although I have stated it in the discrete case,
the analogous principle continues to hold in great generality.
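As a concrete illustration (my own sketch, not from the text), here is the \(h\)-transform for a simple random walk of length \(n\) conditioned on \(A = \{X_n < 0\}\): compute \(h(k, x) = \P(X_n < 0 | X_k = x)\) by backward recursion, then reweight each step by \(h\) at the landing point:

```python
import random

def conditioned_walk(n, rng=random.random):
    """Simple random walk X_0 = 0, ..., X_n conditioned on X_n < 0,
    sampled via the Doob h-transform."""
    # Backward recursion for h(k, x) = P(X_n < 0 | X_k = x).
    h = {(n, x): 1.0 if x < 0 else 0.0 for x in range(-n, n + 1)}
    for k in range(n - 1, -1, -1):
        for x in range(-k, k + 1):
            h[(k, x)] = 0.5 * h[(k + 1, x - 1)] + 0.5 * h[(k + 1, x + 1)]
    # Forward sampling: each +-1 step is reweighted by h at its endpoint.
    path, x = [0], 0
    for k in range(n):
        up = 0.5 * h[(k + 1, x + 1)]
        down = 0.5 * h[(k + 1, x - 1)]
        x += 1 if rng() < up / (up + down) else -1
        path.append(x)
    return path
```

Since steps into zero-probability states get weight 0, the sampled path is guaranteed to end below 0, yet it is generated forward in time, one step at a time.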

# Brownian bridges are Markov

To conclude, I wish to present a short calculation showing off the power of this theory. It will require some familiarity with stochastic differential equations, but our computations will be heuristic, so you will be fine as long as you know the basics of Brownian motion.

Imagine Brownian motion \(W_t\) on \([0, 1]\), but forced to return to zero at time 1. This conditioned process is called the Brownian bridge (ignoring subtleties about this singular conditioning). The usual rigorous construction of this process is via \(B_t = W_t - t W_1\), i.e., you compensate Brownian motion linearly to ensure that the value at time 1 is 0. It is absolutely unclear from this description if it is a Markov process or not. Fortunately, the \(h\)-transform provides a great answer. Let us perform a heuristic calculation to derive a differential expression for the Brownian bridge.
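Before the calculation, the construction \(B_t = W_t - t W_1\) is easy to see in code (a discretized sketch; all names are mine):

```python
import math
import random

def bridge_path(steps=1000, rng=random.gauss):
    """Discretize Brownian motion W on [0, 1] on a grid of `steps` steps,
    then form the bridge B_t = W_t - t * W_1, pinned to 0 at both ends."""
    w = [0.0]
    for _ in range(steps):
        # independent Gaussian increment with variance dt = 1/steps
        w.append(w[-1] + rng(0.0, math.sqrt(1.0 / steps)))
    return [w_i - (i / steps) * w[-1] for i, w_i in enumerate(w)]
```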

Assume that at time \(t\), we are at some point \(z\), and from here we are evaluating the relative probabilities of two potential locations \(u < v\) at time \(t + \eps\) (we will imagine \(\eps\) to be tiny, and "set" it to \(dt\) later). The ratio of the transition probabilities is easy:

\[
\fr{\P(W_{t + \eps} = u | W_t = z)}{\P(W_{t + \eps} = v | W_t = z)} = \fr{\Exp{-\fr{(u - z)^2}{2\eps}}}{\Exp{-\fr{(v - z)^2}{2\eps}}}
\]

(excuse the \(\P\) notation; it is actually a *density*, but as I said earlier, this whole computation is
heuristic).

The weighting factors are also not hard to compute:

\[
\P(W_1 = 0 | W_{t + \eps} = u) \propto \fr{1}{\sqrt{1 - t - \eps}} \Exp{-\fr{u^2}{2(1 - t - \eps)}} \approx \fr{1}{\sqrt{1 - t}} \Exp{-\fr{u^2}{2(1 - t)}},
\]

where the \(\approx\) just drops the \(\eps\) and will be exact in the \(\eps \to 0\) limit.

Thus the overall ratio is \(\Exp{-f(u)/2} / \Exp{-f(v)/2}\) where \(f\) is defined as

\[
f(u) = \fr{(u - z)^2}{\eps} + \fr{u^2}{1 - t} = \fr{(u - \g z)^2}{\lam^2} + g(z),
\]

where \(g\) is some function depending only on \(z\) (and \(t, \eps\), but not on \(u\)),

\[
\g = \fr{1 - t}{1 - t + \eps},
\]

and

\[
\lam^2 = \fr{\eps (1 - t)}{1 - t + \eps}.
\]

The \(g(z)\) term will cancel since it will appear for both \(u\) and \(v\).

The form of this expression tells us that the weighted ratio is exactly that of a Gaussian \(N(\g z, \lam^2)\), and therefore, the update rule should be

\[
B_{t + \eps} = \g B_t + \lam \xi = \fr{1 - t}{1 - t + \eps} B_t + \lam \xi, \qquad \xi \sim N(0, 1).
\]

If you are familiar with stochastic calculus, you might recognize immediately that this is just a weird way of writing

\[
dB_t = -\fr{B_t}{1 - t} \, dt + dW_t,
\]

which is exactly the SDE for \(B\). As a sanity check, observe that near \(t = 1\), if \(B_t\) is far from \(0\), the drift term \(-\fr{B_t}{1 - t}\) tries to yank it back strongly towards 0. The strength increases, as it should, as \(t \to 1\). The presence of an SDE expression for \(B\) indicates that this is indeed a Markov process.
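To close the loop, one can simulate this SDE by an Euler-Maruyama discretization and watch the drift pin the path near 0 as \(t \to 1\) (a heuristic discretization; the step count and names are my choices):

```python
import math
import random

def simulate_bridge_sde(steps=1000, rng=random.gauss):
    """Euler-Maruyama for dB = -B/(1 - t) dt + dW on [0, 1), stopping one
    step before t = 1 to avoid the singular drift at the endpoint."""
    dt = 1.0 / steps
    b, t = 0.0, 0.0
    for _ in range(steps - 1):
        b += -b / (1.0 - t) * dt + rng(0.0, math.sqrt(dt))
        t += dt
    return b  # value just before time 1; should be close to 0
```

The bridge's variance at time \(s\) is \(s(1 - s)\), so the returned value should typically be within a few multiples of \(\sqrt{dt}\) of zero.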