Information Theory and Large Deviations in the Foundations of Statistics

08 May 2023 23:14

These are three subjects which matter to me, and I'm especially interested in their interconnections. Of course, lots of people are convinced that there are profound connections between information theory and statistics, but I tend to think that the most popular flavor of this, the maximum entropy principle, is deeply misguided. ("CRS disdains his elders and betters, details at 11.") I will, in the usual way, use this notebook to record useful references, things I want to read, etc., but I also can't resist sketching some of what, in my supremely arrogant opinion, is the the right way to think about how these subjects are linked. Specifically: statistics relies on large deviations properties, which are naturally expressed information-theoretically.

Estimating Distributions, Sanov's Theorem, and Relative Entropy; or, You Can Observe a Lot Just By Looking

What Pitman calls the "fundamental theorem of statistics" is the Glivenko-Cantelli Theorem, the assertion that the empirical distribution function converges on the true distribution function. Let's define some terms. \( X_1, X_2, \ldots X_n, \ldots \) are a sequence of random variables, independent and distributed according to some common probability measure \( P \) on the space \( \mathcal{X} \). The empirical measure \( \hat{P}_n \) of a set \( A \) is \[ \hat{P}_n(A) \equiv \frac{1}{n}\sum_{i=1}^{n}{\mathbf{1}_{A}(X_i)} \] or, using the Dirac delta function, \[ \hat{P}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{\delta_{X_i}} \] For one-dimensional variables, consider sets \( A = (-\infty,a] \). The Glivenko-Cantelli theorem asserts that, as \( n\rightarrow \infty \), \[ \sup_{a}{\left| \hat{P}_n(A) - P(A) \right|} \rightarrow 0 \] \( P \)-almost surely. So as we get more and more data, the sample frequencies match the probability of all such one-sided intervals arbitrarily closely. From this, it follows that the sample frequency of any interval matches the probability arbitrarily closely, and so the expectation value of any reasonable test function matches arbitrarily closely. In a phrase, \( \hat{P}_n \) converges in distribution on \( P \), almost surely. This result extends straight-forwardly to any finite-dimensional space.

How quickly does it converge? Here we need to appeal to large deviations theory, and specifically to a result called Sanov's theorem. Let's start with a lie told to children rough sketch. What is the probability that \( \hat{P}_n \) is close to any particular probability measure \( Q \)? Sanov's theorem says that, what \( n \) is large, \[ \Pr{\left( \hat{P}_n \approx Q \right)} \approx \exp{\left\{-nD(Q \| P)\right\}} \] where \( D(Q \| P) \) is the relative entropy or Kullback-Leibler divergence, and probability is taken under the actual distribution of the \( X_i \), i.e., under \( P \). When the space \( X \) takes values in is discrete, this is just the average log-likelihood ratio, \[ D(Q \| P) = \sum_{x}{Q(x) \log{\frac{Q(x)}{P(x)}}} \] with the understanding that \( 0\log{0} = 0 \). (Use L'Hopital's rule if you don't believe that.)

(Why is this the right rate function? Having told a lie to children, my logical inhibitions are already relaxed, so I will be utterly fallacious heuristic. What is the probability that \( \hat{P} \approx Q \), for any given distribution \( Q \)? We'll make the sample space discrete and finite, so every distribution is really some multinomial. When \( X_1, \ldots X_n \) are of "type \( Q \)", meaning that \( \hat{P}_n \approx Q \), the number of \( t \) with \( X_t = x \) must be \( \approx n Q(x) \) for each \( x \in \mathcal{X} \). By elementary combinatorics, the number of length-\( n \) samples of type \( Q \) is \( \frac{n!}{\prod_{x \in \mathcal{X}}{(nQ(x))!}} \). (Notice that the generating probability distribution \( P \) doesn't matter here. Also notice that I'm blithely pretending \( nQ(x) \) is an integer; like I said, this is heuristic.) The probability of the generating distribution \( P \) producing any one \( Q \) type sample is \( \prod_{x \in \mathcal{X}}{P(x)^{nQ(x)}} \), by the IID assumption. So \[ \begin{eqnarray} \Pr(\hat{P}_n \approx Q) & = & n! \prod_{x \in \mathcal{X}}{\frac{P(x)^{nQ(x)}}{(nQ(x))!}}\\ \frac{1}{n}\log{\Pr(\hat{P}_n \approx Q)} & = & \frac{1}{n}\left(\log{(n!)} + \sum_{x\in\mathcal{X}}{n Q(x)\log{P(x)} - \log{(nQ(x))!}}\right)\\ & \approx & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{P(x)} - Q(x)\log{(nQ(x))}}\\ & = & \log{n} + \sum_{x \in \mathcal{X}}{Q(x)\log{(P(x))} - Q(x)\log{n} - Q(x)\log{Q(x)}}\\ & = & \sum_{x \in \mathcal{X}}{Q(x)\log{\frac{P(x)}{Q(x)}}}\\ & = & - D(Q\|P) \end{eqnarray} \] using Stirling's approximation, \( \log{n!} \approx n\log{n} \), and the fact that \( Q \) is a probability distribution, so \( \sum_{x \in \mathcal{X}}{Q(x)\log{n}} = \log{n} \). Remarkably enough, this tissue of fallacies heuristic sketch can be made completely rigorous, even for continuous distributions [e.g., by a sequences of successively refined discretizations of the continuous space].)

Ordinarily, in information theory, if the data comes from \( P \) and we consider using the wrong distribution \( Q \), the penalty we pay for coding is \( D(P \| Q ) \), which is the other way around. But we can understand why the measures should be flipped here. If \( Q(x) = 0, P(x) > 0 \) for some \( x \), nothing special happens, that particular \( x \) just drops out of the sum. But if \( Q(x) > 0, P(x) = 0 \) for some \( x \), then \( D(Q\|P) = \infty \). Said in words, if \( Q \) merely gives zero probability to some values that \( P \) allows, it becomes exponentially unlikely that a large sample looks like \( Q \). But if \( Q \) gives positive probability to a value that \( P \) forbids, there is absolutely no chance that any sample looks like \( Q \).

When \(0 < D(Q \| P) < \infty\), Sanov's theorem tells us that the probability of getting a sample that looks like \( Q \) when drawing data from \( P \) is exponentially small. If \( D(Q \| P) = 0 \), then the probability if getting samples like \( Q \) is decaying sub-exponentially if at all, but one can show that \( D(Q \| P) = 0 \Leftrightarrow Q = P \), so that probability is actually tending towards 1.

I said that my statement of Sanov's theorem was rough. A more precise statement is that, as \( n \rightarrow \infty \), \[ -\frac{1}{n}\log{\Pr{\left(\hat{P}_n \in B\right)}} \rightarrow \inf_{Q \in B}{D(Q \| P)} \] where now \( B \) is some set of probability measures. The relative entropy is the rate function, giving the rate at which the probability of a large deviation from the expected behavior goes to zero. One way to read the \( \inf \) is is to say that the probability of the sample being in some "bad" set \( B \) is dominated by the least-bad distribution in the set --- "if an improbable event happens, it tends to happen in the least-improbable way".

If \( X \) is continuous, then we could use the probability density to define the divergence, \[ D(Q\|P) = \int_{x}{q(x) \log{\frac{q(x)}{p(x)}} dx} \]

In yet more general cases, we fall back on measure theory, and use the Radon-Nikodym derivative, \[ D(Q\|P) = \int{\log{\frac{dQ}{dP}(x)}dQ(x)} \] which reduces to the two earlier formulas in the special cases. We can still read Sanov's theorem as above.

(Strictly speaking, I should really distinguish between the probability of \( \hat{P}_n \) of being in the closure of \( B \) and the probability of its being in the interior, and between \( \limsup \) and \( \liminf \) of the rates, but those details are not important for the present purposes.)

Recall that we wanted to know how quickly the empirical distribution \( \hat{P}_n \) converges to the true distribution \( P \). Fix some neighborhood \( N_{\epsilon} \) of \( P \), with whatever way you like of gauging how far apart two distributions. Then we have \[ -\frac{1}{n}\log{\Pr{\left( \hat{P}_n \not\in N_{\epsilon} \right)}} \rightarrow \inf_{Q \not\in N_{\epsilon}}{D(Q \| P)} \] That infimum will be some increasing function of \( \epsilon \), say \( r(\epsilon) \). [If we measure the distance between probability measures in total variation, it's \( O(\epsilon^2) \).] So, roughly speaking, the probability of being more than \( \epsilon \) away from the truth is about \( \exp{\left\{ -nr(\epsilon) \right\}} \), neither more nor less. (It's exponentially small, but only exponentially small.) More formally, Sanov's theorem gives matching upper and lower bounds on the asymptotic large deviations probability.

Hypothesis Testing, the Neyman-Pearson Lemma, and Relative Entropy

Sanov's theorem thus tells us something about the efficacy of estimating a distribution by just sitting and counting. It can also tell us about hypothesis testing. We take the most basic possible hypothesis testing problem, of distinguishing between a null distribution \( P_0 \) and an alternative distribution \( P_1 \). We know, from the Neyman-Pearson lemma, that the optimal test is a likelihood-ratio test: say "1" when \[ \Lambda_n = \frac{1}{n}\sum_{i=1}^{n}{\log{\frac{p_1(x_i)}{p_0(x_i)}}} \] is above some threshold \( t_n \), and say "0" when it's below. The value of this threshold is usually picked to enforce a desired false-alarm rate \( \alpha \), i.e., it is the shadow price of power at a certain sample size.

There are two error probabilities here: the probability of saying "1" under \( P_0 \), and the probability of saying "0" under \( P_1 \). The first, the false-alarm rate, is controlled by fixing \( t \).

Since the log likelihood ratio is a function of the sample, we can write it in terms of the empirical distribution: \[ \Lambda_n = \mathbf{E}_{\hat{P}_n}\left[ \log{\frac{p_1(X)}{p_0(X)}} \right] \]

Saying that \( \Lambda_n < t_n \) is thus equivalent to saying that \( \hat{P}_n \in B(t_n) \), for some set of distributions \( B(t_n) \). Enforcing the constraint on the false alarm probability means that \[ \Pr_{P_0}{\left( \hat{P}_n \in B(t_n) \right)} = 1-\alpha \] At the same time, again by Sanov's theorem, the probability, under \( P_0 \), that \( -\Lambda_n \not \in (D(P_0\|P_1) - \epsilon, D(P_0\|P_1) + \epsilon) \) is going to zero exponentially fast in \( n \). So it must be the case that \( t_n \rightarrow -D(P_0\|P_1) \).

Now what about those miss (type II error) probabilities? This is when the test should say "1" but says instead "0". \[ \beta_n = \Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)} \] So, using Sanov again, \[ -\frac{1}{n}\log{\beta_n} = \frac{1}{n}\log{\Pr_{P_1}{\left( \hat{P}_n \in B(t_n) \right)}} \rightarrow \inf_{Q \in B(t_n)}{D(Q \| P_1)} \] When is \( Q \in B(t_n) \)? To keep things simple, assume we can multiply and divide by \( q(x) \) everywhere we need to. \[ \begin{eqnarray*} t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{p_0(x)}}\right]\\ t_n & \geq & \mathbf{E}_{Q}\left[ \log{\frac{p_1(x)}{q(x)}\frac{q(x)}{p_0(x)}}\right]\\ t_n & \geq & \mathbf{E}_{Q}\left[ -\log{\frac{p_1(x)}{q(x)}}+\log{\frac{q(x)}{p_0(x)}}\right]\\ t_n & \geq & D(Q\|P_0) - D(Q\|P_1)\\ D(Q\|P_1) & \geq & D(Q\|P_0) - t_n \end{eqnarray*} \] In other words, we're looking for distributions which are sufficiently closer, in divergence, to \( P_0 \) as opposed to \( P_1 \). Since \( t_n \rightarrow -D(P_0\|P_1) \), as we saw above, in the long run we need \( D(Q\|P_1) \geq D(Q\|P_0) + D(P_0\|P_1) \). To make this as small as possible, set \( Q = P_0 \). Thus \[ -\frac{1}{n}\log{\beta_n} = D(P_0 \| P_1) \]

This argument can also be made to work if the size is allowed to go to zero --- if we constrain the size to be \( \alpha_n \rightarrow 0 \). By parallel reasoning, if we wanted to require a fixed power (constrained to \( \beta \), or constrained to \( \beta_n \rightarrow 0 \) ) and let size go to zero, the best exponential rate we could attain for the false alarm rate is \( D(P_1 \| P_0) \).

To sum up, the large deviation rate function gives the error rate for optimal hypothesis tests.

The Point Where I Gave Up Writing This Out

There is a natural duality between hypothesis testing and parameter estimation. If I have a consistent estimator \( \hat{\theta} \), I can use it to test whether \( \theta = \theta_0 \) by seeing whether \( \hat{\theta} \approx \theta_0 \). In particular, estimators cannot converge too fast, or they would give us tests which would break the bounds we've just seen. Lower bounds on hypothesis testing this breed lower bounds on estimation error. This actually gives us the Cramer-Rao bound, as well as some refinements. (Though there is a slicker way of deriving Cramer-Rao from divergence.)

The General Pattern

Statistical inference depends on some kind of empirical distribution --- the marginal distribution (as above), or over pairs (as in a Markov process), or path measures, or motifs in a graph. When the probability gods are kind, that empirical distribution obeys a large deviations principle, which gives matching upper and lower bounds for fluctuations away from the data-generating distribution (at least on an exponential scale). The rate in the LDP is (usually) a relative entropy / KL divergence. This is why information-theoretic quantities control statistical inference.

Further topics I'd cover here, if I had the energy:

See also: Estimating Entropies and Informations; Information Theory; Large Deviations Theory; Statistics;