Notebooks

Estimating Entropies and Informations

18 Oct 2023 13:56

The central mathematical objects in information theory are the entropies of random variables. These ("Shannon") entropies are properties of the probability distributions of the variables, rather than of particular realizations. (This is unlike the Boltzmann entropy of statistical mechanics, which is an objective property of the macroscopic state, at least once we fix our partition of microscopic states into macroscopic states. Confusing the two entropies is common but bad.) The question which concerns me here is how to estimate the entropy of the distribution, given a sample of realizations. The most obvious approach, if one knows the form of the distribution but not the parameters, is to estimate the parameters and then plug in. But it feels like one should be able to do this more non-parametrically. The obvious non-parametric estimator is of course just the entropy of the empirical distribution, sometimes called the "empirical entropy". However, the empirical distribution isn't always the best estimate of the true distribution (one might prefer, e.g., some kind of kernel density estimate). For that matter, we often don't really care about the distribution, just its entropy, so some more direct estimator would be nice.
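To make the contrast concrete, here is a minimal sketch in Python (not from the original note) of the two routes for a discrete sample: the plug-in entropy of the empirical distribution, and a parametric estimate got by fitting a geometric distribution and plugging its MLE into the closed-form entropy. The function names and the choice of the geometric family are purely illustrative.

    import numpy as np
    from collections import Counter

    def empirical_entropy(sample, base=2):
        # Plug-in estimate: the entropy of the empirical distribution of the sample.
        counts = np.array(list(Counter(sample).values()), dtype=float)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum() / np.log(base)

    def geometric_plugin_entropy(sample, base=2):
        # Parametric route: fit a geometric distribution on {1, 2, ...} by maximum
        # likelihood, then plug the estimated parameter into the closed-form entropy
        # H = [-(1-p) log(1-p) - p log p] / p.
        p = 1.0 / np.mean(sample)            # MLE of the success probability
        h_nats = (-(1.0 - p) * np.log1p(-p) - p * np.log(p)) / p
        return h_nats / np.log(base)

    rng = np.random.default_rng(1)
    x = rng.geometric(0.3, size=500)         # toy data from a known distribution
    print(empirical_entropy(x), geometric_plugin_entropy(x))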

What would be really nice would be to not just have point estimates but also confidence intervals. Non-parametrically, my guess is that the only feasible way to do this is bootstrapping. [Update, 2022: That guess was wrong! Moon and Hero (2014) and Kandasamy et al. (2015) both give non-parametric estimators with (asymptotic) confidence intervals that don't require bootstrapping. I particularly like the Kandasamy et al. approach, but I admit to a certain amount of home-team bias there.]
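As a sketch of the bootstrap route (again Python, again illustrative): resample the data with replacement, recompute the plug-in entropy on each resample, and read off a percentile interval. This reuses the empirical_entropy function and the toy sample x from the sketch above; the number of resamples and the percentile method are arbitrary choices, and the interval inherits the well-known downward bias of the plug-in estimator.

    def bootstrap_entropy_ci(sample, level=0.95, n_boot=2000, rng=None):
        # Percentile-bootstrap confidence interval for the plug-in entropy.
        rng = np.random.default_rng() if rng is None else rng
        sample = np.asarray(sample)
        boots = np.array([empirical_entropy(rng.choice(sample, size=sample.size,
                                                       replace=True))
                          for _ in range(n_boot)])
        alpha = 1.0 - level
        lo, hi = np.quantile(boots, [alpha / 2.0, 1.0 - alpha / 2.0])
        return empirical_entropy(sample), (lo, hi)

    print(bootstrap_entropy_ci(x))           # point estimate and 95% interval, in bits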

For finite alphabets, one approach would be to use something like variable length Markov chains, or causal state reconstruction, to reconstruct a machine capable of generating the sequence. From the machine, it is easy to calculate the entropy of words or blocks of any finite length, and even the entropy rate. My experience with using CSSR is that the entropy rate estimates can be very good even when the overall reconstruction of the structure is very poor, but I don't have any real theory on that. I suspect CSSR converges on the true entropy rate faster than do variable length Markov chains, because the former has greater expressive power, but again I don't know that for sure.
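Once one does have a machine in hand, say an ordinary first-order Markov chain over the observed alphabet, the entropy rate is a direct calculation: h = -sum_i pi_i sum_j T_ij log T_ij, where T is the transition matrix and pi its stationary distribution. (For an epsilon-machine, the analogous formula uses the stationary distribution over causal states and the symbol-emission probabilities from each state.) A minimal sketch, with a made-up two-state transition matrix standing in for whatever a reconstruction algorithm would actually return; numpy is assumed imported as in the earlier sketches.

    def markov_entropy_rate(T, base=2):
        # Entropy rate of a first-order Markov chain: weight each state's
        # transition entropy by the stationary probability of that state.
        T = np.asarray(T, dtype=float)
        evals, evecs = np.linalg.eig(T.T)                   # left eigenvectors of T
        pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
        pi = pi / pi.sum()                                  # stationary distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            logT = np.where(T > 0, np.log(T), 0.0)          # treat 0 log 0 as 0
        return -(pi[:, None] * T * logT).sum() / np.log(base)

    T = [[0.9, 0.1],
         [0.4, 0.6]]                 # hypothetical reconstructed transition matrix
    print(markov_entropy_rate(T))    # bits per symbol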

Using gzip is a bad idea (for this purpose; it works fine for data compression).

(Thanks to Sankaran Ramakrishnan for pointing out a think-o in an earlier version.)

