Information Geometry

27 Feb 2017 16:30

This is a slightly misleading name for the application of differential geometry to families of probability distributions, and so to statistical models. Information does, however, play two roles in it: Kullback-Leibler information, or relative entropy, features as a measure of divergence (not quite a metric, because it's asymmetric and doesn't satisfy the triangle inequality), and Fisher information takes the role of curvature. One very nice thing about information geometry is that it gives us strong tools for proving results about statistical models, simply by considering them as well-behaved geometrical objects. Thus, for instance, it's basically a tautology to say that a manifold is not changing much in the vicinity of points of low curvature, and changing greatly near points of high curvature. Stated more precisely, and then translated back into probabilistic language, this becomes the Cramér-Rao inequality: the variance of an unbiased parameter estimator is at least the reciprocal of the Fisher information. As someone who likes differential geometry, and is now interested in statistics, I find this very pleasing.
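The Cramér-Rao bound is easy to check numerically in a case where it is actually attained. The sketch below (my own toy example, not from any of the literature discussed here) simulates the maximum-likelihood estimator of a Bernoulli parameter, whose sampling variance equals the reciprocal of the total Fisher information:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 1000, 2000   # true parameter, sample size, simulation replicates

# Fisher information of a single Bernoulli(p) observation: I(p) = 1/(p(1-p))
fisher_info = 1.0 / (p * (1 - p))
# Cramér-Rao: any unbiased estimator from n observations has variance >= 1/(n I(p))
cramer_rao_bound = 1.0 / (n * fisher_info)

# The MLE of p is the sample mean; estimate its sampling variance by simulation
estimates = rng.binomial(n, p, size=reps) / n
print(estimates.var(), cramer_rao_bound)
```

For the Bernoulli family the MLE is unbiased and attains the bound exactly, so the two printed numbers agree up to Monte Carlo error; for other families the simulated variance would sit strictly above the bound at finite n.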

As a physicist, I have always been somewhat bothered by the way statisticians seem to accept particular parametrizations of their models as obvious and natural, and build those parametrizations into their procedures. In linear regression, for instance, it's reasonably common for them to want to find models with only a few non-zero coefficients. This makes my thumbs prick, because it seems to me obvious that if I regressed on arbitrary linear combinations of my covariates, I would have exactly the same information (provided the transformation is invertible), and so I would really be looking at exactly the same model --- but in general I would no longer have a small number of non-zero coefficients. In other words, I want to be able to do coordinate-free statistics. Since differential geometry lets me do coordinate-free physics, information geometry seems like an appealing way to do this. There are various information-geometric model selection criteria, which I want to know more about; I suspect, based purely on this disciplinary prejudice, that they will out-perform coordinate-dependent criteria.
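The point about reparametrization can be made concrete in a few lines. In this sketch (again my own illustration, with made-up variable names), the least-squares fit on the original covariates and the fit on an invertible linear recombination of them give identical fitted values --- the same model --- while the sparse coefficient vector in one coordinate system becomes dense in the other:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5

X = rng.normal(size=(n, d))
beta = np.array([2.0, 0.0, 0.0, 0.0, 0.0])   # sparse "true" coefficients
y = X @ beta + rng.normal(scale=0.1, size=n)

A = rng.normal(size=(d, d))   # a generic (almost surely invertible) matrix
Z = X @ A                     # same column space, different coordinates

b_x, *_ = np.linalg.lstsq(X, y, rcond=None)
b_z, *_ = np.linalg.lstsq(Z, y, rcond=None)

# The fitted values --- the model itself --- are identical in both coordinates,
# but the coefficients transform as b_x = A b_z, and sparsity is destroyed.
print(np.allclose(X @ b_x, Z @ b_z))   # True
print(b_z)                             # generally all non-zero
```

So a criterion that rewards few non-zero coefficients is rewarding a property of the coordinate system, not of the model.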

I should also mention that statistical physics, while it does no actual statistics, is also very much concerned with probability distributions. Shun-ichi Amari, who is the leader of a large and impressive Japanese school of information geometers, has a nice result (in, e.g., his "Hierarchy of Probability Distributions" paper) showing that maximum entropy distributions are, exactly, the ones with minimal interaction between their variables --- the ones which come closest to independence. I think this throws a very interesting new light on the issue of why we can assume equilibrium corresponds to a state of maximum entropy (pace Jaynes, assuming independence is clearly not an innocent way of saying "I really don't know anything more"). I also see, via the arXiv, that people are starting to think about phase transitions in information-geometric terms, which seems natural in retrospect, though I can't comment further, not having read the papers.
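A toy version of the maximum-entropy/independence connection (nothing like Amari's hierarchy construction, just the simplest special case, with my own variable names): among all joint distributions of two binary variables with given marginals, the independent (product) distribution has the largest entropy. Shifting probability mass along the diagonal preserves the marginals but lowers the entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Fixed marginals for two binary variables
px = np.array([0.6, 0.4])
py = np.array([0.7, 0.3])

indep = np.outer(px, py)   # the independent joint with these marginals

# A correlated joint with the SAME marginals: move mass along the diagonal
eps = 0.05
corr = indep + eps * np.array([[1, -1], [-1, 1]])

assert np.allclose(corr.sum(axis=1), px) and np.allclose(corr.sum(axis=0), py)
print(entropy(indep.ravel()), entropy(corr.ravel()))  # independent one is larger
```

The same monotone trade-off between entropy and interaction holds in Amari's general setting, where the "interactions" form a whole hierarchy of orders rather than a single correlation.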

See also: Exponential Families of Probability Measures, where the geometry is especially nice; Filtering and State Estimation, for some papers on differential-geometric ideas in statistical state estimation and signal processing; Partial Identification of Parametric Statistical Models.