|ALL YOUR BAYES ARE BELONG TO US|
|we share with him|
Attention conservation notice: jargony, self-promotional ramblings about a new paper on the frequentist properties of non-parametric Bayesian procedures, which is exactly as statistics-geeky as it sounds. Plus a weird analogy to mathematical models of evolution. Even if this sounds vaguely interesting, you could always check back later and see if peer review exposed it as a tissue of fallacies. [It didn't.]
Here's the new preprint:
There are people out there who see Bayes's Rule as the key to all methodologies, something essential to rationality. I find this view thoroughly misguided and not even a regulative ideal. But I've written about that at great length elsewhere and won't repeat myself here.
While there are certainly plenty of statisticians who embrace the Bayesian way in all its Savagery (some of them will be on my tenure committee), I think it's fair to say that most of the time, Bayesian methods are not used in contemporary statistics and machine learning to impose coherence on statisticians', or computers', personal beliefs. When people like Andy Gelman, Aleks Jakulin &c. estimate logistic regressions with a particular default prior, what they are doing is really regularization, which is something that a frequentist like me can understand.
Actually, Bayesian inference is a regularization-by-smoothing technique, only in the space of probability distributions instead of the sample space. In ordinary least-squares regression, one asks "what straight line comes closest (in the Euclidean sense) to all the data points?" In subtler methods, one has a whole menagerie of possible curves, and asks which superposition of curves comes closest to the data points. To avoid over-fitting, one needs to keep the weight given to very wiggly basis curves under control, e.g., by simply ignoring possible terms which are not sufficiently smooth. (Through the magic of Lagrange multipliers, this is equivalent to penalizing wiggliness.) As one gets more and more data, finer and finer features of the regression curve can be reliably discerned, so the amount of imposed smoothing should be relaxed. For estimating regression curves, this is all thoroughly understood and even automated (making the persistence of linear regression a mystery; but another time).
All of this carries over to estimating entire distributions rather than just curves. In the space of probability distributions, the entire sample is a single point, the empirical distribution. (I'm ignoring some subtleties about time series here.) The family of models we're willing to consider forms a manifold (or set of manifolds; more details) in the space of probability measures. In this space, the right way to measure distance isn't the Euclidean metric, but rather the relative entropy, a.k.a. the Kullback-Leibler divergence, which generates the information geometry. The maximum likelihood estimate is simply the geometric projection of the empirical distribution on to the manifold of models. In other words, the most likely predictive distribution comes from this single point on the manifold. (There are cleverer ways to get frequentist predictive distributions, though.) The Bayesian posterior predictive distribution gives some weight to all points on the manifold; this is the smoothing. The weights are proportional to the prior, and also decline exponentially with the product of sample size and relative entropy. The effect of the prior is to blunt the impact of what we actually see, keeping us from simply trying to match it — exactly like the effect of smoothing. The amount of smoothing done by the prior declines as the sample size grows.
Why smooth or regularize in the first place? As every school-child knows, the answer lies in the bias-variance trade-off. (Pedantically, bias-variance is for mean-squared error, but other loss functions have similar decompositions.) Regularized estimates are less responsive to differences between data sets, which means that they have less variance. No matter what the data look like, the regularized estimates all have the same shape, so to speak. But unless reality also has that shape, this creates a systematic error, i.e., bias. So the more aggressively we smooth, the more we decrease variance, but the more we increase bias. If some degree of smoothing removes more variance than it adds bias, we come out ahead. Of course, the variance should fall automatically as we get more data, but the bias depends on the strength of smoothing, so the latter should shrink as the sample grows. The important thing in analyzing this type of estimation or prediction scheme is to check whether they relax their regularization at the right rate, so that they don't get killed by either bias or variance, but rather consistently converge on the truth, or at least the best approximation to the truth among the available models.
All of this applies to Bayesian learning. Like any other regularization scheme, it is a way of deliberately introducing bias into our inferences, not so much on Burkean/Quinean grounds that our prejudices providentially point to the truth, but simply to reduce variance, the extent to which our inferences are at the mercy of Fortune (in her role as the goddess of sampling fluctuations). The question about it then becomes whether it gets the trade-off right, and manages to converge despite its bias. In other words, when does Bayesian inference possess frequentist consistency?
A surprisingly large number of people have been satisfied with a result given by the great probabilist J. L. Doob, which essentially says that under some reasonable-looking conditions, the Bayesian learner's prior probability of being inconsistent is zero. (More exactly, the posterior probability of any set of models containing the truth goes to 1, except on a set of sample paths whose prior probability is zero.) Even taken at face value, this just says that each Bayesian is convinced a priori that they'll converge on the truth, not that they actually are almost sure to find the truth.
Examining the reasonable-looking conditions of Doob's result, it turns out that they entail the existence of a consistent non-Bayesian estimator. (Doob's original assumptions can actually be weakened to just the existence of such an estimator; see Schervish's Theory of Statistics, ch. 7.) It is a curious fact that every proof of the consistency of Bayesian learning I know of requires the existence of a consistent non-Bayesian estimator. (Though it need not be the maximum likelihood estimator.) There don't seem to be any situations where Bayesian updating is the only convergent route to the truth.
It turns out that Bayesian inference is not consistent in general. The late David Freedman and the very-much-alive Persi Diaconis showed that if you choose a prior distribution which is badly adapted to the actual data-generating process, your posterior distribution will happily converge on the wrong model, even though the rest of the set-up is very tame — independent and identically distributed data in a boringly well-behaved sample space, etc.
Still, there are many situations where Bayesian learning does seem to work reasonably effectively, which in light of the Freedman-Diaconis results needs explaining, ideally in a way which gives some guidance as to when we can expect it to work. This is the origin of the micro-field of Bayesian consistency or Bayesian nonparametrics, and it's here that I find I've written a paper, rather to my surprise.
I never intended to work on this. In the spring of 2003, I was going to the statistics seminar in Ann Arbor, and one week the speaker happened to be Yoav Freund, talking about this paper (I think) on model averaging for classifiers. I got hung up on why the weights of different models went down exponentially with the number of errors they'd made. It occurred to me that this was what would happen in a very large genetic algorithm, if a solution's fitness was inversely proportional to the number of errors it made, and there was no mutation or cross-over. The model-averaged prediction would just be voting over the population. This made me feel better about why model averaging was working, because using a genetic algorithm to evolve classifier rules was something I was already pretty familiar with.
The next day it struck me that this story would work just as well for Bayesian model averaging, with weights depending on the likelihood rather than the number of errors. In fact, I realized, Bayes's rule just is the discrete-time replicator equation, with different hypotheses being so many different replicators, and the fitness function being the conditional likelihood.
As you know, Bob, the replicator dynamic is a mathematical representation of the basic idea of natural selection. There are different kinds of things, the kinds being called "replicators", because things of one kind cause more things of that same kind to come into being. The average number of descendants per individual is the replicator's fitness; this can depend not only on the properties of the replicator and on time and chance, but also on the distribution of replicators in the population; in that case the fitness is "frequency dependent". In its basic form, fitness-proportional selection is the only evolutionary mechanism: no sampling, no mutation, no cross-over, and of course no sex. The result is that replicators with above-average fitness increase their share of the population, while replicators with below-average fitness dwindle.
This is a pretty natural way of modeling half of the mechanism Darwin and Wallace realized was behind evolution, the "selective retention" part — what it leaves out is "blind variation". Even with this omission, the replicator equation is a surprisingly interesting kind of dynamical system, especially when fitness is frequency-dependent, which opens up deep connections to evolutionary game theory. (ObBook.) Interestingly, however, Bayes is a very limited special case of the replicator equation, since fitness is frequency independent.
"Selective retention" is also the idea that lies behind reinforcement learning and Thorndike's law of effect. Crudely, these are all variations on the theme of "do more of what worked, and less of what didn't". Less crudely, there are at least three independent discoveries of how reinforcement learning, itself, leads to the replicator equation in the appropriate limit. So Bayesian updating isn't just a special case of evolutionary optimization; it's also something like habit formation.
Initially, I wrote this all up in the spring of 2003 and then set it aside, because, after making the formal connections, I didn't see what it was good for. Then, that summer, I went to the first "Science et Gastronomie" workshop, talked about this idea, and realized, from the conversations, that I actually could do something with it. The fitness function was going to end up being the relative entropy rate, and this would control where the posterior distribution concentrated in the long run. This would let me say something about the convergence of Bayesian learning with non-independent and even non-Markovian data, but also about what happens when the true distribution is not "in the support of the prior", i.e., when all the models really are wrong.
So I spent a lot of time in Lyon sweating over the asymptotic behavior of the integrated likelihood and so forth (literally, given the great heat-wave). By the end of the summer, I had versions of what are now Theorems 2, 3 and 4 in the paper. These say, respectively, that the posterior density goes to zero everywhere where fitness isn't maximized, and the rate is the fitness difference; that eventually the posterior distribution concentrates on the global peaks of the fitness landscape; and that the posterior distribution in a subset of model space not including those peaks is driven by the highest fitness in the subset. All of this was pretty nice and I was fairly pleased with myself.
Then I saw that if my results were correct, Bayesian updating should always be consistent. So I put the thing aside for a while....
Eventually I figured out what I was doing wrong; I was presuming that all the points in model-space were equally well-behaved. I was explicitly assuming that the (log) likelihood would eventually converge to the relative entropy for each hypothesis. This is on the one hand mathematically harmless (it's asymptotic equipartition), and on the other hand statistically I can't see how any likelihood-based method can hope to converge unless performance over a sufficiently long past is indicative of future results. (This is why such methods do not solve the problem of induction.) But I was further assuming, implicitly, that there was no way for the likelihood of a hypothesis with very high relative entropy to also converge very slowly. That is, I was assuming the Bayesian learner could not acquire bad habits. But of course it can; the bad habits just need to seem like good ideas at first. In human terms:
No one ever forgets how to do something that's worked for them in the past. Just replacing it with another behavior can be hard enough, and the old behavior is still going to be lurking there underneath it. Thieves keep stealing. Liars keep lying. Drunks never forget about chemically modifying their nervous systems.Similarly, the Bayesian learner never forgets about the model which matches the data perfectly — until it turns out that the model had just memorized the first 13,127 observations, and then repeats them forever. When that model crashes, however, there is always another one which memorized the first 26,254 observations...
What's needed is some restrictions on the prior distribution which keep it from putting too much weight on these bad hypotheses, though actually in non-parametric problems one doesn't want to give them strictly zero weight, because, hey, maybe a particular sequence of 13,127 observations really will repeat forever. (Metaphorically: a little lying, stealing and drinking is part of a complete life.) We are back to regularization, which has the duty of promoting virtue and suppressing vice.
The most common ways of doing this divide the space of models into distinct classes, and then use statistical learning theory or empirical process theory to gauge the risk of over-fitting within these constrained subspace. Typical arguments involve things like showing that every hypothesis in the constrained set is close to one of a finite number of hypotheses which "cover" or "bracket" the space, which gives uniform convergence. Then one relaxes the constraint as more data arrives, according to some deterministic schedule ("the method of sieves") or to optimize the trade-off between actual performance and a bound on over-fitting ("structural risk minimization"), etc.
Existing proofs of Bayesian consistency in non-parametric problems basically work similarly. To simplify just a bit, the trick has two parts. The first is to find constraints on the hypotheses which are strong enough to ensure uniform convergence, but can be relaxed to include everything; this is what you'd do anyway if you wanted a consistent non-parametric procedure. (Some people don't explicitly posit constraint sets, but do other things with much the same effect.) The second is to construct a prior whose bias towards the most-constrained sets is strong enough to keep the wild, high-capacity parts of the hypothesis space from dominating the posterior distribution, but isn't so biased that the constraint can't be overcome with enough data.
This is what I ended up having to do. (There was a gap of several years between seeing that some constraint was needed, and seeing what constraint would work. I am, in fact, a sloth.) My constraint was, roughly, "the log likelihood converges at least this fast". Unfortunately, I wasn't able to express it in relatively nice terms, like covering numbers, though I suspect someone cleverer could replace it with something along those lines. (It's about one-sided deviations of time-averages from their limiting values, so it feels empirical-process-y.) Anyone who actually cares will actually be better served by reading the paper than by my trying to re-express it verbally.
I didn't figure out the right constraint to impose, and the right way to relax it, until the summer of 2008. (This was what was on my mind when I read Chris Anderson's ode to overfitting.) Once I did, everything else was pretty direct, especially since it turned out I could salvage most (but definitely not all) of what I'd done in Lyon. One of the bigger efforts in actually writing the paper was uniformly eliminating talk of replicators and fitness, except for a small appendix, in favor of more statistical jargon. (I hope I succeeded.) Adding an example, namely trying to use Markov chains to predict a sofic system, took about a day. This is an amusing case, because the posterior distribution never converges — you can always do strictly better by moving to Markov chains of higher order. Nonetheless, the (Hellinger) distance between the posterior predictive distribution and the actual predictive distribution goes to zero.
Having submitted this, I'm going to put it aside again until I hear from the referees, because there's a lot of other work which needs finishing. (When I do hear from the referees, I anticipate a certain amount of gnashing of teeth.) But in the meanwhile, I'll use this space to jot down some ideas.
Thanks to Nicolas Della Penna, Shiva Kaul and Chris Wiggins for typo-spotting.
Posted at January 30, 2009 22:47 | permanent link