The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   154

Model Selection and Model Averaging

by Gerda Claeskens and Nils Lid Hjort

Cambridge: Cambridge University Press, 2008

Cambridge Series in Statistical and Probabilistic Mathematics

How Can You Choose Just One?

A statistical model is a story about how the statistician's data might have been generated by a stochastic process. Sometimes we are in the very fortunate situation where our background knowledge tells us that there is really only one plausible story, or at least only one story we know how to tell, and so there is just one model, and we can do all of our statistical inference within that. This is the usual situation presented in statistical textbooks, but it is more of a lie-told-to-children than a reliable guide. The more typical situation is that we very much want to use a parametric statistical model, but neither scientific nor statistical considerations fix on a single model as the right story. We have several possible competing models, and need either to use the data to somehow settle on one (the problem of model selection) or to combine them in some way (the problem of model averaging).

This book is the best available review of model selection from a statistical standpoint. It has a very nice combination of just-enough statistical theory with lots of non-trivial worked examples, and the theory is well-presented and useful, without much being left to folklore. (Some details are referred to Claeskens's and Hjort's papers.)

Chapter 1 opens with some examples to illustrate why reliable model selection would be a good thing, and some of the goals it could serve. Chapter 2 gives a clear, clean presentation of the reasoning behind the famous Akaike information criterion (AIC), as an unbiased estimate of the expected negative log-likelihood (a.k.a. relative entropy or Kullback-Leibler divergence). KL divergence is presented just as a nearly-arbitrary way of saying how far apart two probability distributions are, with the nice property of being zero when and only when the model is exactly right; there is no discussion of its connection to hypothesis testing (it controls the exponential rate at which power grows), and only a little of its meaning in terms of coding and information (in chapter 4).
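For readers who want the two quantities in symbols, here is a brief sketch in standard notation. The symbols are assumptions of this note rather than quotations from the book: g is the true density, f(y; θ) the model density, ℓ_n the log-likelihood of n observations, and p the number of estimated parameters. The AIC is written in the "larger is better" form; many references use the negative of this and minimize.

```latex
% Kullback-Leibler divergence from the truth g to the model f(.; theta)
\mathrm{KL}(g, f_\theta)
  = \int g(y) \log \frac{g(y)}{f(y;\theta)} \, dy
  = \int g(y) \log g(y) \, dy - \int g(y) \log f(y;\theta) \, dy

% The first term does not depend on the model, so ranking models by KL
% divergence is the same as ranking them by expected log-likelihood.

% Akaike's criterion: maximized log-likelihood, penalized by parameter count
\mathrm{AIC} = 2\,\ell_n(\hat{\theta}) - 2p
```

Under the classical assumptions, AIC/(2n) is an approximately unbiased estimate of the expected log-likelihood of a new observation under the fitted model, which is the model-dependent part of the KL divergence with its sign flipped.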

The derivations leading to the classical AIC make various quite restrictive assumptions, such as the model specification being completely correct; sections 2.4--2.8 and 2.10 discuss modifications to allow for mis-specification, estimators more robust to outliers than maximum likelihood, using bootstrapping rather than large-sample theory, etc. Section 2.9 explains the connection with cross-validation; a blunt summary is that AIC is a fast asymptotic approximation to leave-one-out cross-validation, and that the former's domain of validity is a strict subset of the latter's. (This does not diminish Akaike's achievements!)
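The AIC/cross-validation connection can be checked numerically on a toy problem. The sketch below is not taken from the book: it assumes Gaussian polynomial regression on simulated data, and the function names (`design`, `gaussian_aic`, `loo_log_score`) are invented for the illustration. It compares AIC, in the "larger is better" form 2 x log-likelihood minus 2 x parameters, with twice the leave-one-out predictive log-score computed by brute-force refitting.

```python
import numpy as np

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def gaussian_aic(y, X):
    """AIC = 2 * max log-likelihood - 2 * (number of parameters)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                      # ML estimate of noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return 2.0 * loglik - 2.0 * (p + 1)             # +1 counts the variance

def loo_log_score(y, X):
    """Sum over i of the log predictive density of y[i], fitted without point i."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = y[keep] - X[keep] @ beta
        sigma2 = resid @ resid / (n - 1)            # ML variance on the n-1 kept points
        err = y[i] - X[i] @ beta                    # out-of-sample prediction error
        total += -0.5 * (np.log(2 * np.pi * sigma2) + err**2 / sigma2)
    return total

# Simulated data (an assumption of this sketch): a quadratic truth plus noise.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

degrees = range(6)
aic_scores = np.array([gaussian_aic(y, design(x, d)) for d in degrees])
loo_scores = np.array([2.0 * loo_log_score(y, design(x, d)) for d in degrees])

for d, a, c in zip(degrees, aic_scores, loo_scores):
    print(f"degree {d}: AIC = {a:9.2f}   2 x LOO log-score = {c:9.2f}")
```

The two columns should track each other closely for the well-specified models, at the cost of n refits for the cross-validation column versus one fit for AIC. That is the sense in which AIC is the "fast" approximation.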

Chapter 3 presents the "Bayesian" information criterion, which is really not very Bayesian* and should probably be called Schwarz's information criterion; the presentation goes through the now-standard route of Laplace approximation, rather than Schwarz's original ideas. It also (sec. 3.5) presents the deviance information criterion of Spiegelhalter et al., noting (p. 92) that in large samples it becomes equivalent to the Takeuchi information criterion, one of the more-robust forms of AIC. Section 3.6 glances at minimum description length ideas, in the simplest form where they look very much like BIC.

Chapter 4 is largely about the statistical properties of model selection criteria of the "log-likelihood minus penalty term" type. The two big properties considered are consistency (does the selector pick out the truth, when it's among the options?) and efficiency (does the selected model predict nearly as well as an Oracle which knew the right model?). Annoyingly, while BIC is consistent and AIC is not, AIC is often more efficient. There are mathematical obstacles to coming up with selectors which are both consistent and efficient, glanced at in section 4.9.**
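The consistency half of this contrast is easy to see in simulation. The following is a minimal sketch under assumptions of my own, not an example from the book: the truth is a straight line with Gaussian noise, the candidates are polynomials of degree 0 to 4, and both criteria are of the "2 x log-likelihood minus penalty" type, with penalty 2 per parameter for AIC and log(n) per parameter for BIC. The helper names are hypothetical.

```python
import numpy as np

TRUE_DEGREE = 1   # the simulated truth below is a straight line
MAX_DEGREE = 4

def max_loglik(y, X):
    """Maximized Gaussian log-likelihood of a linear model."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def select_degree(x, y, penalty_per_param):
    """Pick the degree maximizing 2*loglik - penalty * (number of parameters)."""
    scores = []
    for d in range(MAX_DEGREE + 1):
        X = np.vander(x, d + 1, increasing=True)
        n_params = d + 2                      # d+1 coefficients plus the variance
        scores.append(2.0 * max_loglik(y, X) - penalty_per_param * n_params)
    return int(np.argmax(scores))

def simulate(n, n_rep=200, seed=0):
    """Return per-replication (AIC pick, BIC pick) and the two hit rates."""
    rng = np.random.default_rng(seed)
    picks = np.empty((n_rep, 2), dtype=int)
    for r in range(n_rep):
        x = rng.uniform(-1.0, 1.0, size=n)
        y = 1.0 + 2.0 * x + rng.normal(size=n)
        picks[r, 0] = select_degree(x, y, 2.0)          # AIC penalty
        picks[r, 1] = select_degree(x, y, np.log(n))    # BIC penalty
    rates = (picks == TRUE_DEGREE).mean(axis=0)
    return picks, rates

results = {n: simulate(n) for n in (50, 500)}
for n, (_, rates) in results.items():
    print(f"n = {n:4d}: AIC picks truth {rates[0]:.2f}, BIC picks truth {rates[1]:.2f}")
```

Because log(n) exceeds 2 once n is at least 8, BIC can never choose a larger model than AIC on the same data. One would expect BIC's hit rate to climb towards 1 as n grows while AIC keeps over-fitting a fixed fraction of the time. The sketch says nothing about efficiency, which needs a setting where the truth is not among the candidates.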

Chapters 5 and 6 form a unit (including shared notation); the shared topic is understanding when it would be better to use the wrong model for one's estimates. The idea here, which is very sound, goes all the way back to the bias-variance trade-off: if one is using an overly simple model, one's estimate of various parameters will (probably) be biased, but it will also have less variance, because there are fewer parameters to estimate. How mis-specified does the simpler model have to be, before the cost in extra bias overcomes the gain in reduced variance, and how much does this depend on exactly which parameter one is trying to estimate? This is very much a finite-n question, because in the limit where the sample size goes to infinity, one is always better off using the true model. Claeskens and Hjort finesse this by using a sequence of local alternative models, where the extra parameters of the large model are shrinking to zero at a root-n rate. This lets them get very precise statements about how much mis-specification can be tolerated, and the asymptotic estimation error of different parameters in different models. Their "focused information criterion" (FIC) advises one to (i) pick a "focal" parameter one wants to estimate, (ii) calculate the mean squared estimation error in different models using their formulas, and (iii) pick the estimate associated with the lowest error. Different focal parameters might be estimated with different models, even with the same data, depending on their individual bias-variance trade-offs.
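A textbook toy case shows why the root-n scaling is convenient. This is an illustration of the trade-off only, not the book's FIC formulas, and every name in it is an assumption of the sketch. Take a linear regression y = b1*x1 + gamma*x2 + noise, with centered covariates of unit sample variance and sample correlation rho, noise standard deviation sigma, and gamma = delta / sqrt(n). The focal parameter is b1. The "narrow" model drops x2 and the "wide" model keeps it. Ordinary least-squares algebra then gives exact mean squared errors, and multiplying by n removes the sample size altogether.

```python
import math

def n_times_mse_narrow(delta, rho, sigma):
    """n * MSE of the narrow-model estimate of b1.

    Omitting x2 gives bias gamma * rho = delta * rho / sqrt(n)
    and variance sigma^2 / n.
    """
    return (delta * rho) ** 2 + sigma ** 2

def n_times_mse_wide(rho, sigma):
    """n * MSE of the wide-model estimate of b1 (unbiased, but noisier)."""
    return sigma ** 2 / (1.0 - rho ** 2)

def tolerance(rho, sigma):
    """Largest |delta| for which the narrow model is at least as accurate."""
    return sigma / math.sqrt(1.0 - rho ** 2)

rho, sigma = 0.6, 1.0
for delta in (0.0, 0.5, 1.0, 1.5, 2.0):
    narrow = n_times_mse_narrow(delta, rho, sigma)
    wide = n_times_mse_wide(rho, sigma)
    better = "narrow" if narrow < wide else "wide"
    print(f"delta = {delta:.1f}: narrow {narrow:.4f}, wide {wide:.4f} -> use {better}")
print(f"narrow model tolerated up to |delta| = {tolerance(rho, sigma):.4f}")
```

With gamma fixed rather than shrinking, the squared bias would stay constant while both variances fell like 1/n, so the wide model would always win eventually. Letting gamma shrink at the root-n rate keeps bias and variance on the same scale, which is what allows a non-trivial answer to "how much mis-specification can be tolerated".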

These chapters are the most distinctive parts of the book (together with chapter 7, they draw heavily on two papers by the authors [preprint versions]), and also the most mathematically challenging. The idea of guiding model selection to serve a specific inferential goal, which can change which model is selected, seems to me to be entirely right-headed. But the sequence-of-local-alternatives device used here, I cannot love. The local neighborhood technology has been a standard tool in some branches of asymptotic statistics since at least Le Cam, but it is a hard thing to make sense of probabilistically, since it violates Kolmogorov's consistency conditions. Claeskens and Hjort are fully aware of this, of course, but view it as a way of getting some information about the practical situation facing the statistician out of asymptotic theory (see especially Remark 5.3 on p. 128). I suspect it would be profitable to re-examine the topic from a strictly finite-sample stand-point using something like structural risk minimization.

Be that as it may, one issue with model selection is its interaction with more ordinary forms of statistical inference. If one first does model selection and then parameter estimation within the selected model, the usual measures of uncertainty within that model neglect the extra uncertainty due to selection. An alternative then to simply selecting one model and doing inference within it is to average over multiple models, with weights somehow reflecting how well they did. Chapter 7 accordingly presents model averaging based on AIC, BIC, FIC, and a full posterior distribution, looking closely (secs. 7.8--7.9) at the frequentist properties of the last. This chapter also shoe-horns in shrinkage estimators and thresholding.
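One common way to turn information-criterion scores into averaging weights is to make each weight proportional to exp(score / 2), where the score is in the "larger is better" form used above. The sketch below assumes that scheme; the function names and the numbers in the example are made up for illustration, and it should not be read as the book's exact procedure.

```python
import numpy as np

def smooth_weights(scores):
    """Weights proportional to exp(score / 2) for 'larger is better' scores."""
    s = np.asarray(scores, dtype=float)
    z = 0.5 * (s - s.max())        # subtracting the max avoids overflow
    w = np.exp(z)
    return w / w.sum()

def averaged_estimate(estimates, scores):
    """Weighted average of per-model estimates of the same quantity."""
    return float(np.dot(smooth_weights(scores), estimates))

# Hypothetical scores and per-model estimates of one focal parameter.
scores = [-290.0, -292.0, -300.0]
estimates = [1.0, 1.2, 2.0]

weights = smooth_weights(scores)
combined = averaged_estimate(estimates, scores)
print("weights:", np.round(weights, 4))
print("model-averaged estimate:", round(combined, 4))
```

Only score differences matter, so adding a constant to every score leaves the weights unchanged. Putting all the weight on the top-scoring model recovers ordinary selection as a special case, which is why the extra uncertainty due to selection shows up naturally once the weights are treated as random.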

Chapter 8 turns to goodness-of-fit tests, with the idea being to check a parametric model by embedding it in a non-parametric, infinite-series-expansion model, and seeing how many terms of the potentially-infinite series are selected by some information criterion. (As explained in sec. 8.6, this is closely related to the idea of Neyman's smooth test.) The presentation here is very similar to that of Hart's Nonparametric Smoothing and Lack-of-Fit Tests, unsurprisingly since Hart is a collaborator of Claeskens's.

Chapter 9 goes back to examples from chapter 1, re-working them in detail with the tools assembled over the previous chapters. Chapter 10 is a grab-bag of further topics, including the peculiarities of allowing for random effects, what happens when the restricted model lives not in the middle of the larger parameter space but on its boundary, and the proper handling of missing data.

I would have liked to see more about cross-validation (I don't think multi-fold CV is ever addressed), more about the subtleties that arise from latent variables, and some coverage of encompassing, of structural risk minimization a la Vapnik, and of non-nested model selection tests a la Cox and Vuong. But you can't have everything, and what we do have here is very good. The implied reader has a good grasp of the usual asymptotic theory of maximum likelihood estimation, and in particular of Fisher information. Of the deeper reaches of asymptotic statistics, there is a bit of Le Cam theory in chapters 5 and 6, a mention of empirical processes in chapter 8, and a hint of minimax theory in chapter 4. All the models considered are for IID data, or for regression, or are survival/hazard rate models (the last being unusual but welcome); time series models are mentioned but not examined in any depth. Concretely, I don't think there is any background presumed which a reader of (say) Davison's Statistical Models, or even of Casella and Berger, would lack; say, graduate students at the beginning of their second year. It would also work well for self-study.

*: Because a real Bayesian never feels a need to select a model at all, but rather maintains a posterior distribution over the whole model space at all times.

**: Work by Casella and Consonni, which appeared too late to be noticed in this book, suggests that "the set of parameters on which this dichotomy [between consistency and efficiency] occurs is extreme, even pathological, and should not be considered when evaluating model selectors". It's probably fair to say that this question is currently unsettled.

xvii + 312 pp., list of notation, bibliography, author index, subject index (thin), numerous black-and-white graphs

Hardback, ISBN 978-0-521-85225-8

Probability and Statistics

8 January 2012