## Cross-Validation

17 Jul 2022 15:24

One of the most brilliantly simple and compelling ideas in all of statistics: to estimate how well your model will do on new data, take your data set and divide it into two parts at random. Fit the model to one part and then evaluate its predictions on the other; average over several such splits into training and testing sets.
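A minimal sketch of the idea in code, here in its V-fold form: partition the data into folds at random, fit on all-but-one fold, score on the held-out fold, and average. The toy data, the polynomial-regression model, and the use of CV to pick the degree are all illustrative choices, not anything from the papers below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative): y is a noisy linear function of x.
n = 200
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def kfold_mse(x, y, k=5, degree=1):
    """Estimate out-of-sample MSE of a degree-`degree` polynomial
    fit by k-fold cross-validation."""
    idx = rng.permutation(len(x))        # divide the data at random
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)  # fit to one part...
        pred = np.polyval(coef, x[test])               # ...predict the other
        errors.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errors)               # average over the splits

# Model selection: pick the degree with the smallest CV error estimate.
cv_scores = {d: kfold_mse(x, y, k=5, degree=d) for d in range(1, 6)}
best_degree = min(cv_scores, key=cv_scores.get)
```

The same loop, with the MSE replaced by any bounded loss, is the estimator whose risk the oracle-inequality papers below analyze.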

As a method of model selection; as (not quite the same thing) a means of estimating the generalization error of a statistical model; relations to bootstrapping. How best to cross-validate time series? Spatial models? Networks? Other kinds of structured data? Relation to "stability" in learning theory.

— I'd like to say I'm astonished at the number of people I encounter who think that cross-validation was invented by computer scientists rather than statisticians; but I know how academia works. Some of my references below are to help counter this amnesia.

Recommended, close-ups:
• Sylvain Arlot
• "V-fold cross-validation improved: V-fold penalization", arxiv:0802.0566 [Seeing cross-validation as a penalization method, and improving it accordingly by strengthening the penalty term]
• "Model selection by resampling penalization", Electronic Journal of Statistics 3 (2009): 557--624, arxiv:0906.3124
• Sylvain Arlot and Alain Celisse, "Segmentation of the mean of heteroscedastic data via cross-validation", Statistics and Computing 21 (2011): 613--632, arxiv:0902.3977 [MATLAB code]
• Sylvain Arlot and Matthieu Lerasle, "Choice of V for V-Fold Cross-Validation in Least-Squares Density Estimation", arxiv:1210.5830 [The paper formerly known as "Why V=5 is enough in V-fold cross-validation"]
• Prabir Burman, Edmond Chow and Deborah Nolan, "A cross-validatory method for dependent data", Biometrika 81 (1994): 351--358 [JSTOR]
• Patrick S. Carmack, William R. Schucany, Jeffrey S. Spence, Richard F. Gunst, Qihua Lin and Robert W. Haley, "Far Casting Cross Validation", Journal of Computational and Graphical Statistics 18 (2009): 879--893 [Leave-one-out CV, with a constant-radius window skipped around each hold-out point as well; this is designed to deal with correlations in time or in space.]
• Kehui Chen and Jing Lei, "Network Cross-Validation for Determining the Number of Communities in Network Data", arxiv:1411.1715
• Matthieu Cornec, "Concentration inequalities of the cross-validation estimator for Empirical Risk Minimiser", arxiv:1011.0096
• László Györfi, Michael Kohler, Adam Krzyżak and Harro Walk, A Distribution-Free Theory of Nonparametric Regression [chapters 7 and 8 have important results on data splitting and cross-validation]
• Darren Homrighausen and Daniel J. McDonald
• Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation 11 (1999): 1427--1453
• Guillaume Lecué and Charles Mitchell, "Oracle inequalities for cross-validation type procedures", Electronic Journal of Statistics 6 (2012): 1803--1837
• Charles Mitchell and Sara van de Geer, "General Oracle Inequalities for Model Selection", Electronic Journal of Statistics 3 (2009): 176--204 [Analyzes a data-set splitting scheme (like cross-validation with only one "fold")]
• Art B. Owen and Patrick O. Perry, "Bi-cross-validation of the SVD and the nonnegative matrix factorization", Annals of Applied Statistics 3 (2009): 564--594, arxiv:0908.2062
• Jeffrey S. Racine
• "Feasible Cross-Validatory Model Selection for General Stationary Processes", Journal of Applied Econometrics 12 (1997): 169--179 [JSTOR. This is closely related to (maybe algebraically just a special case of?) the familiar trick from splines of writing the CV criterion in terms of the hat/influence/projection matrix.]
• "Consistent cross-validatory model-selection for dependent data: hv-block cross-validation", Journal of Econometrics 99 (2000): 39--61
• Ryan J. Tibshirani and Robert Tibshirani, "A bias correction for the minimum error rate in cross-validation", Annals of Applied Statistics 3 (2009): 822--829, arxiv:0908.2904
• Mark J. van der Laan and Sandrine Dudoit, "Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples" [PDF working paper, i.e., a 100-page tome. The first part proves that multi-fold cross-validation and the like will work for selecting the best estimator out of a finite set of estimators (provided the loss function is nicely bounded and the data are IID). The second part ingeniously turns this into a complete estimation procedure, by effectively creating a discrete sieve and then using CV to say which part of the sieve to use. This is a very cool set of results, but (1) the limitations to bounded loss functions make me nervous, and (2) the formulas appearing in the finite-sample and even asymptotic bounds are ugly. On the other hand, they have finite-sample bounds! — I wonder if the bounded-and-IID restrictions could be lifted using the techniques in Jiang's "On Uniform Deviation Bounds" (link and description under Learning Theory), or those in Dedecker et al.'s Weak Dependence.]
• Aad W. van der Vaart, Sandrine Dudoit and Mark J. van der Laan, "Oracle inequalities for multi-fold cross validation", Statistics and Decisions 24 (2006): 351--371 [Streamlined and improved versions of the key results from the van der Laan/Dudoit tome. Thanks to Prof. van der Vaart for a reprint]
To write:
• CRS, "Cross-validation for mixing processes" [using some notions from learning with dependent data]
• CRS + co-conspirators, "Cross-validation for networks"