March 04, 2010

The True Price of Models Pulling Themselves Up by Their Bootstraps

For a project I just finished, I produced this figure:

I don't want to give away too much about the project (update, 19 April: it's now public), but the black curve is a smoothing spline trying to predict the random variable R_{t+1} from R_t; the thin blue lines are 800 additional splines, fit to 800 bootstrap resamplings of the original data; and the thicker blue lines are the resulting 95% confidence bands for the regression curve [1]. (The tick marks on the horizontal axis show the actual data values.) Making this took about ten minutes on my laptop, using the boot and mgcv packages in R.
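For concreteness, here is a minimal sketch of that sort of computation, not the project's actual code. The data vector r, the prediction grid, and the choice to case-resample the lagged pairs (which ignores serial dependence) are all assumptions on my part, and mgcv's default penalized spline stands in for the smoothing spline.

library(boot)
library(mgcv)

df <- data.frame(x = r[-length(r)], y = r[-1])   # lagged pairs (R_t, R_{t+1})
grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 200))

## Statistic for boot(): refit the spline to a resample of the pairs and
## return its predictions on the fixed grid
spline_on_grid <- function(data, indices) {
  fit <- gam(y ~ s(x), data = data[indices, ])
  predict(fit, newdata = grid)
}

b <- boot(df, spline_on_grid, R = 800)

## Pointwise 95% band: the 2.5% and 97.5% quantiles of the 800 bootstrap
## curves at each grid point
bands <- apply(b$t, 2, quantile, probs = c(0.025, 0.975))

matplot(grid$x, cbind(b$t0, t(bands)), type = "l", lty = c(1, 2, 2),
        lwd = c(2, 1, 1), col = c("black", "blue", "blue"),
        xlab = expression(R[t]), ylab = expression(R[t + 1]))
rug(df$x)   # tick marks for the actual data values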

The project gave me an excuse to finally read Efron's original paper on the bootstrap, where my eye was caught by "Remark A" on p. 19 (my linkage):

Method 2, the straightforward calculation of the bootstrap distribution by repeated Monte Carlo sampling, is remarkably easy to implement on the computer. Given the original algorithm for computing R, only minor modifications are necessary to produce bootstrap replications R^{*1}, R^{*2}, ..., R^{*N}. The amount of computer time required is just about N times that for the original computations. For the discriminant analysis problem reported in Table 2, each trial of N = 100 replications, [sample size] m = n = 20, took about 0.15 seconds and cost about 40 cents on Stanford's 370/168 computer. For a single real data set with m = n = 20, we might have taken N = 1000, at a cost of $4.00.
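In modern terms, Method 2 is just a loop. A generic sketch in base R, with a made-up placeholder statistic standing in for "the original algorithm for computing R":

stat <- function(x) mean(x)   # placeholder for the original algorithm computing R
bootstrap_reps <- function(x, N) {
  ## each replication re-runs stat() on a resample, so the total time is
  ## about N times that of the original calculation
  replicate(N, stat(sample(x, length(x), replace = TRUE)))
}
reps <- bootstrap_reps(rnorm(20), 1000)   # Efron's n = 20, N = 1000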

My bootstrapping used N = 800, n = 2527. Ignoring the differences between fitting Efron's linear classifier and my smoothing spline, and assuming the cost scales linearly in both N and n, creating my figure would have cost $404.32 in 1977, or $1436.90 in today's dollars (using the consumer price index). But I just paid about $2400 for my laptop, which will have a useful life of (conservatively) three years; a ten-minute pro rata share of that comes to about 1.5 cents.
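The arithmetic, with the linear-scaling assumption made explicit (the 1977-to-2010 CPI factor here is simply the one implied by the two dollar figures above):

cost_1977 <- 0.40 * (800 / 100) * (2527 / 20)        # = 404.32
cost_2010 <- cost_1977 * 1436.90 / 404.32            # = 1436.90
laptop_10min <- 2400 * 10 / (3 * 365.25 * 24 * 60)   # ~ 0.015, i.e. 1.5 cents
cost_2010 / laptop_10min                             # ~ 9.4e4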

The inexorable economic logic of the price mechanism forces me to conclude that bootstrapping is about 100,000 times less valuable for me now than it was for Efron in 1977.

Update: Thanks to D.R. for catching a typo.

[1]: Yes, yes, unless the real regression function is a smooth piecewise cubic, there's some approximation bias from using splines, so this is really a confidence band for the optimal spline approximation to the true regression curve. I hope you are as scrupulous when people talk about confidence bands for "the" slope of their linear regression models. (Added 7 March to placate quibblers.)

Enigmas of Chance

Posted at March 04, 2010 13:35 | permanent link
