"Boosting" is a machine learning technique for amplifying the power of a learning algorithm which is "weak": not very good, but always at least a little better than chance. We start with a data set and our black-box weak learner. We feed the learner the data and it gives us a prediction function, or model. We look at where that model goes wrong (and, as appropriate, by how much it errs). We then draw a weighted sample of the data, giving more weight to the examples where the first model was wrong. This weighted sample is fed to the weak learner, which gives us a new model or prediction function. We now repeat the cycle, increasing the weight of examples which the latest model got wrong, and decreasing the weight of the examples it got right. We continue this boosting for some number of rounds, and, when we want to make a prediction, we use the average of all the predictive models we've gotten out of the weak learner. (We might weight them so those which did better on their training data count for more in the average.) Finis.
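The cycle just described is, in essence, the AdaBoost algorithm. Here is a minimal sketch in Python; the decision-stump weak learner and the toy data are my own illustrative choices, not anything specific to the book:

```python
import numpy as np

def stump_learner(X, y, w):
    """Weak learner: the best single-feature threshold rule (a
    "decision stump") under the weight distribution w over examples."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= t, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    err, j, t, sign = best
    return (lambda Z: sign * np.where(Z[:, j] <= t, 1, -1)), err

def adaboost(X, y, rounds=30):
    """Boost the stump learner: up-weight the examples the latest
    model got wrong, down-weight the ones it got right, and give
    more accurate models a bigger vote (alpha) in the final average."""
    w = np.full(len(y), 1.0 / len(y))          # start with uniform weights
    models, alphas = [], []
    for _ in range(rounds):
        h, err = stump_learner(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard the log below
        alpha = 0.5 * np.log((1 - err) / err)  # more accurate -> bigger vote
        w *= np.exp(-alpha * y * h(X))         # reweight the sample
        w /= w.sum()
        models.append(h)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, models)))

# Toy data with a diagonal boundary, which a single axis-aligned
# stump can fit only roughly; the boosted ensemble does much better
# on its training set.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = adaboost(X, y)
print("training accuracy:", np.mean(model(X) == y))
```

(Exhaustively searching thresholds is the simplest correct stump learner; any black box that beats chance under the current weights would do.)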

The final ensemble or average of all the weaker learner's models is equivalent to a single, generally very complicated, model. A naive view of Occam's Razor says that it should over-fit horribly. Instead, boosting often works extremely well — and much better than the individual models. It doesn't help when the weak learners give answers which are no better than chance, but if they have even a little bit of an edge over chance, even a little bit of predictive power, boosting can indeed vastly amplify that. Why?

Schapire and Freund's Boosting is a sustained attempt, by two
of the founders of the subject, to answer this surprisingly controversial
question. One thing they are quite persuasive about is that this is
not *just* a story about the bias-variance trade-off, with the
individual models being noisy and unstable, but the ensemble getting stabilized
by averaging. (This is the principle of the great Leo Breiman's "bagging",
which he tried to use to account for boosting.) This can't be the whole story,
because boosting can lead to great improvements even when the individual
models are very stable, so there's no variance to be killed off by averaging.
It's also not simply an optimization story, since while there definitely is an
objective function which boosting is trying to minimize, there are other
algorithms which also target this objective function and do *not* work
anywhere nearly as well.
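For AdaBoost specifically, the objective in question is (as is well known, though the naming is mine, not a quotation from the book) the exponential loss of the combined real-valued score against the ±1 labels:

```python
import numpy as np

def exp_loss(y, f):
    """AdaBoost's surrogate objective: mean exponential loss of the
    real-valued ensemble score f against labels y in {-1, +1}."""
    return np.mean(np.exp(-y * f))

y = np.array([1, -1, 1])
f = np.array([2.0, -1.0, -0.5])   # made-up ensemble scores
print(exp_loss(y, f))             # ~0.717: the third example is wrong and costly
```

Each boosting round amounts to one step of coordinate descent on this loss over the weak models' vote weights, which is precisely why minimizing it by some other algorithm, yet getting worse generalization, is such a puzzle.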

Schapire and
Freund's preferred
explanation for the effectiveness of boosting has to do with the "margin"
of an ensemble of predictors. Roughly speaking, this is how confident the
over-all ensemble is in its predictions (as opposed to individual models within
the ensemble). Slightly less roughly, the margin of an individual prediction
indicates how much the ensemble would have to be modified to change that
prediction. There is always some distribution of margins over training
examples; as boosting proceeds, these margins tend to get bigger and bigger.
And, as the authors show, it is possible to bound how much an ensemble
over-fits in terms of its distribution of margins --- the more that's skewed
towards high-margin cases, the lower the out-of-sample error. Requiring a
large margin in effect seriously reduces the capacity of the ensemble model to
fit *arbitrary* data, which is where over-fitting comes from.
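Concretely, for an ensemble of ±1 classifiers with vote weights alpha, the (normalized) margin of an example is the weighted vote in favor of its true label, scaled to lie in [-1, 1]: +1 means every model votes correctly, -1 means every model votes wrongly, and anything near 0 means the prediction could be flipped by shifting only a little weight. A made-up illustration:

```python
import numpy as np

# Hypothetical votes of a 3-model ensemble on 4 examples:
# each row is one example, each column one weak model's +/-1 prediction.
H = np.array([[ 1,  1,  1],
              [ 1, -1,  1],
              [-1, -1,  1],
              [-1, -1, -1]])
alpha = np.array([0.9, 0.5, 0.4])   # made-up vote weights
y = np.array([1, 1, 1, 1])          # true labels

# Normalized margin in [-1, 1]: how far the weighted vote is from a tie.
margins = y * (H @ alpha) / alpha.sum()
print(margins)   # [1.0, 0.444..., -0.555..., -1.0]
```

As boosting runs, the whole distribution of these numbers over the training set tends to shift rightward, and that shift is what the generalization bounds latch onto.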

The first part of the book deals with the fundamental algorithms of
boosting, the essential notions of statistical learning theory, and the
margin-based explanation for why boosting works. There then follows a quite
extraordinary Part II, on different ways of understanding boosting. The first
is to look at boosting as a zero-sum game between the weak learner (which is trying to
classify points correctly) and the boosting algorithm (which is trying to stump
the weak learner). This leads to a truly ingenious proof of von Neumann's
minimax theorem. The second interpretation understands boosting in terms
of online learning
and regret minimization.
The third views boosting as a constrained optimization problem, a fascinating
perspective that draws equally
on information geometry and
linear programming. These chapters are to my
mind the theoretical highlight of the book, a masterful demonstration of how
to connect many different areas of math *illuminatingly*.

Parts III and IV go beyond the core boosting approach for classifiers to consider multi-class prediction, ranking, and more refined theoretical problems, ending with a stochastic process which is, in a sense, the limit of doing infinitely-many infinitely-small boosting rounds in a finite time. These chapters are also very good, but more specialized and not so spectacular as Parts I and II.

There are, of course, omissions. There are versions of boosting for regression and related continuous-valued prediction problems, which have nice interpretations in terms of functional analysis; these are barely touched on. (There's a lot more in Bühlmann and van de Geer's Statistics for High-Dimensional Data.) There's also no attempt to draw connections to collective problem-solving, except, interestingly, rhetorically.

Boosting is, quite simply, one of the best-written books I've read on machine learning, even if it doesn't definitively settle the central question of why boosting works. It is formally self-contained, and could in principle be approached by any reader with reasonable mathematical maturity, whether or not they have prior knowledge of machine learning. In practice I imagine some prior hands-on experience with machine learning tools would be very helpful. With such a background it would be great for self-study, and I'd give a lot to teach a class based on the book.

Hardback, ISBN 978-0-262-01718-3

23 June 2012, small fixes 25 June