For most of the book, we are concerned with the following problem. We want
our machine to learn to make a certain kind of binary distinction over a
certain set of objects, called ``instances''. Our authors, following a common
practice in computer science, call this learning a concept, and if we think of
the distinction as being that between ``do this'' or ``don't do that,'' it's
actually not that far from what some philosophers, e.g. Stephen Toulmin, would
reduce concepts to. (I suspect that the philosophers and machine-learners
would both be annoyed by this.) To give a real
example, though not one featured in this book, the instances could be EKG
charts, suitably coded for the computer, and the target concept could be ``has
a heart condition needing therapy within a year.'' Data are provided to our
machine by an ``oracle,'' which selects an instance at random, according to
some fixed distribution, and then tells the machine what it picked and whether
or not the concept applies to it; successive instances are statistically
independent. (In the example, the oracle could be either a doctor experienced
in reading such charts, or, perhaps better, a review after a year of whether
or not a symptomatic heart condition developed.) We also need to specify the
way the machine represents concepts --- getting the representation class wrong
can turn even easy problems into intractable ones. (Generally, there's a
trade-off between having a wide class of representations, one which could
handle many different concepts, and having representations which can be quickly
and easily manipulated.) In the actual EKG experiment, the representations
were surfaces defined by polynomials of some fixed order; everyone whose chart
was on one side of the surface was deemed to have a heart condition. Despite
the importance of the representation class, it is assumed throughout the book
that, first, the analysis of learning can take the representation as more or less
given, and, second, that there is always at least one available representation
which captures the target concept *exactly.* The latter assumption is to
some degree a convenience of exposition for this book; the former is
ubiquitous in the field. The focus, in any case, is on designing machines
that, under these assumptions, quickly, simply and reliably reach
representations which are close to the target concept.
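The oracle just described is simple enough to sketch in code. Here is a minimal stand-in, where the "EKG-like" target concept and the uniform distribution are toy assumptions of mine, not anything from the book:

```python
import random

def make_oracle(target, distribution, seed=0):
    """Build an oracle: each call draws one instance independently from a
    fixed distribution and reports it together with its true label."""
    rng = random.Random(seed)
    def oracle():
        x = distribution(rng)       # instance chosen at random
        return x, target(x)         # whether the concept applies to it
    return oracle

# Toy stand-in for the EKG example: instances are readings in [0, 1],
# and the target concept is "the reading exceeds 0.7".
oracle = make_oracle(target=lambda x: x > 0.7,
                     distribution=lambda rng: rng.random())
sample = [oracle() for _ in range(5)]   # five independent labelled examples
```

Successive calls are independent draws from the same fixed distribution, which is all this framework asks of the data source.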

The central notion of this book, and indeed of the whole school of
machine-learners which it represents, is Leslie Valiant's idea of ``probably
approximately correct'' (PAC) learning. A PAC learning algorithm is one which,
if it's fed enough data, has an arbitrarily high probability (i.e. an
arbitrarily high confidence) of producing a representation, called the
``hypothesis,'' which is arbitrarily close to the target concept, one with
arbitrarily small error, defined as the sum of the probabilities of making
false positives and making false negatives. (Both these probabilities are to
be taken with respect to the distribution of instances used in the oracle.)
That is to say, we can not only demand a 95% confidence that the output is a
guess which errs at most 1% of the time, and be satisfied if we're sufficiently
patient while the machine queries the oracle; we can equally well demand any level
of confidence and any level of accuracy. (There is a related notion of ``weak
PAC learning,'' where the confidence and accuracy are both fixed, but the
burden of chapter four is to show that when we know how to do this, we can
always use it as part of a PAC learning algorithm in the stronger sense.)
*Efficient* PAC learning is a reflection of our impatience: we demand
that the machine deliver the goods after only a restricted number of queries of
the oracle, the upper bound being given by some function polynomial in the
complexity of the target concept, the confidence level and the accuracy of the
hypothesis. (Strictly speaking: the length of the shortest representation of
the concept in the machine's concept class, the reciprocal of one minus the
confidence level, and the reciprocal of the error probability, respectively.)
The bulk of the book is concerned with results about when PAC learning of
concepts is possible, and when it can be efficient.
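To make the definition concrete, here is the textbook-style toy case of thresholds on the unit interval. This is my illustration rather than the book's; the sample size (1/epsilon) ln(1/delta) is the standard bound for this one-parameter class:

```python
import math
import random

def pac_learn_threshold(oracle, epsilon, delta):
    """PAC-learn a threshold concept c(x) = [x >= theta] on [0, 1].
    Drawing m >= (1/epsilon) * ln(1/delta) labelled examples suffices for
    the tightest consistent hypothesis to have error at most epsilon with
    probability at least 1 - delta (the standard analysis for this class)."""
    m = math.ceil((1.0 / epsilon) * math.log(1.0 / delta))
    examples = [oracle() for _ in range(m)]
    positives = [x for x, label in examples if label]
    theta_hat = min(positives) if positives else 1.0  # tightest consistent guess
    return lambda x: x >= theta_hat

rng = random.Random(42)
theta = 0.7                      # the unknown target, hidden inside the oracle
def oracle():
    x = rng.random()
    return x, x >= theta

h = pac_learn_threshold(oracle, epsilon=0.01, delta=0.05)
```

Because the learned threshold sits at or above the target, the hypothesis never produces false positives; its whole error is the thin band between the target threshold and the learned one, which shrinks as more data are drawn.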

The first general result along these lines occurs in chapter two, and is presented, somewhat fancifully, as a form of Occam's Razor. Suppose we have an algorithm which, confronted with data from the oracle, gives us a hypothesis which is consistent with the data and guaranteed to be not too much more complicated than the target concept, and moreover does this in polynomial time. The authors call this an Occam algorithm, and show that if we have it, we can build a PAC learning algorithm around it, though not necessarily an efficient one.
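A standard concrete example of such a consistent, polynomial-time learner (my choice of illustration, not necessarily the book's) is the elimination algorithm for monotone conjunctions of Boolean variables:

```python
def learn_conjunction(samples, n):
    """Elimination algorithm for monotone conjunctions over n Boolean
    variables: start with every variable and delete any variable that is
    False in some positive example.  The surviving set always contains the
    target's variables, so the resulting conjunction is consistent with all
    the data, and the pass runs in time polynomial in n and the sample size."""
    keep = set(range(n))
    for x, label in samples:
        if label:
            keep -= {i for i in range(n) if not x[i]}
    return keep   # hypothesis: the conjunction of the surviving variables

# Toy data consistent with the (hidden) target x0 AND x2 over 4 variables.
data = [((1, 1, 1, 0), True), ((1, 0, 1, 1), True), ((0, 1, 1, 1), False)]
hypothesis = learn_conjunction(data, 4)
```

On this data the two positive examples eliminate variables 3 and 1, leaving exactly the target's variables, and the hypothesis agrees with every labelled example.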

We can, however, prove with relative ease a fairly broad theorem about a sufficient condition for efficient learning, though it requires a little preliminary work. Vapnik-Chervonenkis dimension is a way of quantifying the ease of learning categories from small data sets, the rate at which approximations converge. (In fact Vapnik and Chervonenkis introduced it in a paper ``On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.'') The VC dimension of a class of concepts is the size of the largest set of instances which the concepts ``shatter,'' i.e. where every possible dichotomy of the instance-set can be obtained from one or another of the concepts. (However you want to break up the instances, there's a concept which will do it.) If the VC dimension is finite, then those concepts can be efficiently PAC learned, in a time proportional to the dimension. There are however classes with infinite VC dimension which can be efficiently learned, so the converse isn't true.
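Shattering is easy to check by brute force on small finite cases. The sketch below, using thresholds on the line as a made-up example class, confirms that thresholds shatter any single point but no pair, so their VC dimension is 1:

```python
from itertools import combinations

def shatters(concepts, points):
    """True iff every dichotomy of `points` is realized by some concept."""
    dichotomies = {tuple(c(x) for x in points) for c in concepts}
    return len(dichotomies) == 2 ** len(points)

def vc_dimension(concepts, instances):
    """Largest d such that some d-element subset of `instances` is
    shattered -- brute force, sensible only for tiny demonstrations."""
    d = 0
    for k in range(1, len(instances) + 1):
        if any(shatters(concepts, subset)
               for subset in combinations(instances, k)):
            d = k
    return d

# Threshold concepts c_t(x) = [x >= t]: a one-parameter example class.
thresholds = [lambda x, t=t: x >= t for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

With two points, the dichotomy labelling only the smaller one positive can never be realized by a threshold, which is why no pair is shattered.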

We can drop some elements from the original setting in which we presented PAC learning, without changing things too much. The oracle can occasionally give the machine misleading information, and we can have the machine pick instances to ask the oracle about, rather than having the oracle choose them. Both of these modifications are covered in this book, in chs. 5 and 8. The case of non-independent instances is mentioned, but not handled in the text, though good references are given. Efficient PAC learning is still possible in all these cases. It is not possible, however, if the concepts (treated now as Boolean functions, rather than just classes) have more than a certain minimum of computational power, at least not if certain assumptions in number theory (which form the basis of RSA public-key cryptography, and so of PGP) are true.

This is the easiest introduction to the theory of machine learning I've found, but it still requires a fair degree of knowledge of computer science, at the very least a grasp of computational complexity on the level of a good undergraduate course on the analysis of algorithms. If that's in place, however, it makes a fine book for self-study. Computer scientists who don't already know this material will find it useful, as will statisticians, and even cognitive scientists and philosophers of science can find things worth thinking about here.

xiv + 207 pp., numerous black and white diagrams, end-of-chapter bibliographic notes, full bibliography, index of subjects

Computers and Computing / Probability and Statistics

Currently in print as a hardback, US$35.00, ISBN 0-262-11193-4, LoC Q325.5 K44

5 May 1999