For most of the book, we are concerned with the following problem. We want our machine to learn to make a certain kind of binary distinction over a certain set of objects, called ``instances''. Our authors, following a common practice in computer science, call this learning a concept, and if we think of the distinction as being that between ``do this'' or ``don't do that,'' it's actually not that far from what some philosophers, e.g. Stephen Toulmin, would reduce concepts to. (I suspect that the philosophers and machine-learners would both be annoyed by this.) To give a real example, though not one featured in this book, the instances could be EKG charts, suitably coded for the computer, and the target concept could be ``has a heart condition needing therapy within a year.'' Data are provided to our machine by an ``oracle,'' which selects an instance at random, according to some fixed distribution, and then tells the machine what it picked and whether or not the concept applies to it; successive instances are statistically independent. (In the example, the oracle could be either a doctor experienced in reading such charts, or, perhaps better, a review a after a year of whether or not a symptomatic heart condition developed.) We also need to specify the way the machine represents concepts --- getting the representation class wrong can turn even easy problems into intractable ones. (Generally, there's a trade-off between having a wide class of representations, one which could handle many different concepts, and having representations which can be quickly and easily manipulated.) In the actual EKG experiment, the representations were surfaces defined by polynomials of some fixed order; everyone whose chart was on one side of the surface was deemed to have a heart condition. Despite the importance of the representation class, it is assumed throughout the book that, first, the analysis of learning can take the representation as more less given, and, second, that there is always at least one available representation which captures the target concept exactly. The latter assumption is to some degree a convenience of exposition for this book; the former is ubiquituous in the field. The focus, in any case, is on designing machines that, under these assumptions, quickly, simply and reliably reach representations which are close to the target concept.
The central notion of this book, and indeed of the whole school of machine-learners which it represents, is Leslie Valiant's idea of ``probably approximately correct''(PAC) learning. A PAC learning algorithm is one which, if it's fed enough data, has an arbitrarily high probability (i.e. an arbitrarily high confidence) of producing a representation, called the ``hypothesis,'' which is arbitrarily close to the target concept, one with arbitrarily small error, defined as the sum of the probabilities of making false positives and making false negatives. (Both these probabilities are to be taken with respect to the distribution of instances used in the oracle.) That is to say, we can not only demand a 95% confidence that the output is a guess which errs at most 1% of the time, and be satisfied if we're sufficiently patient while the machine queries; we can equally well demanded any level of confidence and any level of accuracy. (There is a related notion of ``weak PAC learning,'' where the confidence and accuracy are both fixed, but the burden of chapter four is to show that when we know how to do this, we can always use it as part of a PAC learning algorithm in the stronger sense.) Efficient PAC learning is a reflection of our impatience: we demand that the machine deliver the goods after only a restricted number of queries of the oracle, the upper bound being given by some function polynomial in the complexity of the target concept, the confidence level and the accuracy of the hypothesis. (Strictly speaking: the length of the shortest representation of the concept in the machine's concept class, the reciprocal of one minus the confidence level, and the reciprocal of the error probability, respectively.) The bulk of the book is concerned with results about when PAC learning of concepts is possible, and when it can be efficient.
The first general result along these lines occurs in chapter two, and is presented, somewhat fancifully, as a form of Occam's Razor. Suppose we have an algorithm which, confronted with data from the oracle, gives us a hypothesis which is consistent with the data and guaranteed to be not too much more complicated than the target concept, and moreover does this in polynomial time. The authors call this an Occam algorithm, and show that if we have it, we can build a PAC learning algorithm around it, though not necessarily an efficient one.
We can, however, prove with relative ease a fairly broad theorem about a sufficient condition for efficient learning, which however requires a little preliminary work. Vapnik-Chervonenkis dimension is a way of quantifying the ease of learning categories from small data set, the rate at which approximations converge. (In fact Vapnik and Chervonenkis introduced it in a paper ``On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.'') The VC dimension of a class of concepts is the size of the largest set of instances which the concepts ``shatter,'' i.e. where every possible dichotomy of the instance-set can be obtained from one or another of the concepts. (However you want to break up the instances, there's a concept which will do it.) If the VC dimension is finite, then those concepts can be efficiently PAC learned, in a time proportional to the dimension. There are however classes with infinite VC dimension which can be efficiently learned, so the converse isn't true.
We can drop some elements from the original setting in which we presented PAC learning, without changing things too much. The oracle can occasionally give the machine mis-leading information, and we can have the machine pick instances to ask the oracle about, rather than having the oracle choose them. Both of these modifications are covered in this book, in chs. 5 and 8. The case of non-independent instances is mentioned, but not handled in the text, though good references are given. Efficient PAC learning is still possible in all these cases. It is not possible, however, if the concepts (treated now as Boolean functions, rather than just classes) have a more than a certain minimum of computational power, at least not if certain assumptions in number theory (which form the basis of RSA public-key cryptography, and so of PGP) are true.
This is the easiest introduction to the theory of machine learning I've found, but it still requires a fair degree of knowledge of computer science, at the very least a grasp of computational complexity on the level of a good undergraduate course on the analysis of algorithms. If that's in place, however, it makes a fine book for self-study. Computer scientists who don't already know this material will find it useful, as will statisticians, and even cognitive scientists and philosophers of science can find things worth thinking about here.