MIT Press, 2000

The first is an engaging and fairly non-technical survey of the field of machine learning, somewhere between the level of New Scientist and a friendly intro. textbook (which doesn't seem to exist for the field). This book is excellent and highly recommendable, if sometimes a bit too cute for its own good.

The other book claims to draw out the implications of machine learning, its achievements and limitations, for a host of Deep Issues --- induction and the reliability of knowledge, the nature of the scientific endeavor, questions of representation and embodiment in cognitive science, the nature of pattern or regularity, even the reality of theoretical entities. It would be kind to describe the thinking displayed in this book as ``sloppy''.

Unfortunately, it is not easy to disentangle the two books. I'll start with the good one.

The first big point Thornton makes is that machine learning is equivalent to prediction. That is, whenever we can specify some learning task sufficiently precisely to get a machine to do it, it's equivalent to having the machine learn to do predictions of a certain kind. We always want the machine to learn a rule of the form ``If X, then Y''; which is the same, from its perspective, as learning to predict Y when it sees X. In what is called ``concept learning,'' for instance, the output Y is a binary variable, 1 if X is an instance of a certain concept (e.g., ``edible fruit'') and 0 otherwise. Learning the concept is equivalent to learning to predict 0 or 1, depending on X. And so on for more complicated classifications, regression, learning behaviors, and so forth. All this is quite sound. I actually thought this went without saying, but Thornton evidently felt a need to convince his fellow computer scientists of it.

The second big point is that, for a broad class of learning/prediction problems, we have methods which work quite well, and are pretty simple at their core. The easiest is just what's called ``nearest neighbor'': given an unknown input, predict the output of the most similar input whose value is known. One can elaborate such similarity-based methods considerably, but in essence they all involve what Thornton calls ``fence-and-fill'': carve up the input space into simple blocks (with ``fences''), and treat every point in the block the same way (``fill''). The refinements of fence-and-fill methods have to do with how partitions are created and shifted in response to training corrections. (A particularly simple case of fence-and-fill learning is dividing the input space by straight lines; if this is successful, the problem is said to be linearly separable, and this special case seems to motivate a lot of Thornton's thinking, as we'll see.)
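To make this concrete, here is a minimal nearest-neighbor predictor (my own toy example, with made-up points and labels, not anything from the book):

```python
import math

# Labeled training data: (input point, output label).
train = [((1.0, 1.0), "edible"), ((1.2, 0.8), "edible"),
         ((4.0, 5.0), "inedible"), ((4.5, 4.8), "inedible")]

def nearest_neighbor(x):
    """Predict the label of the most similar input whose value is known."""
    return min(train, key=lambda p: math.dist(x, p[0]))[1]

print(nearest_neighbor((1.1, 0.9)))   # -> edible (near the first cluster)
print(nearest_neighbor((4.2, 5.1)))   # -> inedible (near the second cluster)
```

Everything else in the fence-and-fill family is, in essence, a refinement of this: a rule for carving the input space into blocks, plus a rule for labeling each block uniformly.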

The third big point is that some problems are not very amenable to
fence-and-fill learning. He calls these ``relational'' problems, which I think
is a bad name --- but let us leave that for discussing the *other*
book within these covers. The prototype of these problems, to Thornton's mind,
is parity: given a number, guess whether it is even or odd. The kicker here is
that guessing it has the same parity as its nearest neighbor is always wrong, and no
finite number of (contiguous) partitions will work. If fence-and-fill
won't work, what will?
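It takes only a few lines to see the failure (my own illustration, not Thornton's): hold out each integer in turn and guess its parity from its nearest labeled neighbor.

```python
# Parity as a learning problem: input n, output n % 2.
train = {n: n % 2 for n in range(0, 20)}

def nn_predict(x, known):
    """Guess the output of the nearest input whose value is known."""
    nearest = min(known, key=lambda n: abs(n - x))
    return known[nearest]

# Hold out each point in turn: its nearest neighbor is always an
# adjacent integer, whose parity is always the opposite of its own.
errors = sum(
    nn_predict(x, {n: v for n, v in train.items() if n != x}) != x % 2
    for x in train)

print(errors, "errors out of", len(train))  # every single guess is wrong
```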

Two traditional strategies that Thornton considers, but dismisses as not really adequate, are what he calls ``supercharged fence-and-fill'' and ``pick-and-mix''. The first uses various tricks to make fence-and-fill easier to apply, in the hopes of getting some mileage out of it. One can expand the dimensionality of the input space (which in a sense makes it more likely that it clusters), or increase the number and complexity of the fences available. On the other hand, pick-and-mix seeds the learner with some basic relations which it may find helpful, and has it construct increasingly elaborate combinations of these relations in the hopes of predicting more accurately. The problem with supercharged fence-and-fill is that it doesn't really address the nature of the problem (or so Thornton says); the problem with pick-and-mix is that it demands prior knowledge of what is likely to work in this problem domain. Thornton regards neither of these as satisfactory.

What he *does* regard as satisfactory is the ``truth from trash''
procedure of his title. It works like this:

- Use the chosen fence-and-fill method to process the current input data, that is, to find regions of uniformly labeled stimuli.
- Recode the current input data so as to draw identically labeled data points closer together. ...
- Assign the derived data to be the current data and apply the procedure recursively until data are generated that can be satisfactorily processed solely by the fence-and-fill method. [p. 171]

As Thornton notes, truth-from-trash does not have to assume that the clusters it develops at each stage are meaningful in themselves, just that they are going to be useful in deriving a yet-more-accurate cluster at the next step. Thornton doesn't prove that this kind of ``proto-representational learning'' will do any better than supercharged fence-and-fill, much less than appropriately chosen pick-and-mix, but I agree that it does have a better claim to tackling relational problems than either.
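As a toy illustration of the recoding step (mine, not Thornton's actual procedure --- in particular, this recode peeks directly at the neighbors' labels, which a real learner could not do; it is only meant to convey the flavor of drawing identically labeled points together):

```python
# One illustrative "recode" pass on the parity problem: replace each
# input by the mean label of its two nearest neighbors. Identically
# labeled points collapse onto the same coordinate, after which plain
# fence-and-fill (even one fence) suffices.
data = [(n, n % 2) for n in range(1, 19)]  # interior points only

def recode(n):
    return ((n - 1) % 2 + (n + 1) % 2) / 2  # mean label of the two neighbors

recoded = [(recode(n), label) for n, label in data]
# All even inputs now sit at coordinate 1.0, all odd inputs at 0.0:
print(sorted(set(recoded)))  # -> [(0.0, 1), (1.0, 0)]
```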

One of the nicer aspects of truth-from-trash is that it shows how to
construct the kind of abstract, decoupled, ``symbolic'' representation beloved
of Good Old-Fashioned AI from the more dynamical, input-driven, connectionist
devices of ``nouvelle'' AI. (*Pace* Thornton, truth-from-trash
*is* a symbol system within the meaning of the act, or at any rate of Herbert Simon's definition. So are connectionist
systems.) This is all to the good, since (as emphasized by Daniel Dennett, Andy Clark, and others on the side of the
angels) cognition and the nervous system plainly employ both reactive,
embodied, nouvelle-ish devices and abstractive, symbol-manipulating,
disembodied devices, and no good can come of pretending that one or the other
doesn't exist, or that they can't work together.

Let me now turn to the bad book, which begins about where Thornton starts
talking about hard learning problems, ones fence-and-fill doesn't work very
well on. As mentioned, he calls them ``relational,'' but it will become
clear why I think that's not a very good name for what he has in mind. He
offers three characterizations of these problems, one of which is definitely
not equivalent to the other two, which in turn are not *obviously*
equivalent to each other, though they may be. He does *not* use
``relation'' in the mathematical-logical sense of a set of ordered pairs (or
more generally *n-*tuples). After all, what similarity-learning does is
learn a relationship between the input values and the output, in this sense.
So he needs to mean something more.

The first characterization he offers of what he means is that a relational problem is one where the absolute values of the inputs are uninformative as to the output, but the presence or absence of some relation among the inputs is. This can be characterized perfectly straightforwardly in information-theoretic terms (though he does not do so): a problem is perfectly relational in this sense if the mutual information between the output variable and any one of the input variables is zero (and the mutual information between the output and the whole set of the input variables is maximal). Moreover, this certainly captures an obvious sense of ``relational''. It's not obvious that this must be much harder than what similarity-learning does, however.
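This can be checked directly on the parity problem; the mutual-information calculation below is my own illustration, not Thornton's:

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_info(pairs):
    """I(X;Y) in bits, from a list of equally likely (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Two-bit parity: output = x1 XOR x2, over all four equally likely inputs.
samples = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
single = mutual_info([(x1, y) for (x1, _), y in samples])  # one input alone
joint = mutual_info(samples)                               # both inputs
print(single, joint)  # -> 0.0 bits vs 1.0 bit: perfectly "relational"
```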

The second characterization is given by his measure of ``geometric
separability,'' defined as ``the proportion of data-points whose nearest
neighbors share the same output''. This clearly means that problems with low
geometric separability are not amenable to similarity-methods, and perhaps not
to fence-and-fill at all (though I'd be much more cautious about that, absent a
proof, than he is). But this is *not* equivalent to his first
characterization (a counter-example will be presented presently).

Finally, he characterizes relational problems as those where the proper
partitions are not ``geometric or spatial'' (p. 119). Absent a
characterization of what he means by ``geometric or spatial,'' this merely
indicates an impoverished view of geometry! (I say this despite being an
almost entirely verbal thinker.) Given his preoccupation with linear
separability and the like, I *guess* that what he has in mind is
something on the order of problems where the correct partition cannot be
represented by a finite number of simply-connected compact sets.

Let me demonstrate that his characterizations are not equivalent with a
counter-example. Consider the relationship <, ``less than,'' among integers
(positive and negative). Our input space is now the x, y plane, and we color a
point red if x < y, and white otherwise. Every point on or below the 45-degree
diagonal will be white, and every point strictly above that line will be red. Now,
clearly just knowing the x coordinate of a point tells us absolutely nothing
about its color. Thus, as we'd hope, learning the relationship < is a
relational problem in sense (1). Unfortunately, in senses (2) and (3), it is
*not* relational. Confined to a finite range of inputs, the geometric
separability is very high, since only points on either side of the line x = y
have neighbors who differ in value from themselves; in the limit that we allow
x and y to range over all integers, the geometric separability approaches 1.
And not only can the problem be tackled geometrically, it is even *linearly
separable.*
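A quick numerical check of the separability claim (my own sketch; I sample distinct integer points at random so that nearest neighbors are unambiguous up to rare ties):

```python
import math
import random

random.seed(0)
# Sample distinct integer points from a 50x50 box; color by x < y.
points = random.sample([(x, y) for x in range(50) for y in range(50)], 500)

def label(p):
    return p[0] < p[1]  # "red" iff x < y

def separability(pts):
    """Fraction of points whose nearest neighbor shares their label."""
    hits = 0
    for p in pts:
        nn = min((q for q in pts if q != p), key=lambda q: math.dist(p, q))
        hits += (label(nn) == label(p))
    return hits / len(pts)

# Only points hugging the diagonal can have an opposite-colored nearest
# neighbor, so the measure comes out close to 1.
print(separability(points))
```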

I conclude that what Thornton *really* needs is either (2) or (3),
but that neither of these should be called ``relational''. (I do not have a
handy counter-example to the equivalence of senses (2) and (3), and for all I
know they may be equivalent.)

There is also a numerical argument made, that in a non-relational problem the number of partitions to be considered is finite, whereas in a relational one it is infinite. This is not so, but he seems to have been thinking of the following point. The number of binary partitions of a single variable, which has N possible values, is the number of subsets of the input space, which is 2^N. The number of possible relations among two variables, taking on N and M possible values, is 2^(NM), since it is just the number of subsets of the product space, which has cardinality NM. For realistic values of N and M, 2^(NM) is much, much larger than 2^N (or 2^M for that matter). But in the case that actually concerns Thornton, what matters is the number of values of the input vector as a whole, and that is NM either way, so the number of possible partitions is the same.
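In numbers (N = M = 10, purely illustrative):

```python
# Counting binary partitions (i.e., subsets) of the input space.
N, M = 10, 10
one_var = 2 ** N         # partitions of a single N-valued variable
two_vars = 2 ** (N * M)  # partitions of the NM-valued product space
print(one_var)             # -> 1024
print(len(str(two_vars)))  # -> 31 (a 31-digit number: vastly larger)

# But the input vector (x1, x2) as a whole also takes N*M distinct
# values, so the count of candidate partitions is 2**(N*M) either way.
```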

Whatever you call these problems, they clearly exist and need to be solved.
Supercharging exploits the fact that many problems will *partially*
yield to fence-and-fill, and regards that as better than nothing. Thornton is
rightly skeptical of this, and has an elaborate parable of a poker-playing
machine to make his point. Pick-and-mix will work, but only if the primitive
relationships it's equipped with are adapted to the task. Thornton seems to
doubt that we can arrange for such adaptation in general, as well as rejecting
it on the grounds that it doesn't *guarantee* learning will work.
Unfortunately, as he himself points out, one can show that no such guarantee is
possible. For any problem where a given learning method does better than
blind search, there is another problem where it does worse, and one cannot
guarantee, in a completely *a priori* fashion, that a malevolent demon
(or, as the computer scientists suggestively put it, an Adversary) will not
arrange for us to always try our luck on problems ill-suited to our tools. This is what is called the ``No Free Lunch'' theorem
in machine learning. Its relevance to the actual world is, to say the
least, not immediately obvious, since we do *not* sample the space of
all possible learning problems uniformly, or even under the direction of an
Adversary. It is thus, *pace* Thornton, not at all absurd to suggest
that we can learn what learning algorithms have above-chance performance *on
the kinds of problems we actually encounter.* In other words, the
background knowledge for pick-and-mix *is* acquirable. (I learned this
counter to the no-free-lunch argument from John Holland; so far as I know it's
original to him.)

That the products of one level of learning or neural processing are the raw materials of subsequent levels, which Thornton makes out to be something of a breakthrough associated with truth-from-trash, is in fact a very old notion. He actually quotes John Locke to this effect, without quite realizing, it seems, how ancient and commonplace this notion is. I will merely repeat one of my favorite passages from William James, who puts it quite nicely:

[T]he mind is at every stage a theater of simultaneous possibilities. Consciousness consists in the comparison of these with each other, the selection of some, and the suppression of the rest by the reinforcing and inhibiting agency of attention. The highest and most elaborated mental products are filtered from the data chosen by the faculty next beneath, out of the mass offered by the faculty below that, which mass in turn was sifted from a still larger amount of yet simpler material, and so on. The mind, in short, works on the data it receives very much as a sculptor works on his block of stone. In a sense the statue stood there from eternity. But there were a thousand different ones beside it, and the sculptor alone is to thank for having extricated this one from the rest. Just so the world of each of us, how so ever different our several views of it may be, all lay embedded in the primordial chaos of sensations, which gave the mere matter to the thought of all of us indifferently. We may, if we like, by our reasonings unwind things back to that black and jointless continuity of space and moving clouds of swarming atoms which science calls the only real world. But all the while the world we feel and live in will be that which our ancestors and we, by slowly cumulative strokes of choice, have extricated out of this, like sculptors, by simply removing portions of the given stuff. Other sculptors, other statues from the same stone! Other minds, other worlds from the same monotonous and inexpressive chaos! My world is but one in a million alike embedded, alike real to those who may abstract them. How different must be the worlds in the consciousness of ant, cuttlefish, or crab! [Principles, vol. I, ch. IX, p. 288]

Thornton regards this by-now elementary fact as delivering a crushing blow to the objectivity of perception and to scientific realism. Of course it does nothing of the kind; it is perfectly possible that the ``mental products'' so selected correspond to real features of the world.

Thornton rightly connects the kind of change-of-representation which truth-from-trash works with creativity, or at any rate with one aspect of it. His speculations about the sources of aesthetic pleasure seem eminently dismissable (art does not aim at accurate representation, and probably for most of its history creativity as such has not been valued), and he doesn't seem to be aware of a large and respectable literature on computational modeling of creativity and change-of-representation, e.g., the excellent book of Margaret Boden (who is, as it happens, Thornton's colleague in the evolutionary and adaptive systems group at the University of Sussex!). The quotations from Einstein about the superiority of imagination to knowledge are more fit for dorm-room posters than serious discussion.

The trick of mushing the input space into a feature space where easy learning methods work better is not original to truth-from-trash. Thornton surely knows this --- it's featured in tomes on pattern recognition since time out of mind, i.e., the early 1970s at least, and is the core of the ``support vector machine'' approach of Vapnik and co. --- but I think any naive reader would come away thinking otherwise. The main novelty here is the emphasis on a recursive, data-driven recoding, whereas, e.g., the SVM crowd tend to emphasize the use of fixed, hand-crafted kernels.
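The classic illustration of such a recoding is XOR (my own sketch; the product feature is hand-picked, which is exactly the point about fixed kernels):

```python
# XOR is not linearly separable in the raw inputs (a, b), but mapping
# each point into a feature space that includes the product a*b makes
# a single linear threshold suffice.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def feature(p):
    a, b = p
    return (a, b, a * b)  # hand-crafted feature map

def predict(p):
    w = (1, 1, -2)  # linear weights in feature space: a + b - 2*a*b
    return int(sum(f * wi for f, wi in zip(feature(p), w)) > 0.5)

print(all(predict(p) == y for p, y in data))  # -> True
```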

Outside his immediate domain of machine learning, Thornton is not an
altogether reliable informant. For instance (pp. 45--46), after correctly
giving the formula for the (Shannon) information contained in a random
variable, he tells us that this equation is ``identical --- save for the
leading minus --- to the *entropy* equation from statistical
thermodynamics. Hence the name `negentropy' '' (his italics) for information.
Alas, as a glance at any text on statistical mechanics would show, we
physicists use the minus sign too, and in fact thermodynamic entropy
*is* the Shannon information of the atomic-level physical variables.
(This is actually a rather profound and important point in our discipline.)
The name ``negentropy,'' of course, arose in quite a different way, and is
in any case now obsolete, e.g., it does not appear in the index to Cover and Thomas's definitive book.

Or again, consider his description of Karl Popper's philosophy of science (pp. 138--139; all italics Thornton's):

Popper argues that since confirmation of inductively derived theories and laws is out of the question, science should proceed by focusing on *refutation*. A single counterexample is sufficient to falsify a theory with absolute certainty. Science can therefore build secure foundations by accumulating *falsifications*. The methodological implications follow automatically: respectable scientists should attempt to select new theories according to their degree of falsifiability rather than according to their degree of plausibility. Unfortunately, although Popper's scheme has a certain zany appeal, it is not entirely practical. If we take his proposal literally, there is nothing to prevent us --- as scientists --- from selecting a long sequence of incorrect theories. Provided each theory is falsifiable, we are proceeding in an entirely proper, Popperian manner. But the net effect is, of course, that no real progress is ever made.

Now, one can justly criticize Popper's philosophy of science on many grounds, but claiming that it will not prevent us from making mistakes is hardly one which can be taken seriously. More importantly, this completely misses Popper's point, which is that we should prefer theories which have survived strenuous attempts at refutation.

The discussion of quantum mechanics is appalling. Wave-particle
duality is a thoroughly obsolete notion, a needlessly misleading way of saying
that models of quantum phenomena have features reminiscent of models of
classical waves and of models of classical particles. But they are
mathematically consistent beasts in their own right, and saying that this is
paradoxical, much less saying that they're both waves and particles as
observation demands, is a bit like saying that human beings exhibit
squirrel-ostrich duality --- we're squirrels in that we are mammals, and
ostriches in as much as we are flightless bipeds. (The poet is right when she
calls us ``hybrids of plants and ghosts,'' but that's metaphor, not ontology.)
Nor does the uncertainty principle have *anything* to do with
measurements disturbing the measured; it doesn't have anything to do with
measurements at all! The worst of it is, he doesn't *need* anything
about quantum mechanics at all! Thornton is trying to make the point that good
scientific theories generally do not make sense of things using our everyday,
``lifeworld'' concepts, a point which can be made by using *any*
developed science. Classical mechanics would do: those who have taught it, or
remember learning it, or know the history of its development, can all testify
that its notions are *not* intuitive ones. The same is true, as I said,
of every science which has reached a creditable stage of development; employing
intuitive notions (as do, e.g., some branches of sociology and psychology) is a
*prima facie* sign that the discipline is not genuinely explanatory at
all. But I would say that this is, if anything, reason to think that
scientific concepts *do* tell us something about reality, independent of
ourselves. If we were just fooling ourselves, physics and biology would look
like psychotherapy, or literary criticism, or libertarian political theory,
which once upon a time --- before 1600, roughly, for physics --- they did.

Let me finally come around to the Deepest Issue that Thornton deals with,
namely induction and the reliability of empirical knowledge (ch. 10). The
quandary is this. All our knowledge of the real world --- of ``matters of fact
and existence,'' in David Hume's words --- derives from empirical evidence
and induction. But induction cannot be deductively justified *a
priori,* cannot be justified from an argument with an empirical premise (on
pain of regress), and cannot even be justified inductively, on the grounds that
it's generally worked in the past, since that is transparently a vicious
circle. (Skeptical arguments against induction are ancient; Hume seems to have
been the first to give the vicious-circle argument.) Therefore, induction
cannot be justified at all, and we have no real reason --- as opposed to mere
animal faith --- to think that any of our generalizations about the real world
will continue to hold.

Bertrand Russell once said that a special section of Hell is reserved for
those who claim to have refuted Hume on this point. Thornton is not in danger
of perdition (not on that count anyway), but a longish spell in Purgatory seems
fair. He claims that the Humean predicament can be evaded by worrying about
data compression instead. We know that we can compress data, and we can easily
tell whether one compression scheme works better than another on a given hunk
of data. Every compression scheme more or less explicitly works by extracting
regularities, so where's the problem? But this misses the point, which is that
we don't just want to compress *this* particular hunk of data; we want a
way of getting regularities worthy of the name, ones which will continue to
hold good for new data, and compression as such does not provide this.
Certainly schemes like Rissanen's minimum
description length principle do not. One can prove that, *if*
certain regularities continue to hold, then a good compression scheme for our
training data will, very likely, remain a good way of compressing test data,
but that does not address the point.

In the same connection, Thornton justifies Occam's Razor, not as a heuristic but as a tautology, in a really extraordinary piece of reasoning:

Simpler inductive inferences necessarily make fewer assumptions about the data. Every assumption has an even chance of being invalid. Thus simpler inductive inferences will, in general, be predictively more successful. If it has any effect at all, then, the application of Occam's Razor within any sort of inductive process *must* improve predictive accuracy. [p. 162, his italics]

This is absolutely astonishing, coming from somebody who certainly knows about the existence and importance of the accuracy-complexity tradeoff in modeling. If what he says here were literally the case, then we would have no reason to worry about overfitting. (A simple counter-example from my own field: when modeling stochastic processes, the simplest model, by a very well-defined metric, is the one which always treats the process as a sequence of independent, identically-distributed random variables, i.e. as biased coin-tossing or dice-rolling. This is admirably simple, but its accuracy is, in general, dismal.) So he ought to know better.

I'm only going to touch on the very strange idea that every assumption has an even chance of being wrong, by saying that I don't think it makes any sense. It assumes that it's possible to individuate assumptions --- to find some sensible way of counting how many of them an inductive method makes. To quote C. S. Peirce, ``if universes were as plenty as blackberries, if we could put a quantity of them in a bag, shake them well up, draw out a sample and examine them,'' then we could give some kind of meaning to the chance of an assumption being wrong. (Even from a Bayesian, subjective-probability view, there is no reason to suppose that all those probabilities are 50-50, and indeed much, like the ``convergence to truth'' theorem, to doubt it.)

Let me say that I don't think Thornton is a dumb man or even a bad
researcher. On the contrary, I came to this book with high expectations
because his professional papers show he is a *good* researcher. This
book does not do him justice. He has attempted to combine popular exposition,
a series of new technical proposals, and philosophical reflections in a single
short volume, and, as might've been expected, all his projects have suffered.
I have no doubt that he is aware of the bulk of the related research I have
faulted him for not mentioning, and only very little doubt that he would have
cited it had he not also been aiming for a popular audience. His carelessness
--- and, as the case of negentropy shows, confabulations --- with material from
other disciplines is much more troubling, but again, if he were writing for a
more hard-bitten (not to say embittered) audience of researchers, I think that
would be better controlled. On the other hand, he certainly *can* write
popular science, though he is a bit too fond of using cute stories to make his
points. What he cannot do (here I perform induction on a single instance) is
succeed at writing for both audiences at once. I will certainly continue to
follow his technical papers with every expectation of profiting from them, and
if he writes purely popular books I'll happily read them. (*Warning:*
an awful and unfair but inevitable pun follows.) As for this hybrid, I'm
afraid that he has done too little to separate the truth from the trash.

xii + 204 pp., index of names and subjects (sparse), end-notes for each chapter, bibliography, numerous tables and graphs, a few black and white photos (or rather, quite poorly digitized reproductions of photos)

Artificial Life and Agents / Cognitive Science / Computers and Computing / Philosophy of Science / Popular Science / Probability and Statistics

Currently in print as a hardback, US$29.95, ISBN 0-262-20127-5, LoC Q325.4 T47.

19--20 June 2000

Thanks to Bill Tozier and Danny Yee for advice.