June 22, 2012

Ockham Workshop, Day 1

Those incorporated by reference: Workshop on Foundations for Ockham's Razor; "The Analytical Language of John Wilkins"

Those that belong to the emperor: The morning's first speaker was Vladimir Vapnik, who made fundamental contributions to machine learning from the 1970s through the 1990s. Reading his book The Nature of Statistical Learning Theory is one of the reasons I became a statistician. This was the first time I'd heard him speak. No summary I could give here would do justice to the unique qualities of his presentation.

Those that are trained: Peter Grünwald talked about the role of results which look very much like Bayesian posteriors, but make no sense within the Bayesian ideology, in deriving data-dependent bounds on the regret or risk of predictors. This material drew on the discussion of "luckiness" in his book on The Minimum Description Length Principle. At the risk of distorting his meaning somewhat, I'd say that the idea here is to use a prior or penalty which deliberately introduces a bias towards some part of the model space. The bias stabilizes the predictions, by reducing variance, but the bias isn't so big as to produce horrible results when reality (or the best-approximation-to-reality-within-the-model-space) doesn't line up with the bias. This leads to an Occam-like result because it is very hard to design priors or penalties which obey the Kraft inequality and don't bias the learner towards smaller models --- that don't give first-order Markov chains, for instance, more weight than second-order chains. More exactly, you can design perverse priors like that, but then there are always non-perverse (healthy?) priors which have uniformly tighter performance bounds.
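To put the Kraft-inequality point in symbols (my gloss, not necessarily Grünwald's notation): give each model $M$ a description length $L(M)$ from some prefix code, which forces the lengths to satisfy the Kraft inequality, and let the prior weight of a model fall off exponentially in its description length:

    \[
      \sum_{M \in \mathcal{M}} 2^{-L(M)} \;\le\; 1,
      \qquad
      \pi(M) \;\propto\; 2^{-L(M)} .
    \]
    % A second-order Markov chain has more parameters to write down than a
    % first-order one, hence a longer description under any such code, and
    % hence less prior weight; the bias towards smaller models comes for free.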

Those drawn with a very fine camel-hair brush: Larry Wasserman talked about the need to pick the control settings of our procedures (penalty levels, bandwidths, and the like) and how, despite what we keep telling students, we don't really have an entirely satisfactory way of doing this, a point which was graphically illustrated with some instances of cross-validation consistently picking models which were "too large" --- too many variables in Lasso regression, too many edges in graph inference, etc. Part of the problem is that CV error curves are typically asymmetric, with a very sharp drop-off as one gets rid of under-fitting by including more complexity, and then a very shallow rise as complexity increases further. Larry hinted at some ideas for taming this, by testing whether an apparent improvement in CV error from extra complexity is real, but had no strong solution to offer, and ended up calling for a theory of when and how we should over-smooth.
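By way of illustration (a toy simulation of my own, not anything from the talk), here is the Lasso version of the phenomenon in a few lines of Python, using scikit-learn's LassoCV; the sample sizes, coefficients, and seed are arbitrary:

    # Sparse linear model: 5 relevant predictors out of 50, unit-variance noise.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(0)
    n, p, k = 200, 50, 5
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:k] = 1.0                      # only the first k variables matter
    y = X @ beta + rng.standard_normal(n)

    fit = LassoCV(cv=10).fit(X, y)      # penalty level chosen by 10-fold CV
    n_selected = np.count_nonzero(fit.coef_)
    print("relevant variables:", k, "| selected by CV-tuned Lasso:", n_selected)
    # The selected count tends to exceed k: once the true variables are in,
    # the CV error curve is nearly flat, so under-penalizing costs little.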

Those that resemble flies from a distance: Elliott Sober closed out the first day by trying to explicate two examples. The first (on which he spent less time) was why, when two students turn in exactly the same essay, we regard it as more plausible that they both copied from a common source, as opposed to their just happening to produce the exact same string of words. He mentioned Wesley Salmon, and Reichenbach's Common Cause Principle, but not, sadly, "Pierre Menard, Author of Don Quixote". Sober devoted more attention, however, to issues of parsimony in phylogenetics, where one of the most widely used techniques to reconstruct the state of ancestral organisms is to posit the smallest number and size (in some sense) of changes from the ancestor to its descendants, taking into account that some of the descendants may have more recent common ancestors. Thus humans and chimpanzees are more closely related to each other than to gorillas; since humans alone have opposable thumbs, the minimum-parsimony inference is that the common ancestor did not, and thumbs evolved once, rather than being lost on two separate occasions. (Such reasoning is also used, as he said, to figure out what the best tree is, but he focused on the case where the tree is taken as given.) What he mostly addressed was when parsimony, in this sense, ranks hypotheses in the same order as likelihood. (He did not discuss when parsimony leads to accurate inferences.) The conditions needed for parsimony and likelihood to agree are rather complicated and disjunctive, making parsimony seem like a mere short-cut or hack --- if, that is, you think it should be matching likelihood. He was, however, clear in saying that he didn't think hypotheses should always be evaluated in terms of likelihood alone. He ended by suggesting that "parsimony" or "simplicity" is probably many different things in many different areas of science (safe enough), and that when there is a legitimate preference for parsimony, it can be explained "reductively", in terms of service to some more compelling goal than sheer simplicity.
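For the concrete case, a toy rendering (mine, not Sober's) of that minimum-change calculation, using Fitch's small-parsimony algorithm on the fixed tree ((human, chimp), gorilla) with a single binary character coded as in the thumb example:

    def fitch(node, states):
        """Return (candidate ancestral states, minimum number of changes) for a
        node, where a node is either a leaf name or a (left, right) pair."""
        if isinstance(node, str):                      # leaf: state is observed
            return {states[node]}, 0
        left, right = node
        s_l, c_l = fitch(left, states)
        s_r, c_r = fitch(right, states)
        if s_l & s_r:                                  # agreement: no extra change
            return s_l & s_r, c_l + c_r
        return s_l | s_r, c_l + c_r + 1                # disagreement: one more change

    tree = (("human", "chimp"), "gorilla")
    observed = {"human": 1, "chimp": 0, "gorilla": 0}  # 1 = trait present
    print(fitch(tree, observed))
    # ({0}, 1): the most parsimonious ancestor lacks the trait, with one gain
    # on the human lineage rather than two independent losses.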

Those that are included in this session classification: optical character recognition; Statistical Modeling: The Two Cultures; falsification (in at least two different senses); at least three different distinctions between "induction" and "deduction"; cladistics; Kant's Critique of Judgment; instrumentalism; gene expression data; cross-validation; why all machine learning assumes stationary data; machine learning results which do not rely on stationarity; Chow-Liu trees.

Fabulous ones, or perhaps stray dogs: classical statisticians (as described by several speakers).

Those that tremble as if they were mad with eagerness to help participants: Maximum Likelihood, an Introduction; Conditioning, Likelihood, and Coherence; Prediction, Learning, and Games; The Role of Ockham's Razor in Knowledge Discovery; Falsification and future performance

And now, if you'll excuse me, I need more drinks to revise my slides for tomorrow. If you really must know more, see Larry, or Deborah Mayo.

Update, next morning: disentangled some snarled sentences.

Manual trackback: Organizations and Markets

Enigmas of Chance; Philosophy

Posted at June 22, 2012 22:27 | permanent link
