Learning Theory (Formal, Computational or Statistical)

21 Sep 2022 16:00

I qualify it to distinguish this area from the broader field of machine learning, which includes much more with lower standards of proof, and from the theory of learning in organisms, which might be quite different.

The basic set-up is as follows. We have a bunch of inputs and outputs, and an unknown relationship between the two. We do have a class of hypotheses describing this relationship, and suppose one of them is correct. (The hypothesis class is always circumscribed, but may be infinite.) A learning algorithm takes in a set of inputs and outputs, its data, and produces a hypothesis. Generally we assume the data are generated by some random process, and the hypothesis changes as the data change. The key notion is that of a probably approximately correct (PAC) learning algorithm --- one where, if we supply enough data, we can get a hypothesis with arbitrarily small error, with probability arbitrarily close to one.
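In symbols, a standard formulation runs roughly as follows (writing err_D for the true error of the learned hypothesis under the data-generating distribution D):

```latex
% A PAC-learns the hypothesis class H if, for every \epsilon, \delta > 0,
% there is a sample size m(\epsilon, \delta) such that, for every
% distribution D and every n \geq m(\epsilon, \delta), on i.i.d. data
% Z_1, \dots, Z_n \sim D,
\Pr\!\left[ \operatorname{err}_D\!\big( A(Z_1, \dots, Z_n) \big) \le \epsilon \right] \ge 1 - \delta
```

Here epsilon is the "approximately" and delta the "probably"; the point of the definition is that m depends on both, but not on D.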

Generally, PAC results concern (1) the existence of a PAC algorithm, (2) quantifying how much data we need to reach a given accuracy and reliability, or (3) devising new PAC algorithms with other desirable properties. What frustrates me about this literature, and the reason I don't devote more of my research to it (aside, of course, from my sheer incompetence), is that almost all of it assumes the data are statistically independent and identically distributed. Then PAC-like results follow essentially from extensions of the ordinary Law of Large Numbers. What's really needed, however, is something more like an ergodic theorem, for suitably-dependent data. That, however, gets its own notebook.
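To see how little is needed beyond the Law of Large Numbers in the i.i.d. case, here is the textbook sample-complexity calculation for a finite hypothesis class, via Hoeffding's inequality plus a union bound over the class (the function name is mine):

```python
import math

def pac_sample_size(h_size: int, epsilon: float, delta: float) -> int:
    """Samples sufficient for uniform convergence over a finite hypothesis
    class H, given i.i.d. data, from Hoeffding's inequality plus a union
    bound:  m >= ln(2|H|/delta) / (2 epsilon^2)."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * epsilon ** 2))

# e.g., 1000 hypotheses, 5% error, 95% confidence
m = pac_sample_size(1000, epsilon=0.05, delta=0.05)
```

Note the logarithmic dependence on both the class size and 1/delta, against the quadratic dependence on 1/epsilon; the whole bound collapses once the samples stop being independent.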

An interesting question (which I learned of from Vidyasagar's book) has to do with the difference between distribution-free and distribution-dependent bounds. Generally, the latter are sharper, sometimes much sharper, but this comes at the price of making more or less strong parametric assumptions about the distribution. (One might indeed think of the theory of parametric statistical inference as learning theory with very strong distributional assumptions.) However, even in the distribution-free set-up, we have a whole bunch of samples from the distribution, and non-parametric density estimation is certainly possible --- could one, e.g., improve the bounds by using half the sample to estimate the distribution, and then applying a distribution-dependent bound? Or will the uncertainty in the distributional estimate necessarily kill any advantage we might get from learning about the distribution? It feels like the latter would say something pretty deep (and depressing) about the whole project of observational science...
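A toy version of the splitting idea, in the simplest possible setting (bounding the mean of a bounded variable; all names here are mine): use half the sample to estimate the variance, then apply a variance-dependent Bernstein-type bound on the other half, and compare against the distribution-free Hoeffding bound.

```python
import math
import random

def hoeffding_radius(n: int, delta: float) -> float:
    """Distribution-free confidence radius for the mean of [0,1]-valued
    i.i.d. data, from Hoeffding's inequality."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def bernstein_radius(n: int, delta: float, var: float) -> float:
    """Bernstein-type radius: sharper than Hoeffding when the variance is
    small, but it needs to know (or estimate) that variance."""
    log_term = math.log(2 / delta)
    return math.sqrt(2 * var * log_term / n) + 2 * log_term / (3 * n)

random.seed(0)
data = [random.betavariate(20, 2) for _ in range(2000)]  # a low-variance source

# Split: estimate the variance on one half, bound the mean on the other.
half, rest = data[:1000], data[1000:]
mean_half = sum(half) / len(half)
var_hat = sum((x - mean_half) ** 2 for x in half) / (len(half) - 1)

r_free = hoeffding_radius(len(rest), delta=0.05)               # ignores the distribution
r_plug = bernstein_radius(len(rest), delta=0.05, var=var_hat)  # uses the estimate
```

On a low-variance source like this, r_plug comes out much smaller than r_free --- but the comparison is not honest as written, since it treats var_hat as exact, which is precisely the uncertainty the question above worries about. Empirical-Bernstein bounds, which pay an explicit price for estimating the variance, are the rigorous version of this move, and the fact that they still beat Hoeffding for low-variance sources suggests the answer is not always the depressing one.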

To learn more about: stability-based arguments.

See also: Concentration of Measure; Conformal Prediction; Decision Theory; Empirical Process Theory; Ensemble Methods in Machine Learning; Frequentist Consistency of Bayesian Procedures; Low-Regret Learning; Statistics; Statistics of Structured Data; Universal Prediction Algorithms