Problems Worth Solving
is blogging (again), and
anyone who finds my writings interesting would do better to read his.
Larry's latest post is a call for
unsolved problems in statistics and machine learning. As he says, the
current Wikipedia page
problems in statistics is not what he's looking for, as all the
examples are either "boring", "interesting but vague", or silly (or, as he puts
it, "you've got to be kidding"). As he says, a good big problem is one which
can be stated succinctly, can be explained to a non-specialist, and "when you
explain it to someone they think it is cool, even if they don't know what you
are talking about".
I am not sure that any of these really qualify, because they're all
less sexy than the "P=NP", or Navier-Stokes problems Larry has in mind
from other disciplines. But they're ones which bug me, and seem like
they might have more leverage than (save me) checking admissibility of some
point estimate or another. I have some ideas about some of them, but hopefully
my collaborators will not regard me as setting us up to be scooped by mentioning
the problems. And if you know of solutions to any of these, please do tell.
- Devise an algorithm which tells us exactly how much of a statistical model
can be linked to data, and how much must be taken on faith. More specifically,
given a parametric model, either return that it is fully
identified, or, if it's not, construct
the maximal identifiable
parameter, and figure out all of the parameters which
- Apply said algorithm to the problem
regression, i.e., where we have more predictor variables than data-points.
How much of the regression function is identified?
- When and how does sub-sampling or re-sampling give valid generalization
- When we don't have that much data but do have a bunch of similar cases, we
often want to do partial pooling. The traditional way to do partial pooling is
Bayesian models. Do we lose anything by instead using shrinkage penalties,
with the size of the penalties picked by cross-validation? If so, what do we
- How do we put valid confidence intervals on coefficients of linear models
which were estimated with penalties, like ridge regression or the lasso?
- Find sufficient (and necessary?) conditions for the lasso to be
model-selection-consistent which users can (a) understand and (b) check.
- In applications, everyone does
multi-fold cross-validation. When and
why, exactly, does it work? When will it definitely not work? How
different are the answers if "work" means "predict well out of sample" or "get
the model structure right"?
complexity puts powerful bounds on generalization error for bounded loss
functions. What is the equivalent for unbounded loss functions (like squared
- Three technologies which have become quite central to actually doing
statistics since the 1980s are smoothing (for regression and density
estimation), the bootstrap (for
for picking control settings. We know how to do these for independent data,
and for time series, and more or less for
spatial data. How do we make any of them work for networks, or for relational
data more generally?
- For that matter, how do we draw big networks in a way which doesn't lead to what M. E. J. Newman calls a "ridiculogram"?
- Why does boosting work? (I incline
and Freund's margin-based explanation, but it's certainly not settled.)
- Most current ensemble methods
are simply averaging/voting, perhaps with weights. What advantages, if any,
are there to more elaborate aggregation mechanisms?
- If a mixture-of-experts algorithm achieves low regret, and there is a
uniform generalization error bound for the experts' risks, when can we bound
the ensemble's risk?
(Cf. a, b.) If we can give such a bound, is there any reason to not use low-regret algorithms?
- What is the weakest dependence condition on a time series which lets us
give distribution-free bounds like the IID ones? The weakest dependence
condition for a random field? Can these dependence conditions be tested?
- Develop mis-specification tests where failing the test indicates how the
model should be changed.
smooth test and
algorithm are examples of what I have in mind, so this might hit the
"interesting but vague" flaw.)
- What are the weakest conditions under
which causal discovery is
uniformly consistent? The weakest ones which could be invoked in good
conscience in applications?
- Suppose that the real data-generating process is an
absurdly-high-dimensional linear system, but all the variables we measure are
random, comparatively-low-dimensional projections of the real variables (not
necessarily onto linear surfaces). What would typical regression relations
among such projections look like? How different are they from the regression
functions we actually see? Should we expect sparsity under such conditions, or
- Give a rule for picking the number of clusters which (a) converges to the
true number of clusters when there is one, and (b) predicts well even when
there really isn't a finite number of clusters. (Presumably any good
answer here would also work for community discovery in networks.)
- Give a similar rule for picking the dimension of a latent variable,
without (necessarily) assuming linear structure.
- Many model families contain a discrete, "structural" parameter --- graphs
in graphical models, trees in phylogenies or hierarchical clusterings, etc.
How do we give a confidence set for such parameters? We could just test each
possibility, and report the set which we don't reject, but this is too
expensive because of combinatorial explosions, and would often look ugly; could
we do some sort of branch-and-bound? If we try to pick the discrete structure
through model selection, how do
we handle not having a single chain of nested models, but rather a lattice?
How do we detect significant differences in structure, or do partial pooling
Something which is definitely too vague would be "when, if ever, is
parsimony a good guide to choosing between models?".
what I'll be talking about for the next few days.
Update, next day: Fixed some typos, added some missing
links, added one problem at the end.
Statistics was nice enough to link to this, observing that most (I'd say
all) of these problems are "category 1"
three-part scheme, motivated by internal considerations from statistics. I
don't think any are motivated by convenient data sets (their category 2), or
pressing scientific questions (category 3). For the record I think that's
unfortunate; it's problems in the last category, where there's an actual issue
about the world we need to resolve, which matter most. (Their "category 2"
seems like exercises — perhaps very difficult exercises.) But for better
or for worse, it's a lot easier for me to think of problems in the first
category. What gives those problems their interest, though, is the hope that
they could be used to answer lots of third-category questions...
Enigmas of Chance;
Kith and Kin
Posted at June 21, 2012 23:09 | permanent link