\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin} \]

(My notes for this lecture are too incomplete to be worth typing up, so here's the sketch.)

If your method, or your implementation of a common method, doesn't work
under simulations, then you had better have a *very* good argument for
why it should be trusted with real data. *Always* go back and forth
between simulations and method development
\[
\mathrm{simulation}:\mathrm{methods}::\mathrm{unit\ tests}:\mathrm{functions}
\]

Many (most?) useful probability models are easily specified as step-by-step recipes for simulation, but have ugly, complicated distributions for the data. (Nonlinear dependencies and latent variables both tend to lead to complicated distributions.) Since the models have adjustable parameters, we would like to adjust them so that they match the data we have about the world. If the distributions were straightforward, we could just calculate the likelihood and maximize it (or use Bayesian updating, etc.), but that's not really possible when the distribution is ugly. What to do?
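To make this concrete, here is a minimal R sketch (my illustration, not from the notes) of the kind of model that is a trivial recipe to simulate but has no tractable likelihood: a noisy Ricker map observed through Poisson counts, the sort of model treated in Wood's paper in the references. The function name and parameter values are purely illustrative.

```r
# A noisy Ricker map with Poisson observations: easy to simulate step by
# step, but the likelihood of the observed counts is intractable because
# the population sizes N are latent and the dynamics are nonlinear.
simulate.ricker <- function(n, r, sigma, phi, N0 = 1) {
  N <- numeric(n)   # latent population sizes
  y <- numeric(n)   # observed counts
  N.prev <- N0
  for (t in 1:n) {
    # latent population follows a noisy nonlinear recursion
    N[t] <- r * N.prev * exp(-N.prev + rnorm(1, sd = sigma))
    # we only observe Poisson counts proportional to the latent state
    y[t] <- rpois(1, phi * N[t])
    N.prev <- N[t]
  }
  y
}
y <- simulate.ricker(n = 100, r = 10, sigma = 0.3, phi = 5)
```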

Moments (\( \Expect{X}, \Expect{X^2}, \ldots \)) are functionals of the
probability distribution. If our model has a \( k \)-dimensional parameter
vector \( \theta \), then we should *generally* be able to find \( k \)
moments which characterize or identify the parameter, meaning that there are
functions \( f_1, f_2, \ldots f_k \) such that (i) \( \Expect{X} = f_1(\theta),
\Expect{X^2} = f_2(\theta), \ldots \Expect{X^k} = f_k(\theta) \), and (ii) if
\( \theta_1 \neq \theta_2 \), then at least one of the \( k \) moments differs,
\( f_j(\theta_1) \neq f_j(\theta_2) \) for some \( j \in 1:k \).

If we knew the functions \( f_1, \ldots f_k \) and the population
moments, we could then solve for the matching \( \theta \). (We could also
test the model, as opposed to the parameter value, by seeing whether that \( \theta
\) let us match *other* population moments; and of course the model would be
ruled out if there were no solution for \( \theta \) at all.) Since we only see sample moments, in
the **method of moments** we try to solve for the parameters which
match them: \( \hat{\theta}_{MM} \) is the \( \theta \) where
\[
f_i(\theta) = \overline{x^i} ~, ~ i \in 1:k
\]
(We can then test this by looking at whether we can match *other*,
mathematically-independent functionals.)
Since it's annoying to have to keep the subscripts and superscripts \( i \)
around, let's say that \( f(\theta) \) is the vector-valued function of
theoretical moments, and \( m \) is the vector of sample moments. Our earlier assumption (ii) means that \( f^{-1} \) exists, so
\[
\hat{\theta}_{MM} = f^{-1}(m)
\]
By the law of large numbers, we expect \( m \) to approach the population
moments as we see more data. For \( n \) not-too-dependent
observations, the error should be \( O(1/\sqrt{n}) \), and then if \(
f^{-1} \) is a reasonably smooth
function, propagation
of error tells us that \( \|\hat{\theta}_{MM} - \theta \| = O(1/\sqrt{n}) \)
itself.
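To fix ideas before simulation enters, here is a minimal R sketch (my illustration, not from the notes) of the plain method of moments in a case where \( f^{-1} \) is available in closed form: a gamma distribution with unknown shape and scale.

```r
# Method of moments for a gamma distribution (shape a, scale s).
# The moment equations invert in closed form:
#   E[X] = a*s,  E[X^2] = a*(a+1)*s^2,  so  Var[X] = a*s^2.
x <- rgamma(1000, shape = 2, scale = 3)   # stand-in for the real data
m1 <- mean(x)
m2 <- mean(x^2)
v <- m2 - m1^2                            # sample variance, from the raw moments
shape.mm <- m1^2 / v
scale.mm <- v / m1
c(shape = shape.mm, scale = scale.mm)     # should come out near (2, 3)
```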

It might happen, due to sampling fluctuations, that the observed moments \( m \) are not ones which could ever be theoretical moments for the model --- that \( m \) is not in the range of \( f(\theta) \), no matter how we adjust \( \theta \). The usual approach then is to minimize: \[ \hat{\theta}_{MM} = \argmin_{\theta}{\| f(\theta) - m \|} \] (While I've written this for the ordinary Euclidean norm, it is often better to minimize a weighted norm, giving more emphasis to the moments which have smaller standard errors. This refinement can come later.)
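Continuing the gamma illustration, here is a minimal sketch of the minimization form, using `optim` rather than a closed-form inverse (minimizing the sum of squares is the same as minimizing the Euclidean norm):

```r
# Minimization form of the method of moments for the gamma example.
x <- rgamma(1000, shape = 2, scale = 3)            # same stand-in "data" as above
f <- function(theta) {                             # theoretical first two raw moments
  a <- theta[1]; s <- theta[2]
  c(a * s, a * (a + 1) * s^2)
}
m <- c(mean(x), mean(x^2))                         # sample moments
discrepancy <- function(theta) sum((f(theta) - m)^2)
theta.mm <- optim(c(1, 1), discrepancy)$par        # numerical minimization
```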

So far I've said nothing about simulation. I've presumed that we can find
the moments as *explicit* functions of the parameters --- that we know
\( f \). However, if we can simulate our model at our favorite \( \theta \),
we can get arbitrarily accurate approximations to \( f \). Specifically, if we
can do \( b \) separate runs of the model at \( \theta \), and average the
output, we get an approximation of \( f(\theta) \), say \( \tilde{f}(\theta)
\), whose error is \( O(1/\sqrt{b}) \). We can thus look at
\[
\hat{\theta}_{MSM} = \argmin_{\theta}{\| \tilde{f}(\theta) - m \|}
\]
taking the minimizer as our **method of simulated moments**
estimate. There will now be *two* sources of error, simulation error
(shrinking with \( b \), and under our control), and statistical error
(shrinking with \( n \), and out of our control). Again, it can be better to
use a weighted norm, emphasizing the moments which can be estimated more
precisely.
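Here is a minimal R sketch of the method of simulated moments, assuming a hypothetical simulator `simulate.model(n, theta)` which returns a data vector of length `n`; the function names, the choice of moments, and the unweighted norm are all just placeholders.

```r
# Method of simulated moments: estimate the theoretical moments at each
# candidate theta by averaging over b simulation runs, then minimize the
# discrepancy with the sample moments.
msm.estimate <- function(x, simulate.model, theta0, b = 100,
                         stat = function(z) c(mean(z), mean(z^2))) {
  m <- stat(x)                                    # sample moments of the data
  objective <- function(theta) {
    # average the moments over b independent simulation runs at theta
    f.tilde <- rowMeans(replicate(b, stat(simulate.model(length(x), theta))))
    sum((f.tilde - m)^2)                          # squared Euclidean discrepancy
  }
  optim(theta0, objective)$par
}
```

Because the objective is itself random, it can help in practice to fix the random seed inside the objective function, so the optimizer sees the same simulation noise at every \( \theta \).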

There is nothing magic about moments --- all we need is a set of functionals
of the distribution which characterize the parameters. The foregoing carries
over without modification to the **generalized method of
moments**, and the **generalized method of simulated
moments**. It is often particularly useful to take as the "generalized
moments" the estimated parameters of some other statistical model, one which is
mis-specified but very easy to estimate; then we have **indirect inference**.
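As a rough sketch of indirect inference (again assuming the hypothetical `simulate.model()` from above), the "generalized moments" here are the coefficients of an auxiliary AR(2) model fit by `ar()`; the auxiliary model is just a placeholder for whatever easy-to-fit model suits the problem.

```r
# Indirect inference: match the auxiliary-model coefficients fitted to the
# data against those fitted to simulations from the model.
auxiliary <- function(z) ar(z, order.max = 2, aic = FALSE)$ar  # AR(2) coefficients
indirect.estimate <- function(x, simulate.model, theta0, b = 100) {
  beta.data <- auxiliary(x)                       # auxiliary fit to the real data
  objective <- function(theta) {
    # average the auxiliary fit over b simulated data sets at theta
    beta.sim <- rowMeans(replicate(b, auxiliary(simulate.model(length(x), theta))))
    sum((beta.sim - beta.data)^2)
  }
  optim(theta0, objective)$par
}
```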

We can push this further and approximate the likelihood itself by simulation. Observe data \( x \). Pick an initial \( \theta \). Do \( b \) independent runs at that parameter value. Use your favorite non-parametric density estimator to get \( \tilde{p}_{\theta} \). (As \( b \rightarrow \infty \), the density estimate converges on the true density, \( \tilde{p}_{\theta} \rightarrow p_{\theta} \).) Now evaluate that estimated density at \( x \), and treat \( \tilde{p}_{\theta}(x) \) as the likelihood. Adjust \( \theta \) to maximize \( \tilde{p}_{\theta}(x) \).
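A minimal sketch of this recipe, assuming i.i.d. scalar observations and the hypothetical `simulate.model()` from above, with `density()` as the non-parametric density estimator:

```r
# Simulated likelihood: estimate the density of the model at theta from b
# simulated draws, then evaluate that estimate at the observed data.
simulated.loglik <- function(theta, x, simulate.model, b = 1000) {
  sims <- replicate(b, simulate.model(1, theta))       # b draws from the model at theta
  p.tilde <- density(sims)                             # kernel density estimate of p_theta
  # evaluate the estimated density at each data point and sum the log-densities
  sum(log(approx(p.tilde$x, p.tilde$y, xout = x, rule = 2)$y))
}
# then maximize over theta, e.g.
# optim(theta0, function(th) -simulated.loglik(th, x, simulate.model))
```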

This is not very efficient --- as the dimension of \( x \) grows, \( b \) has to grow exponentially to maintain a constant accuracy. Moreover, our representation of \( \tilde{p}_{\theta} \) is going to be horridly redundant, using a huge number of degrees of freedom when we only need \( k \). There are ways one could begin to get around these issues, but that's beyond the scope of these notes.

Because we can simulate, we can look at what our model predicts for
*arbitrary* aspects of the distribution of data. Basically any
statistic of the data can thus be a test statistic. What makes a good test
statistic?

Obviously, if we adjusted the parameters to match a certain statistic, that statistic is not a good test. (Or at least, I hope it's obvious that if we adjusted the parameters to match the sample variance, we learn nothing by seeing that the simulations have about the same sample variance.) A little more subtly, nothing which depends mathematically on the statistics we matched is a good test. If we matched the first \( k \) moments, no function of those moments is suitable. (If we matched \( \overline{x} \) and \( \overline{x^2} \), we learn nothing by seeing if we can also match the sample variance.) We might turn to seeing if we can match higher moments, but those tend to be rather unstable, so such a test doesn't have much power.

The best recommendation I can make is to actually understand the process you're studying, and pick out interesting, relevant features which are logically independent of the moments you matched. This however requires subject-matter knowledge, rather than statistical or computational skill. In the absence of understanding, we can try to check the model using summary statistics other than the moments we used to estimate parameters --- e.g., if we matched the mean, how well do we do on the median? Another useful fall-back is to try re-running exploratory data analyses on simulation output, and seeing how well that matches the EDA on the data.
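A minimal sketch of the fall-back check, again with the hypothetical `simulate.model()`: compare a statistic *not* used in estimation (here the median, just as an example) to its distribution over simulations at the fitted parameters.

```r
# Simulation check of a statistic not used in fitting, via a simulation p-value.
check.statistic <- function(x, theta.hat, simulate.model, stat = median, b = 1000) {
  sim.stats <- replicate(b, stat(simulate.model(length(x), theta.hat)))
  observed <- stat(x)
  # two-sided p-value: how often do simulations deviate from their own center
  # at least as much as the observed statistic does?
  mean(abs(sim.stats - mean(sim.stats)) >= abs(observed - mean(sim.stats)))
}
```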

*Over-identifying*: If \( \theta \) is \( k \)-dimensional but we try
to match more than \( k \) moments, it's not *guaranteed* that we can
match all of them at once. The over-all discrepancy \( \| \tilde{f}(\theta) -
m \| \) should still shrink as we get more data *if* the model is right.
Under some circumstances, one can work out analytically what the distribution
of that discrepancy should be under the model (e.g., \( \chi^2 \) or related),
but you can also just simulate.
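A minimal sketch of "just simulating" the reference distribution of the discrepancy, assuming the hypothetical `simulate.model()`, data `x`, a fitted `theta.hat`, and more moments than parameters; a more careful version would re-estimate \( \theta \) on each surrogate data set.

```r
# Simulated reference distribution for the over-identifying discrepancy.
discrepancy <- function(x, theta, simulate.model, b = 100,
                        stat = function(z) c(mean(z), mean(z^2), mean(z^3))) {
  f.tilde <- rowMeans(replicate(b, stat(simulate.model(length(x), theta))))
  sqrt(sum((f.tilde - stat(x))^2))
}
observed <- discrepancy(x, theta.hat, simulate.model)
reference <- replicate(200, discrepancy(simulate.model(length(x), theta.hat),
                                        theta.hat, simulate.model))
mean(reference >= observed)   # simulation p-value for the over-identified fit
```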

*Breaking models*: If you want to know how robust some model
prediction is, you can try to deliberately search for parameters which will not
make that prediction. Again, for reasons of space I won't go into details, but see
Miller's paper. This isn't, strictly speaking, about whether the model fits
the data, though it could be adapted to that.

(Woefully incomplete, merely alphabetical order)

- Andrew Gelman, "A Bayesian Formulation of Exploratory Data Analysis and Goodness-of-fit Testing", International Statistical Review **71** (2003): 369--382
- Andrew Gelman, Xiao-Li Meng and Hal S. Stern, "Posterior predictive assessment of model fitness via realized discrepancies" (with discussion), Statistica Sinica **6** (1996): 733--807
- Andrew Gelman and CRS, "Philosophy and the Practice of Bayesian Statistics" (with discussion), British Journal of Mathematical and Statistical Psychology **66** (2013): 8--38, arxiv:1006.3868
- Christian Gouriéroux and Alain Monfort, Simulation-Based Econometric Methods (Oxford University Press, 1996)
- Christian Gouriéroux, Alain Monfort and E. Renault, "Indirect Inference", Journal of Applied Econometrics **8** (1993): S85--S118
- David R. Hunter, Steven M. Goodreau and Mark S. Handcock, "Goodness of Fit of Social Network Models", Journal of the American Statistical Association **103** (2008): 248--258
- Bruce E. Kendall, Stephen P. Ellner, Edward McCauley, Simon N. Wood, Cheryl J. Briggs, William W. Murdoch and Peter Turchin, "Population Cycles in the Pine Looper Moth: Dynamical Tests of Mechanistic Hypotheses", Ecological Monographs **75** (2005): 259--276
- John H. Miller, "Active Nonlinear Tests (ANTs) of Complex Simulation Models", Management Science **44** (1998): 820--830
- Simon N. Wood, "Statistical inference for noisy nonlinear ecological dynamic systems", Nature **466** (2010): 1102--1104
- CRS, lecture 19 (slides, R) in Chaos, Complexity and Inference
