
Consistency (in statistics) or "Probably Approximately Correct"

19 Jul 2022 09:54

In statistical jargon, an estimator of some quantity is "consistent" when it converges (in probability) on the true value as it gets more and more data. This is a mis-leading name for an important idea.

Suppose we have some quantity, canonically $\theta$, which we are trying to estimate from $n$ data points, $ X_1, \ldots, X_n \equiv X_{1:n} $. Our data are random, so any estimate based on the data, $\hat{\theta}_n$, will also be random. We say that $\hat{\theta}_n$ is "consistent" when it converges in probability on $\theta$, whatever that true value might happen to be. As you know, Babur, "converges in probability" means that you can set any margin of approximation $\epsilon > 0$, and any probability of error $\delta > 0$, and I can find an $N(\epsilon, \delta) < \infty$ which guarantees that, for $n \geq N(\epsilon, \delta)$, we have an $\epsilon$-good approximation with probability at least $1-\delta$: \[ \mathbb{P}\left(|\theta - \hat{\theta}_n| \geq \epsilon\right) \leq \delta \] (If you are the sort of person who immediately asks "What about data which doesn't arrive in the form of a sequence, but as an increasingly filled-in spatial grid?" or "What if I have only one measurement, but it's increasingly precise?", you can amuse yourself by adjusting the definition accordingly.) Turned around, convergence in probability means that you can pick any $\epsilon > 0$ and any $n$, and I can not only find a $\delta(\epsilon, n) < 1$ such that \[ \mathbb{P}\left(|\theta-\hat{\theta}_n| \geq \epsilon\right) \leq \delta(\epsilon, n) \] but also guarantee that, for every fixed $\epsilon$, \[ \delta(\epsilon, n) \rightarrow 0 \] as $n \rightarrow \infty$.
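
As a minimal simulation sketch of the definition (the Exponential(1) example, and every name in the code, is my own illustration; any consistent estimator would do), here is the sample mean playing the role of $\hat{\theta}_n$, with the error probability $\delta(\epsilon, n)$ estimated by Monte Carlo:

    # Monte Carlo sketch: estimate delta(epsilon, n) = P(|theta_hat_n - theta| >= epsilon)
    # for the sample mean of Exponential(1) data.  Illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = 1.0       # true mean of the Exponential(1) population
    epsilon = 0.1     # margin of approximation
    reps = 10_000     # replications used to estimate the error probability

    for n in [10, 100, 1000]:
        samples = rng.exponential(scale=theta, size=(reps, n))
        theta_hat = samples.mean(axis=1)   # one \hat{\theta}_n per replication
        delta_hat = np.mean(np.abs(theta_hat - theta) >= epsilon)
        print(f"n = {n:5d}   estimated P(|theta_hat - theta| >= {epsilon}) = {delta_hat:.4f}")

The estimated $\delta(\epsilon, n)$ shrinks toward zero as $n$ grows, which is the "turned around" form of the definition; re-running with a smaller $\epsilon$ just pushes the shrinking out to larger $n$.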

For doing inference, this is a pretty important notion, albeit a limited one. A more ambitious goal would require that the estimator eventually give us the truth; consistency doesn't require that. Or, we might ask that the estimator always come arbitrarily close to the truth, with enough data; consistency doesn't even require that. (*) All it requires is that we can have arbitrarily high (probabilistic) confidence of eventually getting arbitrarily close. But this is often attainable.

When computer scientists re-invented this notion in the 1980s, they gave it the much more transparent name of "probably approximately correct" (PAC). Though, being computer scientists, they often add the restriction that the number of samples required, $N$, should grow only polynomially in $1/\epsilon$ and $1/\delta$, and sometimes the further restriction that $\hat{\theta}_n$ be computable from $ X_{1:n} $ in polynomial time.
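
To make the sample-size side of that concrete with one standard special case (my choice of example, covering only the sample mean of $[0,1]$-valued data): Hoeffding's inequality gives $\mathbb{P}\left(|\theta - \hat{\theta}_n| \geq \epsilon\right) \leq 2 e^{-2 n \epsilon^2}$, so $N(\epsilon, \delta) = \lceil \log(2/\delta) / (2 \epsilon^2) \rceil$ samples suffice, which is polynomial in $1/\epsilon$ and in fact only logarithmic in $1/\delta$:

    # Hoeffding-style sample-size calculation for the mean of [0,1]-valued data.
    # Illustration only; this is one special case, not the general PAC setting.
    from math import ceil, log

    def hoeffding_sample_size(epsilon: float, delta: float) -> int:
        """Smallest n with 2*exp(-2*n*epsilon**2) <= delta."""
        return ceil(log(2.0 / delta) / (2.0 * epsilon ** 2))

    for eps, dlt in [(0.1, 0.05), (0.05, 0.05), (0.01, 0.01)]:
        print(f"epsilon = {eps}, delta = {dlt}  ->  N = {hoeffding_sample_size(eps, dlt)}")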

I am not sure how the name "consistency" came about, but my impression, backed by "Earliest Known Uses of Some of the Words of Mathematics", s.v. "Consistency", is as follows. Fisher (at least at one point) thought of estimators and other statistical procedures as involving calculating the sample or "empirical" counter-parts of population quantities, like means, medians, or correlation coefficients. For him, a procedure would be "consistent" if it gave the right answer when the population values were plugged in. This, I think we can agree, is a sensible use of the word "consistent". Fisher-consistency in turn suggests that the procedure will converge on the true value as the sample gets larger and larger, since we know (from the Glivenko-Cantelli theorem) that the empirical distribution converges on the true, population distribution. But Fisher-consistency isn't enough for convergence to the truth; the functional being plugged into would also need to be suitably continuous in the underlying distribution. It is also not necessary, since many procedures don't take the form of calculating empirical counter-parts of population quantities. So "consistency" came to mean convergence to the truth, which is what really matters for inference.
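
A sketch of the plug-in idea (the median-of-an-Exponential example and all the names here are my own, and the continuity caveat above is exactly what a rigorous version would have to check): write the target as a functional $T(F)$ of the distribution function, here $T(F) = \inf\{x : F(x) \geq 1/2\}$, and apply the same $T$ to the empirical distribution:

    # Plug-in ("Fisher-consistent") estimation: apply the population functional
    # T(F) = inf{x : F(x) >= 1/2} (the median) to the empirical CDF.
    import numpy as np

    def plug_in_median(sample: np.ndarray) -> float:
        x = np.sort(sample)
        F_hat = np.arange(1, len(x) + 1) / len(x)   # empirical CDF at the order statistics
        return x[np.searchsorted(F_hat, 0.5)]       # smallest x with F_hat(x) >= 1/2

    rng = np.random.default_rng(1)
    true_median = np.log(2.0)   # median of the Exponential(1) distribution
    for n in [10, 100, 1000, 100_000]:
        est = plug_in_median(rng.exponential(size=n))
        print(f"n = {n:6d}   plug-in median = {est:.4f}   truth = {true_median:.4f}")

Glivenko-Cantelli makes the empirical CDF converge uniformly to the true one, and the continuity of this particular $T$ at the Exponential distribution is what carries that over to the estimates.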

*: To be fair, we might ask to come arbitrarily close to the truth with probability 1, i.e., let the estimator mess up on bizarre, probability-0 exceptions. This "almost sure" convergence is often called "strong consistency". Interestingly, large deviations theory often lets us show that the error probabilities $\delta(\epsilon, n)$ I mentioned above are exponentially small in $n$, so that their sum $\sum_{n}{\delta(\epsilon, n)} < \infty$. Then the Borel-Cantelli lemma tells us that $\hat{\theta}_n$ is more than $\epsilon$ away from $\theta$ only finitely often, and so that $\hat{\theta}_n \rightarrow \theta$ with probability 1. (Indeed, the same argument works for slower-than-exponential decay of error probabilities, e.g., $\delta(\epsilon, n) = O(n^{-2})$, so long as the deviation probabilities are summable.)
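
To spell out that Borel-Cantelli step a little more than the footnote does (this is just the standard argument): suppose that for each $\epsilon > 0$ there are constants $C(\epsilon) < \infty$ and $c(\epsilon) > 0$ with $\delta(\epsilon, n) \leq C(\epsilon) e^{-c(\epsilon) n}$. Then \[ \sum_{n=1}^{\infty}{\mathbb{P}\left(|\theta - \hat{\theta}_n| \geq \epsilon\right)} \leq C(\epsilon) \sum_{n=1}^{\infty}{e^{-c(\epsilon) n}} = \frac{C(\epsilon) e^{-c(\epsilon)}}{1 - e^{-c(\epsilon)}} < \infty \] so the Borel-Cantelli lemma gives \[ \mathbb{P}\left(|\theta - \hat{\theta}_n| \geq \epsilon \mbox{ for infinitely many } n\right) = 0 \] The union of these exceptional events over $\epsilon = 1, 1/2, 1/3, \ldots$ is still a probability-0 event, since there are only countably many of them, and off that union $\hat{\theta}_n \rightarrow \theta$, i.e., $\mathbb{P}\left(\hat{\theta}_n \rightarrow \theta\right) = 1$.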

See also: Learning Theory; Statistics

