## Confidence Sets, Confidence Intervals

*31 Aug 2022 11:38*

This is, to my mind, one of the more beautiful and useful ideas
in statistics, but also one of the more tricky.
(I might admire them more *because* they are tricky.)

We have some parameter of a stochastic model we want to learn about,
proverbially \( \theta \), which lives in the parameter space \( \Theta \). We
observe random data, say \( X \). The distribution of \( X \) changes with \(
\theta \), so the probability law is \( P_{\theta} \). Our game is one of
"statistical inference", i.e., we look at \( X \) and make a guess about \(
\theta \) on that basis. One type of guess would be an exact value for \(
\theta \), a *point estimate*. But we'd basically never expect any
point estimate to be exactly right, and we'd like to be able to say something
about the uncertainty. A **level \( \alpha \) confidence set** is
a *random* set of parameter values \( C_{\alpha} \subseteq \Theta \)
which contains the true parameter value, *whatever it might happen to
be*, with probability \( \alpha \) (at least):
\[
\min_{\theta \in \Theta}{P_{\theta}(\theta \in C_{\alpha})} \geq \alpha
\]
We say that \( C_{\alpha} \) has **coverage level** \( \alpha \).

Quibbles:

- It's (pragmatically) implied that the coverage probability is \( =\alpha \) for at least some \( \theta \); if the probability is \( > \alpha \) for all \( \theta \), we say the confidence set is "conservative".
- If you know enough to quibble about "min" vs. "inf", you also know what I meant.
- \( C_{\alpha} \) is really \( C_{\alpha}(X) \), a (measurable) function of the data, but I am trying to keep the notation under control.
- In many situations there will be other ("nuisance") parameters we *don't* care about, canonically \( \psi \), and then we have to consider the worst case over both \( \theta \) and \( \psi \) simultaneously, even if we really only want to draw inferences about \( \theta \).
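To make the coverage guarantee concrete, here is a small simulation sketch (my own illustration, not part of the text above): the usual z-interval for the mean of a Gaussian with known variance, checked against the definition by brute force. The particular true \( \theta \), sample size, and Monte Carlo settings are arbitrary choices.

```python
import numpy as np

# Coverage check for the z-interval for a Gaussian mean (variance 1 known).
# All numerical settings here are arbitrary illustrations.
rng = np.random.default_rng(42)
alpha = 0.95        # desired coverage level
theta = 3.0         # the "true" parameter, unknown in practice
n, trials = 50, 10_000
z = 1.96            # standard normal 0.975 quantile, giving two-sided level 0.95

covered = 0
for _ in range(trials):
    xbar = rng.normal(theta, 1.0, size=n).mean()
    half_width = z / np.sqrt(n)
    covered += (xbar - half_width <= theta <= xbar + half_width)

print(covered / trials)  # should be close to alpha = 0.95
```

Repeating this with other values of \( \theta \) gives (essentially) the same coverage, which is the "whatever it might happen to be" part of the definition.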

#### Either the confidence set contains the truth, or we were really unlucky

Now, confidence sets are notoriously hard for learners to wrap their minds
around, but I have a way of explaining them which *seems* to work when I
teach, and so I might as well share.

When I construct a confidence set from our data, I am offering you, the
reader, a dilemma: *Either*

- the true parameter value is in the confidence set \( C_{\alpha} \), *or*
- we were very unlucky, and we got data that was very improbable (\( P \leq 1-\alpha \)) and unrepresentative under *all* values of the parameter,

*regardless* of \( \theta \).

(More strictly, there is really a *tri*-lemma here:

- the true parameter value is in the confidence set \( C_{\alpha} \), *or*
- we were very unlucky, and we got data that was very improbable (\( P \leq 1-\alpha \)) and unrepresentative under *all* values of the parameter, *or*
- the model we're using to calculate probabilities is wrong.)

#### The confidence set is every parameter value we can't reject

At this point a very reasonable question is to ask how on Earth we're supposed to find such a set. Here is one very general procedure. Suppose that we can statistically test whether \( \theta = \theta_0 \). That is, we have some function \( T(X;\theta_0) \) which returns 0 if \( X \) looks like it could have come from \( \theta=\theta_0 \), and returns 1 otherwise. More concretely, \( P_{\theta_0}{(T(X;\theta_0) = 1)} \leq 1-\alpha \), so the "false positive" rate or "false rejection" rate is at most \( 1-\alpha \). (That is, the "size" of the test is at most \( 1-\alpha \), over all parameter values.) Now building \( C_{\alpha} \) is very easy:
\[
C_{\alpha}(X) = \left\{ \theta \in \Theta ~ : ~ T(X;\theta) = 0 \right\}
\]
(Here I am being explicit that \( C_{\alpha} \) is a function of the data \( X \), which I otherwise suppress in the notation.)

In words: the confidence set consists of all the parameter values compatible with the data, i.e., all the parameter values we can't reject (at an acceptably low error rate \( 1-\alpha \)).

This construction is called "inverting the hypothesis test". Clearly, any hypothesis test gives us a confidence set, by inversion. Equally clearly, any confidence set can be used to give a hypothesis test: to test whether \( \theta = \theta_0 \), see whether \( \theta_0 \in C_{\alpha} \); the false-rejection rate of this test is, by construction, \( \leq 1-\alpha \).
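As a sketch of the inversion recipe (using an illustrative two-sided z-test; the data-generating settings and the grid are my own assumptions): sweep \( \theta_0 \) over a grid, keep every value the test fails to reject, and the result should match the textbook interval \( \bar{x} \pm z/\sqrt{n} \) up to the grid spacing.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=100)    # data; variance 1 treated as known
n, xbar, z = len(x), x.mean(), 1.96   # z: standard normal 0.975 quantile

def T(theta0):
    """Return 1 ('reject') iff theta0 looks incompatible with the data,
    per a two-sided z-test with false-rejection rate 0.05."""
    return int(abs(np.sqrt(n) * (xbar - theta0)) > z)

# Invert the test: the confidence set is every theta0 we can't reject.
grid = np.linspace(0.0, 6.0, 6001)
mask = np.array([T(t) == 0 for t in grid])
conf_set = grid[mask]

print(conf_set.min(), conf_set.max())
# endpoints agree with xbar -/+ 1.96/sqrt(n) up to the grid spacing (0.001)
```

The grid is a computational crutch, of course; for tests like this one we can solve \( T(X;\theta_0)=0 \) for \( \theta_0 \) in closed form, but the grid version works even when we can't.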

It is a little less clear that *every* confidence set can be
constructed by inverting *some* test, but it's nonetheless true, and
a textbook result (see, e.g., Casella and Berger, or Schervish). This is called the "duality between hypothesis tests and confidence sets".

#### Consistency and Evidence

Now at this point you might feel we're done, because we've got a range of
parameter values which we know is right with high probability. Of course you
might worry about what probability means about any *particular* case,
but there's no *special* difficulty about that here, as opposed to (say)
predicting the risk of rain tomorrow. But there is an additional
wrinkle here, which has to do with consistency, or *convergence* to the truth.

Suppose we get larger and larger data sets, \( X_n \) with \( n \rightarrow
\infty \). For each one, we construct a confidence set \( C_{\alpha}(X_n) \).
What we would *like* to have happen is for these sets to get smaller and
smaller, and to converge on the true value, \( C_{\alpha}(X_n) \rightarrow \{
\theta \} \). That is, if the true \( \theta \neq \theta_0 \), we'd like \(
P_{\theta}(\theta_0 \in C_{\alpha}(X_n)) \rightarrow 0 \) as \( n \rightarrow
\infty \). If we think about things in terms of the hypothesis test, we'd like
the probability of *correctly* rejecting the *wrong* parameter
values to go to 1 as we get more and more data (at constant false-rejection
probability). So: inverting a consistent hypothesis test gives us a consistent
confidence set (one which converges on the truth), and vice versa.
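A quick simulation of this (again with the illustrative z-test, and an arbitrary wrong value \( \theta_0 \) of my choosing): the probability of rejecting a fixed false \( \theta_0 \) climbs toward 1 as \( n \) grows, which is exactly what makes the inverted confidence set shrink onto the truth.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, theta0, z = 3.0, 3.5, 1.96   # theta0 is a *wrong* parameter value

for n in [10, 100, 1000]:
    rejections = 0
    for _ in range(2000):
        xbar = rng.normal(theta, 1.0, size=n).mean()
        rejections += abs(np.sqrt(n) * (xbar - theta0)) > z
    print(n, rejections / 2000)  # rejection probability -> 1 as n grows
```

Meanwhile the false-rejection probability at the *true* \( \theta \) stays pinned at 0.05, so the growing power comes for free, not at the expense of coverage.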

If we have a consistent confidence set, then, I claim, we've
got *evidence* that the true parameter value is in the set.

(When a parameter is only partially identified, then inverting consistent tests will give confidence regions converging to the set of observationally-equivalent parameter values, rather than to a single point.)

#### Confidence Intervals

I have written about confidence "sets" because the basic logic is very abstract and doesn't rely on any geometric properties of the parameter space. But in many situations the parameters we're interested in are real numbers, and the test functions \( T(X;\theta) \) are piece-wise constant in \( \theta \). This is the sort of situation where the confidence set we get by inverting a test is an interval. In a few Euclidean dimensions, we might get a ball or a box, or anyway some sort of compact, connected region. But in many of the situations I'm interested in, the parameter of interest is something like a function or a network, and "interval" just isn't going to cut it.

- See also:
- Bootstrapping, and Other Resampling Methods (for one particularly useful way of building confidence sets)
- Nonparametric Confidence Sets for Functions
- Partial Identification
- Conformal Prediction
- Gygax Tests

- Recommended, big picture but textbook treatments:
- George Casella and R. L. Berger, Statistical Inference
- Mark J. Schervish, Theory of Statistics

- Recommended, close-ups:
- Don Fraser, "Is Bayes posterior just quick and dirty confidence", Statistical Science **26** (2011): 299--316, arxiv:1112.5582 [See also the discussions by others, and Fraser's reply. My answer to the question posed in Fraser's title is "yes", or rather "YES!"]

- Recommended, big picture, historical:
- Trygve Haavelmo, "The Probability Approach in Econometrics", Econometrica **12** supplement (1944): iii--115
- Jerzy Neyman, "Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability", Philosophical Transactions of the Royal Society of London A **236** (1937): 333--380

- To read:
- Tore Schweder and Nils Lid Hjort, Confidence, Likelihood, Probability: Statistical Inference with Confidence Distributions