Attention conservation notice: Quasi-teaching note giving an economic interpretation of the Neyman-Pearson lemma on statistical hypothesis testing.
Suppose we want to pick out some sort of signal from a background of noise. As every schoolchild knows, any procedure for doing this, or test, divides the data space into two parts, the one where it says "noise" and the one where it says "signal".* Tests will make two kinds of mistakes: they can take noise to be signal, a false alarm, or can ignore a genuine signal as noise, a miss. Both the signal and the noise are stochastic, or we can treat them as such anyway. (Any determinism distinguishable from chance is just insufficiently complicated.) We want tests where the probabilities of both types of errors are small. The probability of a false alarm is called the size (or significance level) of the test; it is the measure of the "say 'signal'" region under the noise distribution. The probability of a miss, as opposed to a false alarm, has no short name in the jargon, but one minus the probability of a miss (the probability of detecting a signal when it's present) is called power.
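To make "size" and "power" concrete, here is a minimal simulation sketch. Everything specific in it is an assumption made for illustration and is not part of the argument: the noise is taken to be a standard Gaussian, the signal a Gaussian with mean 1.5, and the test is the arbitrary rule "say 'signal' when x > 1". It estimates the false-alarm and miss rates by brute force and compares them to the exact tail probabilities.

```python
import random
from statistics import NormalDist

# Illustrative assumptions, not taken from the text above:
noise = NormalDist(mu=0.0, sigma=1.0)    # distribution of pure noise
signal = NormalDist(mu=1.5, sigma=1.0)   # distribution when a signal is present
cutoff = 1.0                             # arbitrary boundary of the "say 'signal'" region


def says_signal(x):
    """The test: a fixed split of the data space into two regions."""
    return x > cutoff


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000

# False alarm: the data are noise, but the test says "signal".
false_alarms = sum(says_signal(rng.gauss(noise.mean, noise.stdev)) for _ in range(n))
# Miss: the data are signal, but the test says "noise".
misses = sum(not says_signal(rng.gauss(signal.mean, signal.stdev)) for _ in range(n))

size_hat = false_alarms / n       # estimated size (false-alarm probability)
miss_hat = misses / n             # estimated miss probability
power_hat = 1.0 - miss_hat        # estimated power

# Exact values: the mass of the region x > cutoff under each distribution.
size_exact = 1.0 - noise.cdf(cutoff)
power_exact = 1.0 - signal.cdf(cutoff)

print(f"size:  simulated {size_hat:.4f}, exact {size_exact:.4f}")
print(f"power: simulated {power_hat:.4f}, exact {power_exact:.4f}")
```

Moving `cutoff` to the right shrinks the size but also shrinks the power, which is the trade-off the rest of the note is about.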
Suppose we know that the probability density of the noise is \( p \) and that of the signal is \( q \). The Neyman-Pearson lemma, as many though not all schoolchildren know, says that then, among all tests of a given size \( s \) , the one with the smallest miss probability, or highest power, has the form "say 'signal' if \( q(x)/p(x) > t(s) \), otherwise say 'noise'," and that the threshold \( t \) varies inversely with \( s \) . The quantity \( q(x)/p(x) \) is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it's sufficiently more likely than noise.
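A small numerical check may help here. The sketch below assumes, purely for illustration, that \( p \) is a standard Gaussian and \( q \) is a Gaussian with mean `mu = 1` and the same variance. In that case \( q(x)/p(x) = \exp(\mu x - \mu^2/2) \) is increasing in \( x \), so the likelihood-ratio test is just "say 'signal' if \( x > c \)". The code compares its power to that of a two-sided test of the same size, and prints the threshold \( t(s) \) for a few sizes. The function names are made up for this example.

```python
import math
from statistics import NormalDist

std = NormalDist()  # standard Gaussian, used for both cdf and quantiles
mu = 1.0            # assumed signal mean; noise is assumed to be N(0, 1)


def lr_test(s):
    """Likelihood-ratio test of size s: say 'signal' when x > c."""
    c = std.inv_cdf(1.0 - s)               # noise puts mass s on x > c
    t = math.exp(mu * c - mu ** 2 / 2.0)   # likelihood ratio q(c)/p(c) at the boundary
    power = 1.0 - std.cdf(c - mu)          # signal mass on x > c
    return c, t, power


def two_sided_test(s):
    """A competing test of the same size: say 'signal' when |x| > c."""
    c = std.inv_cdf(1.0 - s / 2.0)         # noise puts mass s/2 in each tail
    power = (1.0 - std.cdf(c - mu)) + std.cdf(-c - mu)
    return c, power


sizes = (0.01, 0.05, 0.10)
results = []
for s in sizes:
    _, t, power_lr = lr_test(s)
    _, power_2s = two_sided_test(s)
    results.append((s, t, power_lr, power_2s))
    print(f"s={s:.2f}  t(s)={t:.3f}  power(LR)={power_lr:.3f}  power(two-sided)={power_2s:.3f}")
```

Under these assumptions the likelihood-ratio test should come out ahead at every size, and the threshold should fall as the size grows. A single example is of course not a proof of the lemma, only a sanity check on it.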
The likelihood ratio indicates how different the two distributions — the two hypotheses — are at \( x \), the data-point we observed. It makes sense that the outcome of the hypothesis test should depend on this sort of discrepancy between the hypotheses. But why the ratio, rather than, say, the difference \( q(x) - p(x) \), or a signed squared difference, etc.? Can we make this intuitive?
Start with the fact that we have an optimization problem under a constraint. Call the region where we proclaim "signal" \( R \) . We want to maximize its probability when we are seeing a signal, \( Q(R) \), while constraining the false-alarm probability, \( P(R) = s \) . Lagrange tells us that the way to do this is to maximize \( Q(R) - t[P(R) - s] \) over \( R \) and \( t \) jointly. So far the usual story; the next turn is usually "as you remember from the calculus of variations..."
Rather than actually doing math, let's think like economists. Picking the set \( R \) gives us a certain benefit, in the form of the power \( Q(R) \) , and a cost, \(tP(R) \) . (The \( ts \) term is the same for all \( R \) .) Economists, of course, tell us to equate marginal costs and benefits. What is the marginal benefit of expanding \( R \) to include a small neighborhood around the point \( x \) ? Just, by the definition of "probability density", \( q(x) \) . The marginal cost is likewise \( tp(x) \) . We should include \( x \) in \( R \) if \( q(x) > tp(x) \), or \( q(x)/p(x) > t \) . The boundary of \( R \) is where marginal benefit equals marginal cost, and that is why we need the likelihood ratio and not the likelihood difference, or anything else. (Except for a monotone transformation of the ratio, e.g. the log ratio.) The likelihood ratio threshold \( t \) is, in fact, the shadow price of statistical power.
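The marginal argument can be acted out on a grid. The sketch below chops the line into small cells, treats each cell as a good with cost `p(x) * dx` and benefit `q(x) * dx`, and fills \( R \) greedily in order of some score until the false-alarm budget \( s \) is spent. The Gaussian noise and signal, the grid, and the helper name `fill_region` are all assumptions for illustration. It ranks cells once by the ratio \( q/p \) and once by the difference \( q - p \), so the two rules can be compared at the same size.

```python
import math
from statistics import NormalDist

# Illustrative assumptions: Gaussian noise and signal, unit variance.
noise = NormalDist(0.0, 1.0)
signal = NormalDist(1.0, 1.0)
s = 0.05     # false-alarm budget (size)
dx = 0.001   # cell width

# Each cell is (cost, benefit) = (noise mass, signal mass), via the midpoint rule.
xs = [-8.0 + (i + 0.5) * dx for i in range(17_000)]   # covers [-8, 9]
cells = [(noise.pdf(x) * dx, signal.pdf(x) * dx) for x in xs]


def fill_region(score, budget):
    """Add cells in decreasing order of score until the budget would be exceeded."""
    cost = benefit = 0.0
    last_score = None
    for cell in sorted(cells, key=score, reverse=True):
        p_mass, q_mass = cell
        if cost + p_mass > budget:
            break
        cost += p_mass
        benefit += q_mass
        last_score = score(cell)   # score of the marginal (last included) cell
    return cost, benefit, last_score


def ratio(cell):
    return cell[1] / cell[0]       # benefit per unit cost, q(x)/p(x)


def difference(cell):
    return cell[1] - cell[0]       # proportional to q(x) - p(x)


size_r, power_r, t_hat = fill_region(ratio, s)
size_d, power_d, _ = fill_region(difference, s)

# For this Gaussian pair the ratio is exp(x - 1/2), so the exact boundary is known.
c = NormalDist().inv_cdf(1.0 - s)
t_exact = math.exp(c - 0.5)

print(f"ratio rule:      size {size_r:.4f}, power {power_r:.4f}")
print(f"difference rule: size {size_d:.4f}, power {power_d:.4f}")
print(f"ratio at the margin {t_hat:.3f}, exact t(s) {t_exact:.3f}")
```

Ranking by benefit per unit cost is the usual greedy rule for a divisible budget, which is the same thing as equating marginal benefit to marginal cost at the boundary. The ratio of the last cell admitted is the grid's estimate of \( t \).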
I am pretty sure I have not seen or heard the Neyman-Pearson lemma explained marginally before, but in retrospect it seems too simple to be new, so pointers would be appreciated.
Updates: Thanks to David Kane for spotting a typo.
15 July 2012: fixed a typo which had "minimize" where I meant "maximize".
*: Yes, you could have a randomized test procedure, but the situations where those actually help pretty much define "boring, merely-technical complications."
Posted at November 08, 2009 03:06