Notebooks

Deviation Inequalities in Probability Theory

19 Jan 2024 11:05

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]

The laws of large numbers say that averages taken over large samples converge on expectation values. But these are asymptotic statements which say nothing about what happens for samples of any particular size. A deviation inequality, by contrast, is a result which says that, for realizations of such-and-such a stochastic process, the sample value of this functional deviates by so much from its typical value with no more than a certain probability: \[ \Prob{\left|f(X_1, X_2, \ldots X_n) - \Expect{f}\right| > h } < r(n,h,f) ~, \] where the rate function \( r \) has to be given explicitly, and may depend on the true joint distribution of the \( X_i \) (though it's more useful if it doesn't depend on that very much). (And of course one could compare to the median rather than the mean, or just look at fluctuations above the typical value rather than to either side, or get within a certain factor of the typical value rather than a certain distance, etc.) The rate should be decreasing in \( h \) and in \( n \).

An elementary example is "Markov's inequality": If \( X \) is a non-negative random variable with a finite mean, then \[ \Prob{X > h } < \frac{\Expect{X}}{h} ~. \] One can derive many other deviation inequalities from Markov's inequality by taking \( X = g(Y) \), where \( Y \) is another random variable and \( g \) is some suitable non-negative-valued function.
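A quick sanity check, in the form of a little simulation sketch of my own (the choice of an exponential random variable with mean 1, and the particular thresholds, are arbitrary and not part of the argument): estimate the tail probability by Monte Carlo and compare it to the Markov bound \( \Expect{X}/h \).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E[X] = 1

for h in [2.0, 5.0, 10.0]:
    empirical = np.mean(x > h)       # Monte Carlo estimate of P(X > h)
    bound = np.mean(x) / h           # Markov's bound E[X]/h (sample mean standing in for E[X])
    print(f"h = {h:4.1f}: P(X > h) ~ {empirical:.5f}  <=  E[X]/h ~ {bound:.5f}")
```

As expected, the bound holds but is loose: the exponential tail decays much faster than \( 1/h \).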

For instance if \( Y \) has a finite variance \( v \), then \[ \Prob{ |Y-\Expect{Y}| > h } < \frac{v}{h^2} ~. \] This is known as "Chebyshev's inequality". (Exercise: derive Chebyshev's inequality from Markov's inequality. Since Markov was in fact Chebyshev's student, it would seem that the logical order here reverses the historical one, though guessing at priority from eponyms is always hazardous.) Suppose that \( X_1, X_2, \ldots \) are random variables with a common mean \( m \) and variance \( v \), and \( Y \) is the average of the first \( n \) of these. Then Chebyshev's inequality tells us that \[ \Prob{ |Y - m| > h } < \frac{\Var{Y}}{h^2} ~. \] If the \( X_i \) are uncorrelated (e.g., independent), then \( \Var{Y} = v/n \), so the probability that the sample average differs from the expectation by \( h \) or more goes to zero, no matter how small we make \( h \). This is precisely the weak law of large numbers. If the \( X_i \) are correlated, but nonetheless \( \Var{Y} \) goes to zero as \( n \) grows (generally because correlations decay), then we get an ergodic theorem. The rate of convergence here, however, is not very good, just \( O(n^{-1}) \).
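Another sketch of my own, illustrating Chebyshev's bound for sample means of IID variables (the Uniform(0,1) summands, with \( m = 1/2 \) and \( v = 1/12 \), and the particular \( h \) are my arbitrary choices): the bound shrinks like \( 1/n \), and the empirical deviation probability sits comfortably below it.

```python
import numpy as np

rng = np.random.default_rng(1)
m, v, h = 0.5, 1.0 / 12.0, 0.05      # mean and variance of Uniform(0,1), deviation threshold
reps = 10_000

for n in [10, 100, 1000]:
    means = rng.random((reps, n)).mean(axis=1)   # sample means of n IID uniforms
    empirical = np.mean(np.abs(means - m) > h)   # estimate of P(|Y - m| > h)
    bound = v / (n * h ** 2)                     # Chebyshev: Var[Y]/h^2 = v/(n h^2)
    print(f"n = {n:5d}: empirical {empirical:.4f}   Chebyshev bound {bound:.4f}")
```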

Since \( e^u \) is a monotonically increasing function of \( u \), for any positive \( t \), \( X > h \) if and only if \( e^{tX} > e^{th} \); applying Markov's inequality to \( e^{tX} \) then gives an exponential inequality, \[ \Prob{ X > h } < e^{-th} \Expect{ e^{tX} } ~. \] Notice that the first term in the bound does not depend on the distribution of \( X \), unlike the second term, which doesn't depend on the scale of the deviation \( h \). We are in fact free to pick whichever \( t \) gives us the tightest bound. The quantity \( \Expect{ e^{tX} } \) is called the "moment generating function" of \( X \) (let's abbreviate it \( M_X(t) \)), and it can fail to exist if some moments are infinite. (Write the power series for \( e^u \) and take expectations term by term to see all this.) It has however the very nice property that when \( X_1 \) and \( X_2 \) are independent, \( M_{X_1+X_2}(t) = M_{X_1}(t) M_{X_2}(t) \). From this it follows that if \( Y \) is the sum of \( n \) independent and identically distributed copies of \( X \), \( M_Y(t) = {(M_X(t))}^{n} \). Thus \[ \Prob{ Y > h } < e^{-th} (M_X(t))^{n} ~. \] If \( Z = Y/n \), the sample mean, this in turn gives \[ \Prob{ Z > h } = \Prob{ Y > hn } < e^{-thn} (M_X(t))^{n} ~. \] So we can get exponential rates of convergence for the law of large numbers from this. (Students who took the CMU statistics department's probability qualifying exam in 2010 now know who wrote problem 9.) Again, the restriction to IID random variables is not really essential: allowing dependence just means that the moment generating functions don't factor exactly, but if they almost factor then we can get results of the same form. (Often, we end up with \( n \) being replaced by \( n/n_0 \), where \( n_0 \) is something like how long it takes dependence to decay to trivial levels.)
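One more sketch of my own, to make the exponential bound concrete. Here the \( X_i \) are taken to be Bernoulli(\( 1/2 \)) (so \( M_X(t) = 1 - p + p e^t \)), and the bound \( e^{-thn} (M_X(t))^n \) is minimized over a grid of \( t \); the distribution, parameters, and grid are my assumptions, not anything fixed by the argument above.

```python
import numpy as np

rng = np.random.default_rng(2)
p, h, n, reps = 0.5, 0.6, 100, 200_000    # bound P(sample mean of Bernoulli(1/2) > 0.6)

# Monte Carlo estimate of the tail probability of the sample mean Z = Y/n.
z = rng.binomial(n, p, size=reps) / n
empirical = np.mean(z > h)

# The bound e^{-thn} (M_X(t))^n with M_X(t) = 1 - p + p e^t, minimized over a
# grid of t > 0; computed in log space to avoid overflow for large t.
t = np.linspace(1e-3, 5.0, 2000)
log_bound = -t * h * n + n * np.log(1 - p + p * np.exp(t))
print(f"empirical P(Z > {h}): {empirical:.5f}; best exponential bound: {np.exp(log_bound.min()):.5f}")
```

The optimized bound is already far tighter than what Chebyshev's inequality would give here, and it keeps shrinking exponentially as \( n \) grows.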

I don't feel like going into the reasoning behind the other common deviation bounds --- Bernstein, Chernoff, Hoeffding, Azuma, McDiarmid, etc. --- because I feel like I've given enough of the flavor already. I am using this notebook as, actually, a notebook, more specifically a place to collect references on deviation inequalities, especially ones that apply to collections of dependent random variables. Results here typically appeal to various notions of mixing or decay of correlations, as found in ergodic theory.

"Deviation" vs. "Concentration": A terminological note

The literature is full of references to both "deviation inequalities" and "concentration inequalities". I have never seen a rigorous definition separating the two. When I consulted four scholars more expert than I in this field (I'm just a user, not an innovator, unlike my informants), I received four distinct answers:
  1. "Concentration" inequalities get tighter as we consider more random variables, "deviation" inequalities do not. (So in this sense Markov's inequality is just a deviation inequality, and so is Chebyshev's, but Chebyshev's applied to means of uncorrelated random variables is a concentration inequality.)
  2. If we use a union bound, a "deviation" inequality becomes trivial after considering merely polynomially many events (in \( 1/h \)), while a "concentration" inequality can handle unions of exponentially many events. (Again, this would make Markov's inequality a "deviation" inequality, but the elaboration using the moment generating function a "concentration" inequality.)
  3. "Deviation" inequalities are one-sided while "concentration" inequalities are two-sided, following the usage on the first page of Massart, 2000. This would make Markov's inequality a deviation result, and Chebyshev's a concentration result.
  4. The terms are completely synonymous.
I think the lesson is that there is no firm distinction between "deviation inequalities" and "concentration inequalities", except for a vague implication that the latter are somehow stronger. (This will, of course, result in multiple people pointing me at authoritative pronouncements about a clear conceptual distinction which I should long have been aware of.)

(Thanks to Michael Kalgalenko for typo-spotting)

