Notebooks

Sufficient Statistics

Last update: 13 Apr 2026 12:57
First version: 17 November 2005; expanded with actual math 8 April 2026

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)} \]

In statistical theory, a "statistic" is a well-behaved (i.e., "measurable") function of the data, which is what's actually used in calculations or inferences, rather than the full data set. E.g., the sample mean, the sample median, the sample variance, etc. A statistic is sufficient if it is just as informative as the full data. The concept was introduced by R. A. Fisher in the 1920s, and refined by Jerzy Neyman in the 1930s. Writing $ X $ for the data, $ T = \tau(X) $ for the statistic, and $ \theta $ for the parameter, Neyman's factorization criterion is that $ T $ is sufficient when, for some functions $ g $ and $ h $, \[ \Prob{X=x;\theta} = g( \tau(x); \theta) h(x) \] That is, the only way in which the data "couples" to the parameter is via the statistic. If we fix $ x $ and let $ \theta $ vary, we get the likelihood function $ L(\theta) \equiv \Prob{X=x; \theta} $. (Capital $ L $ because this is a random function, due to the notationally-suppressed dependence on $ X $.) What the Neyman formulation of sufficiency says is that two data values $ x_1 $ and $ x_2 $ where $ \tau(x_1) = \tau(x_2) $ will lead to likelihood functions $ L_1 $ and $ L_2 $ which are proportional to each other, $ \frac{L_2(\theta)}{L_1(\theta)} = \frac{h(x_2)}{h(x_1)} $.
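For concreteness, take $ n $ i.i.d. Bernoulli($\theta$) observations: $ \Prob{X=x;\theta} = \theta^{\sum_{i}{x_i}} (1-\theta)^{n - \sum_{i}{x_i}} $, so the factorization holds with $ \tau(x) = \sum_{i}{x_i} $, $ g(t;\theta) = \theta^{t}(1-\theta)^{n-t} $, and $ h(x) = 1 $. The following is a minimal numerical sketch of the proportional-likelihoods consequence; the Bernoulli model, the grid of $ \theta $ values, and the function names are just illustrative choices, not anything canonical.

```python
import numpy as np

def bernoulli_likelihood(x, thetas):
    """Likelihood L(theta) = P(X = x; theta) for an i.i.d. Bernoulli sample x."""
    t, n = np.sum(x), len(x)   # t is the candidate sufficient statistic tau(x)
    return thetas ** t * (1 - thetas) ** (n - t)

thetas = np.linspace(0.05, 0.95, 19)   # grid of parameter values
x1 = np.array([1, 0, 1, 0, 0])         # two samples with the same value of tau ...
x2 = np.array([0, 0, 1, 1, 0])         # ... tau(x1) = tau(x2) = 2

L1 = bernoulli_likelihood(x1, thetas)
L2 = bernoulli_likelihood(x2, thetas)

# The ratio L2/L1 should be constant in theta (and equal to h(x2)/h(x1) = 1 here).
print(np.allclose(L2 / L1, 1.0))   # True
```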

Parametric sufficiency means that the statistic contains just as much information about (some) parameter of the model as the full data. More precisely: the actual data have a certain probability distribution conditional on the statistic, which in general will also involve the parameter. The statistic is sufficient if this conditional distribution is the same for all parameter values: \[ \Prob{X=x|T=t; \theta_1} = \Prob{X=x|T=t; \theta_2} \] To see this, go back to the factorization criterion, and the definition of conditional probability: \[ \begin{eqnarray} \Prob{X=x|T=t; \theta} & = & \frac{\Prob{X=x, T=t; \theta}}{\Prob{T=t; \theta}}\\ & = & \frac{\Prob{X=x; \theta}\Indicator{t = \tau(x)}}{\sum_{x^{\prime}}{\Prob{X=x^{\prime}; \theta}\Indicator{t = \tau(x^{\prime})}}}\\ & = & \frac{g(t;\theta) h(x) \Indicator{t = \tau(x)}}{\sum_{x^{\prime}}{g(t;\theta) h(x^{\prime}) \Indicator{t = \tau(x^{\prime})}}}\\ & = & \frac{h(x) \Indicator{t = \tau(x)}}{\sum_{x^{\prime}: \tau(x^{\prime})=t}{h(x^{\prime})}} \end{eqnarray} \]
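Here is a quick numerical check of this, again in the illustrative Bernoulli setting used above: conditional on the sum, every compatible binary sequence gets probability $ 1/\binom{n}{t} $, whatever $ \theta $ may be.

```python
import numpy as np
from itertools import product
from math import comb

def conditional_given_sum(n, t, theta):
    """P(X = x | sum(x) = t; theta) over all binary sequences x of length n with sum t."""
    xs = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = np.array([theta ** sum(x) * (1 - theta) ** (n - sum(x)) for x in xs])
    return probs / probs.sum()

n, t = 5, 2
p1 = conditional_given_sum(n, t, theta=0.2)
p2 = conditional_given_sum(n, t, theta=0.7)

# Both conditionals are uniform over the binom(5, 2) = 10 compatible sequences,
# and in particular do not depend on theta.
print(np.allclose(p1, p2), np.allclose(p1, 1 / comb(n, t)))   # True True
```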

This means that, conditional on the sufficient statistic, the original data are independent of the parameter, and hence give us no information about it. Predictive sufficiency is similar: given the predictively sufficient statistic, future observations can be predicted as well as if the whole past were available. Predictive sufficiency can be expressed concisely in terms of mutual information.
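One standard way of spelling that out (writing $ X^{-} $ for the past of the process, $ X^{+} $ for its future, and $ T = \tau(X^{-}) $ for the statistic): predictive sufficiency is the conditional-independence statement \[ \Prob{X^{+}=x^{+} | X^{-}=x^{-}} = \Prob{X^{+}=x^{+} | T=\tau(x^{-})} \] Since $ T $ is a function of $ X^{-} $, the data-processing inequality gives $ I[T; X^{+}] \leq I[X^{-}; X^{+}] $, with equality exactly when that conditional independence holds. So, in mutual-information terms, $ T $ is predictively sufficient if and only if \[ I[\tau(X^{-}); X^{+}] = I[X^{-}; X^{+}] \]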

A necessary statistic is one which can be computed from any sufficient statistic, without reference to the original data. (It's "necessary" in the sense that any optimal inference implicitly involves knowing the necessary statistic.) Under pretty general conditions, maximum likelihood estimates are necessary statistics, though they are not always sufficient. A minimal sufficient statistic is one which is both necessary and sufficient --- i.e., it's just as informative as the original data, but it can be computed from any other sufficient statistic; no further compression of the data is possible, without losing some information.
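A standard textbook illustration of these distinctions, using an i.i.d. Poisson model purely as an example: for $ n $ independent observations from a Poisson($\lambda$) distribution, \[ \Prob{X=x; \lambda} = \prod_{i=1}^{n}{\frac{\lambda^{x_i} e^{-\lambda}}{x_i !}} = \underbrace{\lambda^{\sum_{i}{x_i}} e^{-n\lambda}}_{g(\tau(x);\lambda)} \underbrace{\prod_{i=1}^{n}{\frac{1}{x_i !}}}_{h(x)} \] so $ \tau(x) = \sum_{i}{x_i} $ is sufficient. The maximum likelihood estimate $ \hat{\lambda} = \frac{1}{n}\sum_{i}{x_i} $ is a one-to-one function of $ \tau(x) $; because the likelihood factors through any sufficient statistic, $ \hat{\lambda} $ can be computed from any of them (so it is necessary), and being in one-to-one correspondence with $ \tau(x) $ it is also sufficient, hence minimal sufficient. The vector of order statistics, by contrast, is sufficient but not minimal, since it cannot be recovered from the sum alone.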

There is, in fact, a universal construction of the minimal sufficient statistic, which goes as follows for the parametric-sufficiency case. Say that $ x_1 $ and $ x_2 $ are equivalent, $ x_1 \simeq x_2 $, if and only if they lead to proportional likelihood functions, $ \Prob{X=x_1; \theta} \propto \Prob{X=x_2; \theta} $. Since this is an equivalence relation, it partitions the sample space into cells or equivalence classes, where every point in the cell has the same likelihood function (up to a constant), which is different from the likelihood function (or, rather, class of proportional likelihood functions) of every other cell. Write $ [ x ] $ for the equivalence class of $ x $, in symbols $ [ x ] \equiv \left\{ y ~:~ y \simeq x \right\} $. Now consider the function which takes a data point to its equivalence class, that is, the mapping $ \epsilon: x \mapsto [ x ]$. This clearly (i.e., exercise!) is sufficient, because it satisfies the Neyman factorization criterion. But it's also clear (i.e., another exercise) that any coarser partition of the sample space will not obey the factorization criterion. Equally clearly, we could map $ x $ not to its equivalence class, but to anything else in $1-1$ correspondence with the equivalence class (e.g., the likelihood function, taken up to its constant of proportionality), and that would also be a minimal sufficient statistic.
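A small computational sketch of this construction, using the same toy Bernoulli model as above (the grid of $ \theta $ values and the rounding tolerance are arbitrary choices): map each binary sequence to its normalized likelihood vector, which serves as a label for its proportionality class, and group. The resulting equivalence classes are exactly the level sets of the sum, recovering the minimal sufficient statistic automatically.

```python
import numpy as np
from itertools import product

def likelihood_vector(x, thetas):
    """P(X = x; theta) for an i.i.d. Bernoulli sample x, evaluated on a grid of thetas."""
    t, n = sum(x), len(x)
    return thetas ** t * (1 - thetas) ** (n - t)

def likelihood_class(x, thetas):
    """A label for the proportionality class of x's likelihood function: normalize the
    likelihood vector, so that proportional vectors get the same label."""
    L = likelihood_vector(x, thetas)
    return tuple(np.round(L / L.sum(), 12))   # rounded so it can be used as a dict key

thetas = np.linspace(0.05, 0.95, 19)
classes = {}
for x in product([0, 1], repeat=4):           # all binary sequences of length 4
    classes.setdefault(likelihood_class(x, thetas), []).append(x)

# Each equivalence class turns out to be one level set of the sum, i.e., one value
# of the minimal sufficient statistic for the Bernoulli model:
for members in classes.values():
    print(sorted(set(map(sum, members))), len(members))
# one line per class: [0] 1, [1] 4, [2] 6, [3] 4, [4] 1 (in some order)
```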

A lot of my work has involved describing and finding predictively sufficient statistics for time series and spatio-temporal processes. It turns out that the statistical sufficiency property gives rise to a Markov property for the statistics. (Basically, computational mechanics turns out to be about constructing predictively sufficient statistics.) So I'm very interested in sufficiency in general, and especially how it relates to Markovian representations of non-Markovian processes.

Topics of particular interest: Necessary and sufficient conditions for the existence of non-trivial sufficient statistics; dimensionality of sufficient statistics; geometric and probabilistic characterizations; decision-theoretic properties; necessary statistics; minimal sufficient statistics for transducers; connections to causal inference; relationship between sufficiency and ergodic theory; characterization of different classes of stochastic processes in terms of their sufficient statistics; exponential families.

