Notebooks

Sufficient Statistics

Last update: 13 Apr 2026 12:57
First version: 17 November 2005; expanded with actual math 8 April 2026

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)} \]

In statistical theory, a "statistic" is a well-behaved (i.e., "measurable") function of the data, which is what's actually used in calculations or inferences, rather than the full data set. E.g., the sample mean, the sample median, the sample variance, etc. A statistic is sufficient if it is just as informative as the full data. The concept was introduced by R. A. Fisher in the 1920s, and refined by Jerzy Neyman in the 1930s. Writing $ X $ for the data, $ T = \tau(X) $ for the statistic, and $ \theta $ for the parameter, Neyman's factorization criterion is that $ T $ is sufficient when, for some functions $ g $ and $ h $, \[ \Prob{X=x;\theta} = g( \tau(x); \theta) h(x) \] That is, the only way in which the data "couples" to the parameter is via the statistic. If we fix $ x $ and let $ \theta $ vary, we get the likelihood function $ L(\theta) \equiv \Prob{X=x; \theta} $. (Capital $ L $ because this is a random function, due to the notationally-suppressed dependence on $ X $.) What the Neyman formulation of sufficiency says is that two data values $ x_1 $ and $ x_2 $ where $ \tau(x_1) = \tau(x_2) $ will lead to likelihood functions $ L_1 $ and $ L_2 $ which are proportional to each other, $ \frac{L_2(\theta)}{L_1(\theta)} = \frac{h(x_2)}{h(x_1)} $.
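For concreteness, take $ n $ i.i.d. Bernoulli($\theta$) observations: $ \Prob{X=x;\theta} = \theta^{\sum_{i}{x_i}} (1-\theta)^{n - \sum_{i}{x_i}} $, so the factorization holds with $ \tau(x) = \sum_{i}{x_i} $, $ g(t;\theta) = \theta^{t}(1-\theta)^{n-t} $, and $ h(x) = 1 $. The following is a minimal numerical sketch of the proportional-likelihoods consequence; the Bernoulli model, the grid of $ \theta $ values, and the function names are just illustrative choices, not anything canonical.

```python
import numpy as np

def bernoulli_likelihood(x, thetas):
    """Likelihood L(theta) = P(X = x; theta) for an i.i.d. Bernoulli sample x."""
    t, n = np.sum(x), len(x)   # t is the candidate sufficient statistic tau(x)
    return thetas ** t * (1 - thetas) ** (n - t)

thetas = np.linspace(0.05, 0.95, 19)   # grid of parameter values
x1 = np.array([1, 0, 1, 0, 0])         # two samples with the same value of tau ...
x2 = np.array([0, 0, 1, 1, 0])         # ... tau(x1) = tau(x2) = 2

L1 = bernoulli_likelihood(x1, thetas)
L2 = bernoulli_likelihood(x2, thetas)

# The ratio L2/L1 should be constant in theta (and equal to h(x2)/h(x1) = 1 here).
print(np.allclose(L2 / L1, 1.0))   # True
```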

Parametric sufficiency means that the statistic contains just as much information about (some) parameter of the model as the full data. More precisely: the actual data have a certain probability distribution conditional on the statistic, which in general will also involve the parameter. The statistic is sufficient if this conditional distribution is the same for all parameter values: \[ \Prob{X=x|T=t; \theta_1} = \Prob{X=x|T=t; \theta_2} \] To see this, go back to the factorization criterion, and the definition of conditional probability: \[ \begin{eqnarray} \Prob{X=x|T=t; \theta} & = & \frac{\Prob{X=x, T=t; \theta}}{\Prob{T=t; \theta}}\\ & = & \frac{\Prob{X=x; \theta}\Indicator{t = \tau(x)}}{\sum_{x^{\prime}}{\Prob{X=x^{\prime}; \theta}\Indicator{t = \tau(x^{\prime})}}}\\ & = & \frac{g(t;\theta) h(x) \Indicator{t = \tau(x)}}{\sum_{x^{\prime}}{g(t;\theta) h(x^{\prime}) \Indicator{t = \tau(x^{\prime})}}}\\ & = & \frac{h(x) \Indicator{t = \tau(x)}}{\sum_{x^{\prime}: \tau(x^{\prime})=t}{h(x^{\prime})}} \end{eqnarray} \]
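Here is a quick numerical check of this, again in the illustrative Bernoulli setting used above: conditional on the sum, every compatible binary sequence gets probability $ 1/\binom{n}{t} $, whatever $ \theta $ may be.

```python
import numpy as np
from itertools import product
from math import comb

def conditional_given_sum(n, t, theta):
    """P(X = x | sum(x) = t; theta) over all binary sequences x of length n with sum t."""
    xs = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = np.array([theta ** sum(x) * (1 - theta) ** (n - sum(x)) for x in xs])
    return probs / probs.sum()

n, t = 5, 2
p1 = conditional_given_sum(n, t, theta=0.2)
p2 = conditional_given_sum(n, t, theta=0.7)

# Both conditionals are uniform over the binom(5, 2) = 10 compatible sequences,
# and in particular do not depend on theta.
print(np.allclose(p1, p2), np.allclose(p1, 1 / comb(n, t)))   # True True
```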

This means that, conditional on the sufficient statistic, the original data are independent of the parameter, and hence give us no information about it. Predictive sufficiency is similar: given the predictively sufficient statistic, future observations can be predicted as well as if the whole past were available. Predictive sufficiency can be expressed concisely in terms of mutual information.
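One standard way of spelling that out (writing $ X^{-} $ for the past of the process, $ X^{+} $ for its future, and $ T = \tau(X^{-}) $ for the statistic): predictive sufficiency is the conditional-independence statement \[ \Prob{X^{+}=x^{+} | X^{-}=x^{-}} = \Prob{X^{+}=x^{+} | T=\tau(x^{-})} \] Since $ T $ is a function of $ X^{-} $, the data-processing inequality gives $ I[T; X^{+}] \leq I[X^{-}; X^{+}] $, with equality exactly when that conditional independence holds. So, in mutual-information terms, $ T $ is predictively sufficient if and only if \[ I[\tau(X^{-}); X^{+}] = I[X^{-}; X^{+}] \]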

A necessary statistic is one which can be computed from any sufficient statistic, without reference to the original data. (It's "necessary" in the sense that any optimal inference implicitly involves knowing the necessary statistic.) Under pretty general conditions, maximum likelihood estimates are necessary statistics, though they are not always sufficient. A minimal sufficient statistic is one which is both necessary and sufficient --- i.e., it's just as informative as the original data, but it can be computed from any other sufficient statistic; no further compression of the data is possible, without losing some information.
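A standard textbook illustration of these distinctions, using an i.i.d. Poisson model purely as an example: for $ n $ independent observations from a Poisson($\lambda$) distribution, \[ \Prob{X=x; \lambda} = \prod_{i=1}^{n}{\frac{\lambda^{x_i} e^{-\lambda}}{x_i !}} = \underbrace{\lambda^{\sum_{i}{x_i}} e^{-n\lambda}}_{g(\tau(x);\lambda)} \underbrace{\prod_{i=1}^{n}{\frac{1}{x_i !}}}_{h(x)} \] so $ \tau(x) = \sum_{i}{x_i} $ is sufficient. The maximum likelihood estimate $ \hat{\lambda} = \frac{1}{n}\sum_{i}{x_i} $ is a one-to-one function of $ \tau(x) $; because the likelihood factors through any sufficient statistic, $ \hat{\lambda} $ can be computed from any of them (so it is necessary), and being in one-to-one correspondence with $ \tau(x) $ it is also sufficient, hence minimal sufficient. The vector of order statistics, by contrast, is sufficient but not minimal, since it cannot be recovered from the sum alone.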

There is, in fact, a universal construction of the minimal sufficient statistic, which goes as follows for the parametric-sufficiency case. Say that $ x_1 $ and $ x_2 $ are equivalent, $ x_1 \simeq x_2 $, if and only if they lead to proportional likelihood functions, $ \Prob{X=x_1; \theta} \propto \Prob{X=x_2; \theta} $. Since this is an equivalence relation, it partitions the sample space into cells or equivalence classes, where every point in the cell has the same likelihood function (up to a constant), which is different from the likelihood function (or, rather, class of proportional likelihood functions) of every other cell. Write $ [ x ] $ for the equivalence class of $ x $, in symbols $ [ x ] \equiv \left\{ y ~:~ y \simeq x \right\} $. Now consider the function which takes a data point to its equivalence class, that is, the mapping $ \epsilon: x \mapsto [ x ]$. This clearly (i.e., exercise!) is sufficient, because it satisfies the Neyman factorization criterion. But it's also clear (i.e., another exercise) that any coarser partition of the sample space will not obey the factorization criterion. Equally clearly, we could map $ x $ not to its equivalence class, but to anything else in $1-1$ correspondence with the equivalence class (e.g., the likelihood function, taken up to its constant of proportionality), and that would also be a minimal sufficient statistic.
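A small computational sketch of this construction, using the same toy Bernoulli model as above (the grid of $ \theta $ values and the rounding tolerance are arbitrary choices): map each binary sequence to its normalized likelihood vector, which serves as a label for its proportionality class, and group. The resulting equivalence classes are exactly the level sets of the sum, recovering the minimal sufficient statistic automatically.

```python
import numpy as np
from itertools import product

def likelihood_vector(x, thetas):
    """P(X = x; theta) for an i.i.d. Bernoulli sample x, evaluated on a grid of thetas."""
    t, n = sum(x), len(x)
    return thetas ** t * (1 - thetas) ** (n - t)

def likelihood_class(x, thetas):
    """A label for the proportionality class of x's likelihood function: normalize the
    likelihood vector, so that proportional vectors get the same label."""
    L = likelihood_vector(x, thetas)
    return tuple(np.round(L / L.sum(), 12))   # rounded so it can be used as a dict key

thetas = np.linspace(0.05, 0.95, 19)
classes = {}
for x in product([0, 1], repeat=4):           # all binary sequences of length 4
    classes.setdefault(likelihood_class(x, thetas), []).append(x)

# Each equivalence class turns out to be one level set of the sum, i.e., one value
# of the minimal sufficient statistic for the Bernoulli model:
for members in classes.values():
    print(sorted(set(map(sum, members))), len(members))
# one line per class: [0] 1, [1] 4, [2] 6, [3] 4, [4] 1 (in some order)
```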

A lot of my work has involved describing and finding predictively sufficient statistics for time series and spatio-temporal processes. It turns out that the statistical sufficiency property gives rise to a Markov property for the statistics. (Basically, computational mechanics turns out to be about constructing predictively sufficient statistics.) So I'm very interested in sufficiency in general, and especially how it relates to Markovian representations of non-Markovian processes.

Topics of particular interest: Necessary and sufficient conditions for the existence of non-trivial sufficient statistics; dimensionality of sufficient statistics; geometric and probabilistic characterizations; decision-theoretic properties; necessary statistics; minimal sufficient statistics for transducers; connections to causal inference; relationship between sufficiency and ergodic theory; characterization of different classes of stochastic processes in terms of their sufficient statistics; exponential families.

