Estimating Entropies and Informations

Last update: 21 May 2025 13:53
First version: 31 May 2007 (or earlier?)

The central mathematical objects in information theory are the entropies of random variables. These ("Shannon") entropies are properties of the probability distributions of the variables, rather than of particular realizations. (This is unlike the Boltzmann entropy of statistical mechanics, which is an objective property of the macroscopic state, at least once we fix our partition of microscopic states into macroscopic states. Confusing the two entropies is common but bad.) The question which concerns me here is how to estimate the entropy of the distribution, given a sample of realizations. The most obvious approach, if one knows the form of the distribution but not the parameters, is to estimate the parameters and then plug in. But it feels like one should be able to do this more non-parametrically. The obvious non-parametric estimator is of course just the entropy of the empirical distribution, sometimes called the "empirical entropy". However, the empirical distribution isn't always the best estimate of the true distribution (one might perfer, e.g., some kind of kernel density estimate). For that matter, we often don't really care about the distribution, just its entropy, so some more direct estimator would be nice.

What would be really nice would be to not just have point estimates but also confidence intervals. Non-parametrically, my guess is that the only feasible way to do this is bootstrapping. [Update, 2022: That guess was wrong! Moon and Hero (2014) and Kandasamy et al. (2015) both give non-parametric estimators with (asymptotic) confidence intervals that don't require bootstrapping. I particularly like the Kandasamy et al. approach, but I admit to a certain amount of home-team bias there.]

For finite alphabets, one approach would be to use something like variable length Markov chains, or causal state reconstruction, to reconstruct a get machine capable of generating the sequence. From the machine, it is easy to calculate the entropy of words or blocks of any finite length, and even the entropy rate. My experience with using CSSR is that the entropy rate estimates can get very good even when the over-all reconstruction of the structure is very poor, but I don't have any real theory on that. I suspect CSSR converges on the true entropy rate faster than do variable length Markov chains, because the former has greater expressive power, but again I don't know that for sure.

Using gzip is a bad idea (for this purpose; it works fine for data compression).

(Thanks to Sankaran Ramakrishnan for pointing out a think-o in an earlier version.)

Bootstrapping Entropy Estimates
Complexity, Entropy and the Physics of gzip
Directed Information and Transfer Entropy (where I stash references on estimating those quantities)
Nearest Neighbors

D. J. Albers, George Hripcsak, "Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series", arxiv:1110.1615
Jose M. Amigo, Janusz Szczepanski, Elek Wajnryb and Maria V. Sanchez-Vives, "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity", Neural Computation 16 (2004): 717--736 [Normally, I have strong views on using Lempel-Ziv to measure entropy rates, but here they are using the 1976 Lempel-Ziv definitions, not the 1978 ones. The difference is subtle, but important; 1978 leads to gzip and practical compression algorithms, but very bad entropy estimates; 1976 leads, as they show numerically, to reasonable entropy rate estimates, at least for some processes. Thanks to Dr. Szczepanski for correspondence about this paper.]
Thomas B. Berrett, Richard J. Samworth, Ming Yuan, "Efficient multivariate entropy estimation via k-nearest neighbour distances", Annals of Statistics (forthcoming), arxiv:1606.00304
J.-R. Chazottes and D. Gabrielli, "Large deviations for empirical entropies of Gibbsian sources", Nonlinearity 18 (2005): 2545--2563 , math.PR/0406083 [This is a very cool result which shows that block entropies, and entropy rates estimated from those blocks, obey the large deviation principle even as one lets the length of the blocks grow with the amount of data, provided the block-length doesn't grow too quickly (only logarithmically). I wish I could write papers like this.]
Simon DeDeo, Robert X. D. Hawkins, Sara Klingenstein, Tim Hitchcock. "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems", arxiv:1302.0907
John W. Fisher III, Alexander T. Ihler and Paula A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
Yoshito Hirata and Alistair I. Mees, "Estimating topological entropy via a symbolic data compression technique", Physical Review E 67 (2003): 026205
Yongmiao Hong and Halbert White, "Asymptotic Distribution Theory for Nonparametric Entropy Measures of Serial Dependence", Econometrica 73 (2005): 837--901 [JSTOR; PDF reprint via Prof. White]
Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, James M. Robins, "Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations", NIPS 2015, arxiv:1411.4342
Matthew B. Kennel, Jonathon Shlens, Henry D. I. Abarbanel and E. J. Chichilnisky, "Estimating Entropy Rates with Bayesian Confidence Intervals", Neural Computation 17 (2005): 1531--1576
Ioannis Kontoyiannis, P. H. Algoet, Yu. M. Suhov and A. J. Wyner, "Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text", IEEE Transactions on Information Theory 44 (1998): 1319--1327
Alexander Kraskov, Harald Stögbauer and Peter Grassberger, "Estimating Mutual Information", Physical Review E 69 (2004): 066138, cond-mat/0305641
Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos, Larry Wasserman, "Nonparametric Estimation of Renyi Divergence and Friends", arxiv:1402.2966
Kevin R. Moon, Alfred O. Hero III, "Multivariate f-Divergence Estimation With Confidence", arxiv:1411.2045
Ilya Nemenman, "Inference of entropies of discrete random variables with unknown cardinalities," physics/0207009
Dávid Pál, Barnabás Póczos, Csaba Szepesvári, "Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs", arxiv:1003.1954
Liam Paninski, "Estimation of Entropy and Mutual Information", Neural Computation 15 (2003): 1191--1253 [Preprint; code]
Barnabas Poczos, Jeff Schneider, "On the Estimation of alpha-Divergences", AIStats 2011
Thomas Schuermann and Peter Grassberger, "Entropy estimation of symbol sequences," Chaos 6 (1996): 414--427, cond-mat/0203436
Jonathon Shlens, Matthew B. Kennel, Henry D. I. Abarbanel, E. J. Chichilnisky, "Estimating Information Rates with Confidence Intervals in Neural Spike Trains", Neural Computation 19 (2007): 1683--1719
Jonathan D. Victor, "Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials", Neural Computation 12 (2000): 2797--2804 [Calculates the bias in the empirical entropy, as an estimator of the true entropy, under IID sampling of a discrete space. Interestingly, the first-order (1/n) term in the bias does not depend on the actual distribution, though the higher-order terms do.]
Vincent Q. Vu, Bin Yu, Robert E. Kass, "Information In The Non-Stationary Case", arxiv:0806.3978
Benjamin Weiss, Single Orbit Dynamics [Discusses procedures for non-parametrically estimating entropy of suitably ergodic sources, using just one realization of the process.]

Octavio César Mesner and CRS, "Conditional Mutual Information Estimation for Mixed Discrete and Continuous Variables with Nearest Neighbors", IEEE Transactions on Information Theory 67 (2021): 464--484, arxiv:1912.03387

Evan Archer, Il Memming Park, Jonathan Pillow, "Bayesian Entropy Estimation for Countable Discrete Distributions", arxiv:1302.0328
Andrew D. Back, Daniel Angus and Janet Wiles, "Determining the Number of Samples Required to Estimate Entropy in Natural Sequences", IEEE Transactions on Information Theory 65 (2019): 4345--4352
M. S. Baptista, E. J. Ngamga, Paulo R. F. Pinto, Margarida Brito, J. Kurths, "Kolmogorov-Sinai entropy from recurrence times", arxiv:0908.3401
Andre C. Barato, David Hartich and Udo Seifert, "Rate of Mutual Information Between Coarse-Grained Non-Markovian Variables", Journal of Statistical Physics 153 (2013): 460--478
D. Benedetto, E. Caglioti, G. Cristadoro, M. Degli Esposti, "Relative entropy via non-sequential recursive pair substitutions", arxiv:1007.3384 [The first two authors were also the lead authors of the epic-fail "Language Trees and Zipping" paper, arxiv:cond-mat/0108530, but perhaps they've improved.]
Visar Berisha, Alfred O. Hero, "Empirical non-parametric estimation of the Fisher Information", IEEE Signal Processing Letters 22 (2015): 988--992, arxiv:1408.1182
John A. Berkowitz and Tatyana O. Sharpee, "Quantifying Information Conveyed by Large Neuronal Populations", Neural Computation 31 (2019): 1015--1047
Jeremiah Birrell, Markos A. Katsoulakis, Yannis Pantazis, "Optimizing Variational Representations of Divergences and Accelerating Their Statistical Estimation", IEEE Transactions on Information Theory 68 (2022): 4553--4572
Juan A. Bonachela, Haye Hinrichsen, Miguel A. Munoz, "Entropy estimates of small data sets", arxiv:0804.4561
Salim Bouzebda and Issam Elhattab, "Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy", Electronic Journal of Statistics 5 (2011): 440--459
Clive G. Bowsher, Margaritis Voliotis, "Mutual Information and Conditional Mean Prediction Error", arxiv:1407.7165
H. Cai, S. R. Kulkarni and S. Verdu, "Universal Entropy Estimation via Block Sorting", IEEE Transactions on Information Theory 50 (2004): 1551--1561
C. J. Cellucci, A. M. Albano and P. E. Rapp, "Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms", Physical Review E 71 (2005): 066208
Ishanu Chattopadhyay, Hod Lipson, "Computing Entropy Rate Of Symbol Sources & A Distribution-free Limit Theorem", arxiv:1401.0711
J.-R. Chazottes, C. Maldonado, "Concentration bounds for entropy estimation of one-dimensional Gibbs measures", arxiv:1102.1816
J.-R. Chazottes and E. Uglade, "Entropy estimation and fluctuations of Hitting and Recurrence Times for Gibbsian sources", math.DS/0401093
Gabriela Ciuperca and Valerie Girardin, "Estimation of the Entropy Rate of a Countable Markov Chain", Communications in Statistics: Theory and Methods 36 (2007): 2543--2557
G. Ciuperca, V. Girardin and L. Lhote, "Computation and Estimation of Generalized Entropy Rates for Denumerable Markov Chains", IEEE Transactions on Information Theory 57 (2011): 4026--4034 [The estimation is just plugging in the MLE of the parameters, for finitely-parametrized chains, but they claim to show that works well]
Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information", IEEE Transactions on Neural Networks 16 (2005): 213--224
Peter Clifford and Ioana Ada Cosma, "A simple sketching algorithm for entropy estimation", AISTATS 2013 196--206, arxiv:0908.3961
Marshall Crumiller, Bruce Knight, Yunguo Yu and Ehud Kaplan, "Estimating the amount of information conveyed by a population of neurons" [PDF preprint via Dr. Kaplan]
Pawel Czyz, Frederic Grabowski, Julia E. Vogt, Niko Beerenwinkel, Alexander Marx, "Beyond Normal: On the Evaluation of Mutual Information Estimators", arxiv:2306.11078
J. M. Finn, J. D. Goettee, Z. Toroczkai, M. Anghel and B. P. Wood, "Estimation of entropies and dimensions by nonlinear symbolic time series analysis", Chaos 13 (2003): 444--456
Shuyang Gao, Greg Ver Steeg and Aram Galstyan
- "Estimating Mutual Information by Local Gaussian Approximation", UAI 2015
- "Efficient Estimation of Mutual Information for Strongly Dependent Variables", arxiv:1411.2003
Yun Gao, Ioannis Kontoyiannis, Elie Bienenstock
- "From the entropy to the statistical structure of spike trains", arxiv:0710.4117
- "Estimating the entropy of binary time series: Methodology, some theory and a simulation study", arxiv:0802.4363
M. N. Goria, N. N. Leonenko, V. V. Mergel and Pl L. Novi Inverardi, "A new class of random vector entropy estimators and its applications in testing statistical hypotheses", Journal of Nonparametric Statistics 17 (2005): 277--297
Peter Grassberger, "Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution," physics/0207023 [On Jimenez-Montano, Ebeling and Poeschel]
Jean Hausser and Korbinian Strimmer, "Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks", Journal of Machine Learning Research 10 (2009): 1469--1484
Detlef Holstein and Holger Kantz, "Optimal Markov approximations and generalized embeddings", Physical Review E 79 (2009): 056202
Wentao Huang and Kechen Zhang, "Information-Theoretic Bounds and Approximations in Neural Population Coding", Neural Computation 30 (2018): 885--944
Marcus Hutter, "Distribution of Mutual Information," cs.AI/0112019
Marcus Hutter, Marco Zaffalon, "Distribution of Mutual Information from Complete and Incomplete Data", Computational Statistics and Data Analysis 48 (2005): 633--657, arxiv:cs/0403025
Siddharth Jain, Rakesh Kumar Bansal, "On Large Deviation Property of Recurrence Times", ISIT 2013, arxiv:1303.1093
Jiantao Jiao, Kartik Venkat, Tsachy Weissman
- "Order-Optimal Estimation of Functionals of Discrete Distributions", arxiv:1406.6956
- "Maximum Likelihood Estimation of Functionals of Discrete Distributions", arxiv:1406.6959
Miguel Angel Jimenez-Montano, Werner Ebeling, and Thorsten Poeschel, "SYNTAX: A computer program to compress a sequence and to estimate its information content," cond-mat/0204134
David Källberg, Nikolaj Leonenko, Oleg Seleznjev, "Statistical Inference for Rényi Entropy Functionals", arxiv:1103.4977
Alexei Kaltchenko, "Algorithms for Estimating Information Distance with Applications to Bioinformatics and Linguistics", cs.CC/0404039
Shiraj Khan, Sharba Bandyopadhyay, Auroop R. Ganguly, Sunil Saigal, David J. Erickson, III, Vladimir Protopopescu, and George Ostrouchov, "Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data", Physical Review E 76 (2007): 026209
John M. Konstantinides and Ioannis Andreadis, "Optimal Code Length Estimates From Dependent Samples With Bounds on the Estimation Error", IEEE communications Letters 20 (2016): 2358--2361
Nikolai Leonenko, Luc Pronzato and Vippal Savani, "A class of Rényi information estimators for multidimensional densities", Annals of Statistics 36 (2008): 2153--2182, arxiv:0810.5302
Annick Lesne, Jean-Luc Blanc and Laurent Pezard, "Entropy estimation of very short symbolic sequences", Physical Review E 79 (2009): 046208
Christophe Letellier, "Estimating the Shannon Entropy: Recurrence Plots versus Symbolic Dynamics", Physical Review Letters 96 (2006): 254102
Johan Lim, "Estimation of the Entropy Functional from Dependent Samples", Communications in Statistics: Theory and Methods 36 (2007): 1577--1589
Tiger W. Lin and George N. Reeke, "A Continuous Entropy Rate Estimator for Spike Trains Using a K-Means-Based Context Tree", Neural Computation 22 (2010): 998--1024
Han Liu, Larry Wasserman and John D. Lafferty, "Exponential Concentration for Mutual Information Estimation with Application to Forests", NIPS 2012
Jay Mardia, Jiantao Jiao, Ervin Tánczos, Robert D. Nowak, Tsachy Weissman, "Concentration Inequalities for the Empirical Distribution", arxiv:1809.06522
Ilya Nemenman, William Bialek and Rob de Ruyter van Steveninck, "Entropy and information in neural spike trains: Progress on the sampling problem", 0306063
XuanLong Nguyen, Martin J. Wainwright, Michael I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization", arxiv:0809.0853
Yung-Kyun Noh, Masashi Sugiyama, Song Liu, Marthinus C. du Plessis, "Bias Reduction and Metric Learning for Nearest-Neighbor Estimation of Kullback-Leibler Divergence", Neural Computation 30 (2018): 1930--1960
Morteza Noshad, Yu Zeng, Alfred O. Hero III, "Scalable Mutual Information Estimation using Dependence Graphs", arxiv:1801.09125
Sebastian Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", arxiv:1206.4620
Leandro Pardo, Statistical Inference Based on Divergence Measures
Liam Paninski, "Estimating Entropy on m Bins Given Fewer Than m Samples", IEEE Transactions on Information Theory 50 (2004): 2200--2203
Angeliki Papana and Dimitris Kugiumtzis, "Evaluation of Mutual Information Estimators for Time Series", arxiv:0904.4753
Paulo R. F. Pinto, M. S. Baptista, Isabel S. Labouriau, "Density of first Poincaré returns, periodic orbits, and Kolmogorov-Sinai entropy", arxiv:0908.4575
Barnabas Poczos), Zoubin Ghahramani, Jeff Schneider, "Copula-based Kernel Dependency Measures", arxiv:1206.4682
G. Pola, R. S. Petersen, A. Thiele, M. P. Young and S. Panzeri, "Data-Robust Tight Lower Bounds to the Information Carried by Spike Times of a Neuronal Population", Neural Computation 17 (2005): 1962--2005
Paul K. Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, Ilya Tolstikhin, "Practical and Consistent Estimation of f-Divergences", arxiv:1905.11112
Avraham Ruderman, Mark Reid, Dario Garcia-Garcia, James Petterson, "Tighter Variational Representations of f-Divergences via Restriction to Probability Measures", arxiv:1206.4664
Luis G. Sanchez Giraldo, Murali Rao, Jose C. Principe, "Measures of Entropy from Data Using Infinitely Divisible Kernels", arxiv:1211.2459
Hailin Sang Yongli Sang Fangjun Xu, "Kernel Entropy Estimation for Linear Processes", Journal of Time Series Analysis 39 (2018): 563--591
Thomas Schürmann
- "Bias Analysis in Entropy Estimates", cond-mat/0403192
- "Scaling behaviour of entropy estimates," cond-mat/0203409
- "A Note on Entropy Estimation", Neural Computation 27 (2015): 2097--2106
J. F. Silva and S. Narayanan, "Complexity-Regularized Tree-Structured Partition for Mutual Information Estimation", IEEE Transactions on Information Theory 58 (2012): 1940--1952
Kumar Sricharan, "Ensemble Estimators for Multivariate Entropy Estimation", IEEE Transactions on Information Theory 59 (2013): 4374--4388
Kumar Sricharan, Alfred O. Hero III, "Ensemble estimators for multivariate entropy estimation", arxiv:1203.5829
Kumar Sricharan, Raviv Raich, Alfred O. Hero III
- "Empirical estimation of entropy functionals with confidence", arxiv:1012.4188
- "Estimation of Nonlinear Functionals of Densities With Confidence", IEEE Transactions on Information Theory 58 (2012): 4135--4159
Taiji Suzuki, Masashi Sugiyama and Toshiyuki Tanaka, "Mutual information approximation via maximum likelihood estimation of desnity ratio", ISIT 2009 [PDF preprint via Prof. Sugiyama]
Evgeniy Timofeev, Alexei Kaltchenko, "Nearest-neighbor Entropy Estimators with Weak Metrics", arxiv:1205.5856
Nicholas Watters and George N. Reeke, "Neuronal Spike Train Entropy Estimation by History Clustering", Neural Computation 26 (2014): 1840--1872
Yihong Wu, Pengkun Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation", arxiv:1407.0381
Alexander L Young, David B Dunson, "Efficient Entropy Estimation for Stationary Time Series", arxiv:1904.05850
Xianli Zeng, Yingcun Xia, and Howell Tong, "Jackknife approach to the estimation of mutual information", Proceedings of the National Academy of Sciences (USA) 115 (2018): 9956--9961
Zhiyi Zhang
- "Entropy Estimation in Turing's Perspective", Neural Computation 24 (2012): 1368--1389
- "A Normal Law for the Plug-in Estimator of Entropy", IEEE Transactions on Information Theory 58 (2012): 2745--2747
Zhiyi Zhang and Michael Grabchak, "Nonparametric Estimation of Küllback-Leibler Divergence", Neural Computation 26 (2014): 2570--2593 [Kullback did not spell his name with an umlaut]