Complexity, Entropy and the Physics of gzip

02 Dec 2003 13:38
Recently, a lot of papers have been published in the physics literature where people try to use standard file-compression algorithms, particularly the gzip implementation of the Lempel-Ziv algorithm, to estimate the algorithmic information content (a.k.a. Kolmogorov complexity) of data-sources, and/or their entropy rates. These are both very bad ideas. [I will add references to the offending papers later.] My apologies, before we go any further, to the actual "complexity, entropy and physics of information" group, which knows better than this!
I wrote a dead-simple program which uses an additive pseudo-random number generator to produce a sequence of ASCII characters, either 0 or 1. The size of the program, compiled on my desktop Linux box, was 15574 bytes. This includes a particular fixed seed for the random number generator. The program took one input, the number of characters N to output, which can be encoded in approximately log(N) bits. So the algorithmic information content of the output is at most 15574 + (1/8)log(N) bytes. (All logarithms are of course to base 2.)
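A minimal sketch of such a generator, in Python rather than the compiled program described above; the lags (55, 24) and the seeding scheme are hypothetical stand-ins, since the original program's internals aren't given:

```python
def binary_stream(n, lags=(55, 24), seed=12345):
    """Yield n ASCII characters, each '0' or '1', from an additive
    (lagged-Fibonacci) pseudo-random number generator."""
    # Fill the state table deterministically from the fixed seed.
    state = [(seed * (i + 1)) % (2**31) for i in range(lags[0])]
    out = []
    for _ in range(n):
        # Additive recurrence: x_t = x_{t-55} + x_{t-24} (mod 2^31).
        x = (state[-lags[0]] + state[-lags[1]]) % (2**31)
        state.append(x)
        state.pop(0)
        out.append('1' if x & 1 else '0')
    return ''.join(out)

print(binary_stream(20))
```

Because the seed and lags are fixed, the whole output is determined by N alone, which is the point of the construction.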
Having generated pseudo-random strings from this program, I then ran gzip on them. One would hope for a compression ratio of approximately 1/8, at least for large strings, because ASCII uses eight bits per character, whereas there are really only two possibilities in the output. (A better comparison would've been for me to write directly to a binary file, without using ASCII, but I'm lazy; I'll get to that later.) The original data can then be regenerated from the combination of the compressed file and the gunzip program. On my system, that program is 53716 bytes in size. So the "zipper" estimate of the algorithmic information content of the output of my pseudo-random number generator is as follows. (I have rounded up all fractions to the next largest byte, to try and reduce the advantage of the actual generating program.)
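The measurement itself is easy to reproduce as a sketch; here Python's standard library PRNG stands in for my generator, and the 53716-byte figure for gunzip is the one quoted above:

```python
import gzip
import random

GUNZIP_SIZE = 53716  # size of the gunzip binary quoted in the text

# Stand-in source: a million ASCII '0'/'1' characters.
random.seed(0)
data = ''.join(random.choice('01') for _ in range(10**6)).encode('ascii')

compressed = gzip.compress(data, compresslevel=9)
ratio = len(compressed) / len(data)          # hope for ~1/8
zipper_estimate = len(compressed) + GUNZIP_SIZE  # bytes

print(f"ratio = {ratio:.3f}, zipper estimate = {zipper_estimate} bytes")
```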
I didn't run it beyond that size, because my sysadmins would not be especially happy with my producing a gigabyte of garbage to prove this point; some other time.
What this shows is that, even with a hundred million data points, gzip is not very good at all at finding a short compression of what is, after all, a linear dynamical system with some fairly simple nonlinearities added on top. As an estimator of the Kolmogorov complexity, it sucks rocks.
Another purported use of gzip is to estimate the entropy rate of the source. This is done either by dividing the length of the gzipped output by N (call this the division method), or by taking the incremental increase in the length of the compressed file when N increases by 1 (call this the marginal method). To be fair to the marginal method, one should really look at the increase for an additional m characters, divided by m, so I do this for m = 10, m = 100 and m = 1000. My pseudo-random number generator is periodic, so its entropy rate is exactly 0. Here is how both of these methods ("division" and "marginal", respectively) did at estimating 0 bytes per character (to three decimal places, rounded):
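The two estimators can be sketched in a few lines; again the standard-library PRNG is a stand-in for my periodic generator, and the numbers it produces are illustrative, not the table of results:

```python
import gzip
import random

def gzlen(s):
    """Length in bytes of the gzipped version of string s."""
    return len(gzip.compress(s.encode('ascii'), compresslevel=9))

random.seed(0)  # stand-in source
text = ''.join(random.choice('01') for _ in range(100_000))

N = 50_000
division = gzlen(text[:N]) / N  # bytes per character
print(f"division: {division:.3f} bytes/char")

# Marginal method: growth in compressed length for m extra
# characters, divided by m.
for m in (10, 100, 1000):
    marginal = (gzlen(text[:N + m]) - gzlen(text[:N])) / m
    print(f"m={m}: marginal = {marginal:.3f} bytes/char")
```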
Note that even if one pretended this was a genuine source of independent, identically-distributed coin flips, the entropy rate should be 0.125 bytes per character, not approximately 0.16 bytes. It is exceedingly improbable that I've generated a sample which happens to look like it has an entropy rate that high. (Back-of-the-envelope large deviations calculation: the entropy of the empirical distribution relative to the true one would be on the order of 0.159 - 0.125 = 0.034 bytes per symbol, or 0.272 bits, so the probability would be approximately 2^(-0.272 N) = 10^(-0.082 N) for large N. This is a little dodgy because I should really search for the relative-entropy-minimizing distribution, but, again, this is a first cut.) Now, Lempel and Ziv did prove that, asymptotically, the compressed length divided by the number of characters will be no larger than the entropy rate. So the division method, at least, should asymptote to the entropy rate, and you can persuade yourself that this means the marginal method should then converge too. But the key word here, again and again, is asymptotically: we have here a hundred million data points, but are evidently very, very far from the asymptote. I for one do not find it plausible that vastly smaller empirical data sets reach the asymptotic regime.
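The back-of-the-envelope arithmetic can be checked directly:

```python
import math

d_bytes = 0.159 - 0.125               # relative entropy, bytes/symbol
d_bits = d_bytes * 8                  # = 0.272 bits/symbol
# 2^(-d_bits * N) = 10^(-d_bits * log10(2) * N)
exponent10 = d_bits * math.log10(2)

print(round(d_bits, 3), round(exponent10, 3))  # → 0.272 0.082
```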
There are, of course, other ways of measuring quantities like the entropy rate. As it happens, I was involved in writing a program (CSSR) for building hidden Markov models from sequential data, and part of the output of the program is a calculation of the entropy rate. Here is what CSSR gives when it's configured to just fit a simple Markov chain to the data (Lmax = 1); results aren't notably different if I allow it to seek higher-order models.
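CSSR itself is a separate program, but the Lmax = 1 case amounts to fitting an order-1 Markov chain and computing its entropy rate, which can be sketched as follows (this is my own minimal reimplementation of that one calculation, not CSSR's code):

```python
import math
from collections import Counter

def markov1_entropy_rate(s):
    """Entropy rate (bits/char) of an order-1 Markov chain fit to s:
    sum over pairs (a,b) of -P(a,b) * log2 P(b|a)."""
    pair_counts = Counter(zip(s, s[1:]))
    state_counts = Counter(s[:-1])
    h = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / (len(s) - 1)            # joint P(a, b)
        p_b_given_a = n / state_counts[a]  # transition P(b | a)
        h -= p_ab * math.log2(p_b_given_a)
    return h

# A period-2 string is perfectly predictable from one symbol of
# history, so its estimated entropy rate is exactly zero.
print(markov1_entropy_rate("01" * 500))  # → 0.0
```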
The point, of course, is not that Lempel-Ziv is a bad compression algorithm --- it's not! --- but that it was never designed to be used to estimate entropy rates (which can be done better using other tools) or algorithmic complexities (which in general cannot be done). The best screwdriver in the world makes a very bad butter-knife.
A possible objection is that a pseudorandom number generator is, of course, hard, and gives a much more convincing appearance of having no structure than one might expect from typical data-sets. So, at some convenient later time, I'm going to try to supplement these (elementary) results with ones from some non-trivial dynamical system, and show that gzip doesn't do any better there, either.
Update, 2 December 2003: There has been a surprisingly large (and positive) response to this. Matthew Berryman points out that gzip includes some metadata in its compressed files, which presumably should be discounted; very true, but it won't affect the results substantially. More seriously, he reminds me that the Lempel-Ziv convergence result requires not just that the length of the data set grow without bound, but that the length of the phrases into which it is parsed must be unbounded, and that gzip is implemented with a fixed maximum phrase length, considerably smaller than 10^8. I hadn't realized there was a limit on the phrase length, and suspect this is why the large-N entropy rate is so bad. A different implementation of Lempel-Ziv, without that limit, would do better at that, but that doesn't help the published papers using gzip, and it's still a dumb thing to do. Derek Abbott suggests that the whole issue could be avoided by replacing Shannon entropy with gz entropy, defined as the result of applying gzip. This elegant suggestion "has all the advantages of theft over honest toil", and accordingly I urge Derek to submit it to Physical Review Letters before somebody beats him to this prize. Kirk Stork wants to know what kind of results I'd get on the random bytes available from random.org, produced by sampling actual atmospheric noise; I'll post those here once I do that. Several people suggested using the logistic map at parameter settings where the true entropy rate is known (to some accuracy); this will be easy. Finally, a number of people, whose names I elide out of courtesy, suggested replicating this analysis on the output of the postmodernism generator.
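The logistic-map suggestion is easy to set up as a sketch: at r = 4, with the binary partition at x = 1/2, the partition is generating and the true entropy rate is exactly 1 bit per symbol, i.e. 0.125 bytes per character, so a gzip-based estimate has a known target (the initial condition 0.3 here is arbitrary):

```python
import gzip

def logistic_symbols(n, x=0.3):
    """Symbolic dynamics of the logistic map at r = 4, coded as
    ASCII '0'/'1' by the partition at x = 1/2."""
    out = []
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        out.append('1' if x > 0.5 else '0')
    return ''.join(out)

s = logistic_symbols(100_000)
est = len(gzip.compress(s.encode('ascii'), compresslevel=9)) / len(s)
print(f"gzip estimate: {est:.3f} bytes/char (true rate: 0.125)")
```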
Update, 16 April 2015: I should of course have given references to actual papers making these points. The first which come to mind are Khmelev and Teahan's "On an Application of Relative Entropy" (arxiv:cond-mat/0205521) and Goodman's "Extended Comment on 'Language Trees and Zipping'" (arxiv:cond-mat/0202383).
See also: Estimating Entropies and Informations