Friday, April 23, 2010

Dembski and data compression

One of Dembski's approaches to determining whether a set of data is the result of design is to ask whether it is compressible. Thus, the series of alleged dice throws 1111111111 is suspicious, while 4124262422 is not. One of Dembski's explanations is that the former can be easily compressed (e.g., with a run-length encoding, say "1*10") while the latter cannot. McGrew offers the following objection: "We can tell instantly that novels and software code are the products of intelligent agency, though neither War and Peace nor Microsoft Word is algorithmically compressible." This is embarrassingly false. Both War and Peace and Microsoft Word are algorithmically compressible. For instance, take Microsoft Word:

$ wc -c WINWORD.EXE
12314456 WINWORD.EXE
$ bzip2 < WINWORD.EXE | wc -c
6386617
So, yes, Microsoft Word compressed by about a half. How about War and Peace?
$ wc -c WarAndPeace.txt
3288738 WarAndPeace.txt
$ bzip2 WarAndPeace.txt
$ wc -c WarAndPeace.txt.bz2
884546 WarAndPeace.txt.bz2
So, the compressed version is 27% of the original. Oops! Seems like Dembski's criterion works for Word and War and Peace.
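
The same kind of check can be run on the dice throws from the start of the post. Here is a minimal sketch of a run-length encoding in Python; the function name and the comma-separated "symbol*count" output format are assumptions made for illustration, chosen to match the "1*10" notation above.

```python
from itertools import groupby


def run_length_encode(throws):
    """Encode each run of a repeated symbol as 'symbol*count', joined by commas."""
    return ",".join(
        f"{symbol}*{len(list(run))}" for symbol, run in groupby(throws)
    )


for throws in ("1111111111", "4124262422"):
    encoded = run_length_encode(throws)
    # Compare the length of the encoding with the length of the raw sequence.
    print(f"{throws} -> {encoded} ({len(encoded)} vs {len(throws)} characters)")
```

The all-ones sequence shrinks to four characters, while the encoding of the mixed sequence comes out longer than the sequence itself, which is the asymmetry the compressibility criterion relies on.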

However, we can easily make Dembski's criterion fail. I'll just do it with War and Peace because it's out of copyright.[note 1]

$ mv WarAndPeace.txt.bz2 WarAndPeace.compressed
$ bzip2 WarAndPeace.compressed
$ wc -c WarAndPeace.compressed.bz2
887810 WarAndPeace.compressed.bz2
So, when I try to compress the compressed version of War and Peace, I get a result that's about 0.4% larger. In other words, the compressed version of War and Peace fails the Dembski criterion. Obviously, compression cannot always be iterated successfully, or we'd compress every finite text to nothing. But my WarAndPeace.compressed file is just as much the product of intelligent design as WarAndPeace.txt. In fact, it is the product of a greater amount of design: there is Tolstoy's authorship, and there is Julian Seward's design of the bzip2 algorithm.

Now, could there be an algorithm that could compress my WarAndPeace.compressed file? No doubt. For instance, I could decompress it with bunzip2 and then apply a different, potentially more efficient compression algorithm, like LZMA. However, there is a limit to this approach.
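
As a sketch of that decompress-and-recompress step, assuming Python's standard bz2 and lzma modules: the helper name recompress_with_lzma and the sample bytes are hypothetical, and whether LZMA actually beats bzip2 depends on the input.

```python
import bz2
import lzma


def recompress_with_lzma(bz2_bytes):
    """Undo the bzip2 step, then compress the recovered original with LZMA."""
    original = bz2.decompress(bz2_bytes)
    return lzma.compress(original)


# Hypothetical sample input standing in for WarAndPeace.compressed.
sample = b"a stand-in paragraph of ordinary prose, repeated for illustration. " * 500
as_bz2 = bz2.compress(sample)
as_lzma = recompress_with_lzma(as_bz2)

print("bzip2 bytes:", len(as_bz2))
print("LZMA bytes: ", len(as_lzma))
```

Note that this "compressor" for the .compressed file has the bzip2 decompressor built into it, so its own description is correspondingly longer.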

6 comments:

あじ said...

So Dembski's argument depends in part on the use of a simple compression algorithm? I suppose that makes some sense, but it seems to make criteria for falsification rather fuzzy. This makes sense in an abstract mathematical sense, but how does it play out when considering actual chemical and biological composition? There are real-life limitations that raw numbers can easily over-simplify.

BTW, based on those numbers, it looks like Microsoft should be using UPX...

Alexander R Pruss said...

I don't think he's using a single simple compression algorithm. I think the idea is that something counts as compressible iff there exists a relatively simple compression algorithm that compresses it significantly. To get this right, you'd need to have some way of quantifying the complexity of the compression algorithm.

Here's roughly how I see Dembski's criterion for design: A sequence s is designed if (a) s is not naturally explainable and (b) there is some description D in a canonical language (hard to define the canonicity!) such that s is the unique sequence satisfying D and Length(D) is much less than Length(s).

With current hard drive sizes, it doesn't seem worthwhile to pack a 12 MB executable. Amusingly, the first computer my family had (an Olivetti M24) had a 20 MB hard drive.

BernardZ said...

Pi (3.14159...) would not be compressible, but it is designed.

Alexander R Pruss said...

pi is compressible: it can be wholly described by the finite sentence: "The ratio of the circumference to the diameter of a circle in Euclidean geometry." (Where "in Euclidean geometry" abbreviates all the relevant axioms and definitions.)
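
To illustrate how a short description can pin down as many digits as one likes, here is a minimal sketch in Python. It uses Machin's formula, pi = 16 arctan(1/5) - 4 arctan(1/239), which is a standard identity swapped in here in place of the geometric definition; the function names and the choice of ten guard digits are assumptions made for the example.

```python
def arctan_inv(x, unity):
    """Return arctan(1/x) scaled by `unity`, via the alternating Taylor series."""
    term = unity // x
    total = term
    x_squared = x * x
    n = 1
    sign = -1
    while term:
        term //= x_squared          # next odd power of 1/x, still scaled
        n += 2
        total += sign * (term // n)
        sign = -sign
    return total


def pi_digits(ndigits):
    """Return '3' followed by the first `ndigits` decimal digits of pi."""
    guard = 10                       # extra digits to absorb truncation error
    unity = 10 ** (ndigits + guard)
    pi_scaled = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi_scaled // 10 ** guard)


print(pi_digits(50))
```

The program text stays the same length however many digits are requested, so the description is much shorter than a sufficiently long output.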

Crude said...

Alex,

I'm curious about one thing. Dembski admits outright that his methods for inferring design can return false negatives. So wouldn't your result here be a pretty modest criticism at best?

Also, just to be picky (and I myself am very skeptical of ID, at least in the sense of being able to prove/disprove design with science): I think Dembski may say "infer" rather than "determine". It seems like a minor point, but there does seem to be a world of difference between a claim that we can infer (even strongly infer) that something was designed and a claim that we know with certainty that said thing was designed.

Alexander R Pruss said...

So maybe McGrew and I are both being unfair to Dembski here: we only have an argument against the necessity of Dembski's criterion, but what he is interested in is sufficiency.

Still, I think the McGrew point is that some incompressible sets of data are so obviously designed that Dembski can't claim that his criterion captures our intuitive methods of judgment of design.