Calgary Compression Challenge

This is a continuation (without prize money) of the Calgary Compression Challenge, a contest run by Leonid A. Broukhis from May 21, 1996 through May 21, 2016. The goal of the contest is to produce the smallest possible archive containing either the 14-file Calgary corpus or a program that, when run taking input only from other files in the archive (if any), outputs the 14-file Calgary corpus.

Leaderboard

Size   Date         Author
------ ------------ ---------------------
759881 Sep 1997     Malcolm Taylor
692154 Aug 2001     Maxim Smirnov
680558 Sep 2001     Maxim Smirnov
653720 Nov 2002     Serge Voskoboynikov
645667 Jan 10, 2004 Matt Mahoney
637116 Apr 2, 2004  Alexander Rhatushnyak
608980 Dec 31, 2004 Alexander Rhatushnyak
603416 Apr 4, 2005  Przemysław Skibiński
596314 Oct 2005     Alexander Rhatushnyak
593620 Dec 3, 2005  Alexander Rhatushnyak
589863 May 2006     Alexander Rhatushnyak
580170 Jul 2, 2010  Alexander Rhatushnyak

Rules

Submissions must improve on the previous best result by at least 1000 bytes.
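As a small illustration of this rule, the check below compares a candidate size against the previous best. The function name `qualifies` is made up for this sketch, and the figure 580170 is simply the most recent leaderboard entry.

```python
MIN_IMPROVEMENT = 1000  # bytes, as stated in the rule above


def qualifies(new_size: int, previous_best: int) -> bool:
    """Return True if new_size beats previous_best by at least 1000 bytes."""
    return previous_best - new_size >= MIN_IMPROVEMENT


CURRENT_BEST = 580170  # Jul 2, 2010 leaderboard entry

print(qualifies(579170, CURRENT_BEST))  # True: exactly 1000 bytes smaller
print(qualifies(579171, CURRENT_BEST))  # False: only 999 bytes smaller
```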

An archive is a file or set of files that may be processed by any of the following: unzip, bunzip2, unrar, or PPMd var. I. Effective June 1, 2017, 7zip and zpaq are also allowed. If submitting more than one file, then the size of the archive is calculated as the sum of the file sizes, plus the lengths of the file names, plus 4 bytes per file.
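The size calculation for a multi-file submission can be sketched as follows. This is only an illustration: the function name `archive_size` and the example file names are hypothetical, and two points are assumptions not spelled out in the rules, namely that a single-file submission is scored by its file size alone and that file name length is counted in bytes of the UTF-8 encoded name.

```python
PER_FILE_OVERHEAD = 4  # bytes charged per file in a multi-file submission


def archive_size(files: dict[str, int]) -> int:
    """Compute the scored size from a mapping of file name -> size in bytes."""
    if len(files) == 1:
        # Assumption: a single file is scored by its size alone.
        return next(iter(files.values()))
    total = 0
    for name, size in files.items():
        # Assumption: name length is measured in UTF-8 bytes.
        total += size + len(name.encode("utf-8")) + PER_FILE_OVERHEAD
    return total


# Hypothetical two-file submission: 500000 + 70000 bytes of data,
# plus 5 + 5 bytes of names, plus 2 * 4 bytes of per-file overhead.
print(archive_size({"a.zip": 500000, "d.exe": 70000}))  # 570018
```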

A program is a 32- or 64-bit Linux or Windows executable program or a source program written in C, C++, or Perl. It must run to completion in 6 hours or less on a Core i7 M620 with 4 GB memory. If the archive contains one or more other files, then the program will be run once for each file with the file name passed as a command line argument. Otherwise it will be run with no arguments. The program must not take any input other than from the file whose name is passed to it.
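The invocation rule might be sketched like this. It is not the actual test harness: the names `build_commands` and `run_submission` are invented for illustration, the 6-hour limit is assumed to cover all runs together, and closing standard input is just one assumed way of approximating the "no other input" requirement.

```python
import subprocess
import time

TIME_LIMIT_SECONDS = 6 * 60 * 60  # 6 hours, from the rule above


def build_commands(program: str, other_files: list[str]) -> list[list[str]]:
    """One command per other file in the archive, or one bare command if none."""
    if not other_files:
        return [[program]]
    return [[program, name] for name in other_files]


def run_submission(program: str, other_files: list[str]) -> None:
    """Run every command, sharing a single 6-hour budget (an assumption)."""
    deadline = time.monotonic() + TIME_LIMIT_SECONDS
    for command in build_commands(program, other_files):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("6-hour budget exhausted")
        # stdin is closed so the program cannot read from the terminal;
        # subprocess.TimeoutExpired is raised if the budget runs out mid-run.
        subprocess.run(command, stdin=subprocess.DEVNULL,
                       timeout=remaining, check=True)
```

For example, `build_commands("./decomp", ["a.dat", "b.dat"])` yields two commands, one per data file, while `build_commands("./decomp", [])` yields a single command with no arguments.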

The Calgary corpus is the following set of 14 files:

Size    Name
------- ------
111,261 bib
768,771 book1
610,856 book2
102,400 geo
377,109 news
 21,504 obj1
246,814 obj2
 53,161 paper1
 82,199 paper2
513,216 pic
 39,611 progc
 71,646 progl
 49,379 progp
 93,695 trans

The concatenation of these files in alphabetical order by name (as shown) to a single file of size 3,141,622 bytes has the following hashes (as computed by fsum):
md5       1b62b5d5c9536368b0b691fd9a41a536
sha-1     937b489e26962b094aff0547e7b34c02eac1b0f5
sha-256   3a1586fb28c0d9b767e561b604092ce73336cd3eedc5df0f29c9db1a63f0f124
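For anyone without fsum, the same check can be approximated with a short script. This is a sketch that assumes the 14 files sit unmodified in one directory under the names listed above; the function names `corpus_digests` and `matches_reference` are made up for illustration.

```python
import hashlib
from pathlib import Path

# The 14 corpus files in alphabetical order, as listed above.
CORPUS_FILES = ["bib", "book1", "book2", "geo", "news", "obj1", "obj2",
                "paper1", "paper2", "pic", "progc", "progl", "progp", "trans"]
EXPECTED_SIZE = 3141622
EXPECTED = {
    "md5": "1b62b5d5c9536368b0b691fd9a41a536",
    "sha1": "937b489e26962b094aff0547e7b34c02eac1b0f5",
    "sha256": "3a1586fb28c0d9b767e561b604092ce73336cd3eedc5df0f29c9db1a63f0f124",
}


def corpus_digests(directory, names=CORPUS_FILES):
    """Hash the concatenation of the named files; return (total size, digests)."""
    hashers = {alg: hashlib.new(alg) for alg in EXPECTED}
    total = 0
    for name in names:
        data = (Path(directory) / name).read_bytes()
        total += len(data)
        for hasher in hashers.values():
            hasher.update(data)  # feeding files in order hashes the concatenation
    return total, {alg: h.hexdigest() for alg, h in hashers.items()}


def matches_reference(directory) -> bool:
    """True if the directory's corpus matches the size and hashes listed above."""
    total, digests = corpus_digests(directory)
    return total == EXPECTED_SIZE and digests == EXPECTED
```

Calling `matches_reference("calgary")` on a directory holding a decompressed corpus (the directory name here is only an example) should return True when the output is byte-for-byte correct.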

I reserve the right to change these rules or to reject submissions not in keeping with the spirit of the contest.

Send your submission to Matt Mahoney at mattmahoneyfl at gmail.com. If accepted, I will post it and add your name to the leaderboard.

History

May 19, 2016. Created this page upon taking over the contest.

Apr. 3, 2017. Added 7zip and zpaq to the list of allowable archive formats, effective June 1, 2017. A decompression program is optional. Calgary corpus defined.