Large Text Compression Benchmark

Matt Mahoney
Last update: July 28, 2008. history
Latest version: http://cs.fit.edu/~mmahoney/compression/text.html
Mirror: http://mattmahoney.net/text/text.html (may be older)

This competition ranks lossless data compression programs by the compressed size (including the size of the decompression program) of the first 10^9 bytes of the XML text dump of the English version of Wikipedia on Mar. 3, 2006. About the test data.

The goal of this benchmark is not to find the best overall compression program, but to encourage research in artificial intelligence and natural language processing (NLP). A fundamental problem in both NLP and text compression is modeling: the ability to distinguish between high probability strings like "recognize speech" and low probability strings like "reckon eyes peach". Rationale.
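The link between modeling and compression can be sketched in a few lines: an ideal coder spends about -log2(p) bits on a string to which the model assigns probability p, so a better model gives a smaller file. The two probabilities below are made-up values chosen for illustration, not measurements from any program on this page.

```python
import math


def ideal_code_length_bits(probability: float) -> float:
    """Return -log2(p): the bits an ideal coder spends on an event of probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical model probabilities, chosen only for illustration.
P_LIKELY = 2.0 ** -20    # e.g. "recognize speech"
P_UNLIKELY = 2.0 ** -40  # e.g. "reckon eyes peach"

bits_likely = ideal_code_length_bits(P_LIKELY)
bits_unlikely = ideal_code_length_bits(P_UNLIKELY)
print(bits_likely, bits_unlikely)
```

Under these assumed probabilities, the unlikely string costs twice as many bits as the likely one.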

This is an open benchmark. Anyone may contribute results. Please read the rules first.

Compression improvements to the first 10^8 bytes are eligible for the Hutter Prize, with 50,000 euros of funding.

Benchmark Results

Compressors are ranked by the compressed size of enwik9 (10^9 bytes) plus the size of a zip archive containing the decompressor. Options are selected for maximum compression at the cost of speed and memory. Other data in the table does not affect rankings. This benchmark is for informational purposes only. There is no prize money for a top ranking. Notes about the table:

                Compression                      Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg Note
-------           -------                     ----------  -----------  -----------  -----------  ----- -----  --- --- ----
durilca4linux_3 v3 -m3600 -o14 -t2            16,292,414  129,469,384    339,990 xd 129,809,374   3624  3627 4600 PPM  18
paq8hp12any       -8                          16,230,028  132,045,026    330,700 x  132,375,726  56993       1850 CM   22
drt|lpaq9i        9                           18,065,347  144,752,858    110,149 x  144,863,007   2486  2501 1542 CM
xwrt 3.2|ppmonstr J (note 13)                 18,456,706  148,915,761     79,404 sx 148,995,165   2987  2546 1650 PPM
xwrt 3.2    -l14 -b255 -m96 -s -e40000 -f200  18,679,742  151,171,364     52,569 s  151,223,933   2537  2328 1691 CM

nanozip 0.01a     -cc -m1500m -nm -forcemem   18,723,413  152,654,332    266,797 x  152,921,129   3147  3091 1556 CM
WinRK 3.03        pwcm +td 800MB SFX          18,612,453  156,291,924     99,665 xd 156,391,589  68555        800 CM   10
ppmonstr J        -m1700 -o16                 19,055,092  157,007,383     42,019 x  157,049,402   3574 ~3600 1700 PPM
slim 23d          -m1700 -o12                 19,077,276  159,772,839     69,453 x  159,842,292   5232 ~5400 1700 PPM
bbb               m1000                       20,847,290  164,032,650     11,227 s  164,043,877   4524  2619 1401 BWT
10

paq9a             -9                          19,974,112  165,193,368     13,749 s  165,207,117   3997  4021 1585 CM
uda 0.300                                     19,393,460  166,272,261     11,264 x  166,283,525  25282 25174  180 CM
nanozipltcb                                   20,494,670  166,251,135    239,124 x  166,490,259    348   185 1729 BWT
cmm4 v0.1e        96                          20,569,034  172,669,955     31,314 x  172,701,269   2052  2056 1321 CM
ccmx 1.30         7                           20,857,925  174,142,092     15,014 x  174,157,106   1313  1338 1332 CM

epmopt|epm r9     -m800 -n20 --fixedorder:12  19,713,502  174,817,424    141,101 x  174,958,525   3179  3376  800 PPM
WinUDA 2.91       mode 3 (194 MB)             20,332,366  174,975,730     17,203 x  174,992,933  23610 23473  194 CM
dark 0.51         -b333mf                     21,169,819  175,471,417     34,797 x  175,506,214    533   453 1692 BWT
FreeArc 0.40pre-4 -mppmd:1012m:o13:r1         20,931,605  175,254,732    748,202 x  176,002,934   1175  1216 1046 PPM
hook v1.0         1700                        22,122,484  177,843,658     11,163 x  177,854,821    865   879 1739 DMC
20

7zip 4.46a        -m0=ppmd:mem=1630m:o=10 ... 21,197,559  178,965,454          0 xd 178,965,454    503   546 1630 PPM  23
M99 v2.1          e -m 239m                   21,251,170  178,910,174     68,052 x  178,978,226    713   535 1500 BWT
ash 04a           /m700 /o10                  19,963,105  180,735,542     11,137 x  180,746,679   6100  5853  700 CM
pimple2                                       20,871,457  180,251,530     78,642 x  180,330,172  18474 17992  128 CM
ocamyd LTCB 1.0   -s0 -m3                     21,285,121  182,359,986     21,030 x  182,381,016 108960 ~110000 300 DMC   6

bee 0.79 b0154    -m3 -d8                     20,975,994  182,373,904     57,046 x  182,430,950   9295  9285  512 PPM
uhbc 1.0          -m3 -b100m                  20,930,838  182,918,172     56,242 x  182,974,414   1569   809  800 BWT
ppmd J1           -m256 -o10 -r1              21,388,296  183,964,915     11,099 s  183,976,014    880   895  256 PPM
tc 5.2 dev 2                                  21,481,399  184,939,711     41,112 x  184,980,823   3637  3655  230 CM
ppmvc v1.1        -m256 -o8 -r1               21,484,294  186,208,405     25,241 x  186,233,646    898   913  272 PPM
30

chile 0.4         -b=244141                   22,218,917  186,979,614     11,530 s  186,991,144   2513   512 1426 BWT
bit 0.2b          -m lwcm -mem 9              21,971,587  189,881,180     63,665 x  189,944,845   2708  2747  1052 CM
CTXf 0.75 pre b1  -me                         22,072,783  191,008,871     57,337 x  191,066,208   1112  1037   78 PPM
rings 1.5         9                           21,848,093  191,067,972     44,565 x  191,112,537    172   189  426 BWT
m03exp 2005-02-15 32MB blocks                 21,948,192  191,250,500     44,593 x  191,295,093  ~4800 ~2100  256 BWT

Stuffit 12.0.0.17 -m=4 -l=16 -x=30            22,105,654  190,372,707  2,658,122 xd 193,030,829    628   658 1062 PPM
enc 0.15          aq                          22,156,982  195,604,166     94,888 x  195,699,054   6843  6868   50 CM
sbc 0.970r2       -ad -m3 -b63                22,470,539  197,066,203     99,094 xd 197,165,297   1733   313  224 BWT
WinRAR 3.60b3     -mc7:128t+ -sfxWinCon.sfx   22,713,569  198,454,545          0 xd 198,454,545    506   415  128 PPM
quark v0.95r beta -m1 -d25 -l8                22,988,924  198,600,023     80,264 x  198,680,287  27952   217  534 LZ77
40

bssc 0.95 alpha   -b16383                     23,117,061  201,810,709     45,489 x  201,856,198    578   217  140 BWT   4
uharc 0.6b        -mx -md32768                23,911,123  208,026,696     73,608 xd 208,100,304   1666  1330   50 PPM
GRZipII 0.2.4     -b8m                        23,846,878  208,993,966     41,645 s  209,035,611    312   216   58 BWT
4x4 0.2a          4t (grzip:m1:h18)           23,833,244  208,787,642    317,097 x  209,104,739    386   240  269 BWT
rzm 0.07h                                     24,361,070  210,126,103     17,667 x  210,143,770   2336    81  160 ROLZ

pim 2.50          best                        24,303,638  210,124,895    330,901 x  210,455,796    764  ~764   88 PPM
CTW 0.1           -d6 -n16M -f16M             23,670,293  211,995,206     43,247 x  212,038,453  19221 19524  144 CM
boa 0.58b         -m15                        24,322,643  213,845,481     55,813 x  213,901,294   3953 ~4100   17 PPM
TarsaLZP Aug  8 2007                          25,134,862  215,301,412      2,843 xd 215,304,255    249   287  341 LZP
LZPXj 1.2h        9                           25,205,783  217,880,584      4,853 s  217,885,437    783   717 1316 PPM  
50

scmppm 0.93.3     -l 9                        25,198,832  217,867,392     37,043 s  217,904,435    708   644   20 PPM
PX v1.0                                       24,971,871  219,091,398      3,054 s  219,094,452   1838  1809   66 CM    3
DGCA 1.10         default+SFX                 25,203,248  219,655,072          0 xd 219,655,072    858   270   76
Squeez 5.20.4600  sqx2.0 32MB Ultra           25,118,441  220,004,873     91,019 xd 220,095,892   2575   116  365
fpaq2                                         25,287,775  221,242,386      3,429 s  221,245,815  20183 20186  131 CM

dmc               c 1800000000                25,320,517  222,605,607      2,220 s  222,607,827    676   721 1800 DMC
balz 1.13         ex                          26,421,416  228,337,644     49,024 x  228,386,668   3700   190  206 ROLZ
lzpm 0.11         9                           26,501,542  229,083,971     46,824 x  229,130,795  15395    57  740 ROLZ
qazar 0.0pre5     -l7 -d9 -x7                 26,455,170  229,846,871     71,959 x  229,918,830   5738   903  105 LZP
flashzip 0.9      -m2 -s7 -b5                 26,737,801  230,987,395     30,052 x  231,017,447   2476    75  132 ROLZ
60

lzturbo 0.9       -59                         26,616,278  232,701,587    116,508 x  232,818,095   1420    52  248 LZ77
qc 0.050          -8                          26,763,343  232,784,501     46,100 x  232,830,601   8218  1503  151
WinTurtle 1.60    512 MB buffer               28,379,612  245,217,944    160,090 x  245,378,034    273   237  583 PPM
cabarc 1.00.0601  -m lzx:21                   28,465,607  250,756,595     51,917 xd 250,808,512   1619    15   20 LZ77
sr3                                           28,926,691  253,031,980      5,611 x  253,037,591    130   146   68 SR

bzip2 1.0.2       -9                          29,008,736  253,977,839     30,036 x  254,007,875    379   129    8 BWT
quad v1.11        -x                          29,110,579  256,145,858     13,387 s  256,159,245    956   116   34 ROLZ
WinACE            -sfx -m5 -d4096             29,481,470  257,237,710          0 xd 257,237,710   1080    77    4
tornado 0.4a      -11                         30,157,610  258,761,459     42,516 s  258,803,975    783    25 1513 LZ77
lzc v0.08         10                          30,611,315  266,565,255     11,364 x  266,576,619    302    63  550 LZ77
70

packet 0.90b      -m4 -s9                     31,208,752  273,176,127     32,305 x  273,208,432   3871    48   10 LZ77
ha 0.98           a2                          31,250,524  285,739,328     28,404 x  285,767,732   2010  1800  0.8 PPM
lcssr 0.2         -b7 -l9                     34,549,048  296,160,661      8,802 x  296,169,463   8186  8281 1184 SR
slug 1.27                                     35,093,954  309,201,454      6,809 x  309,208,263     32    28   14 ROLZ
kzip May 13 2006  /b1024                      35,016,649  310,188,783     29,184 xd 310,217,967   6063    62  121 LZ77  2

uc2 rev 3 pro     -tst                        35,384,822  312,767,652    123,031 x  312,890,683    360    63    4 LZ77
thor 0.95         e4                          35,795,184  314,092,324     49,925 x  314,142,249     64    34   16 LZP
gzip124hack 1.2.4 -9                          36,273,716  321,050,648     62,653 x  321,113,301    149    19    1 LZ77 
gzip 1.3.5        -9                          36,445,248  322,591,995     38,801 x  322,630,796    101    17  1.6 LZ77
Info-ZIP 2.3.1    -9                          36,445,373  322,592,120     57,583 x  322,649,703    104    35  0.1 LZ77
80

pkzip 2.0.4       -ex                         36,556,552  323,403,526     29,184 xd 323,432,710    171    50  2.5 LZ77
jar (Java) 0.98-gcc  cvfM                     36,520,144  323,747,582     19,054 x  323,766,636    118    95  1.2 LZ77
PeaZip            better, no integrity check  36,580,548  323,884,274    561,079 x  324,445,353    243   243    8 LZ77 20
pucrunch          -d -c0                      39,199,165  350,265,471     34,359 s  350,299,830   2649   463    2 LZ77
lzop v1.01        -9                          41,217,688  366,349,786     54,438 x  366,404,224    289    12  1.8 LZ77

lzw 0.2                                       41,960,994  367,633,910        671 s  367,634,581   3597    31   18 LZW
arbc2z                                        38,756,037  379,054,068      6,255 sd 379,060,323   2659  2674   68 PPM
lzgt1                                         43,928,072  403,385,292      2,025 sd 403,387,317   3390   865    2 LZ77
srank 1.1         -C8                         43,091,439  409,217,739      6,546 x  409,224,285     51    45    2 SR
QuickLZ 1.30b     (quick3)                    46,378,438  410,633,262     44,202 x  410,677,464     48    12    3 LZ77
90

compress 4.3d                                 45,763,941  424,588,663     16,473 x  424,605,136    103    70  1.8 LZW
BriefLZ 1.05                                  46,638,341  425,384,313      5,298 x  425,389,611     66    18    2 LZ77
lzrw3-a                                       48,009,194  438,253,704      4,750 x  438,258,454     38    17    2 LZ77
fcm1                                          45,402,225  447,305,681      1,116 s  447,306,797    228   261    1 CM1
FastLZ Jun 12 2007                            54,658,924  493,066,558      7,065 xd 493,073,623     18    13    1 LZ77

flzp v1                                       57,366,279  497,535,428      3,942 s  497,539,370     78    38    8 LZP
fpaq0f2                                       56,916,872  558,645,708      3,066 x  558,648,774    222   207  0.4 o0
ppp                                           61,657,971  579,352,307      1,472 s  579,353,779     80    59    1 SR
shindlet_fs                                   62,890,267  637,390,277      1,275 xd 637,391,552    113   103  0.6 o0
arb255                                        63,501,996  644,561,595      4,871 sd 644,566,466   2551  2574  1.6 o0
100

compact                                       63,862,371  648,370,029      3,600 sd 648,373,629    216   164  0.2 o0
barf              (2 passes)                  76,074,327  758,482,743    983,782 s  759,466,525    756    53    4 LZ77
arb2x v20060602                               99,642,909  995,674,993      3,433 sd 995,678,426   2616  2464  1.6 o0b

Fails on enwik9

                Compression                      Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg Note
-------           -------                     ----------  -----------  -----------  -----------  ----- -----  --- --- ----
hipp 5819         /o8                         20,555,951  (fails)         36,724 x                5570  5670  719 CM
XMill 0.8         -w -P -9 -m800              26,579,004  (fails)        114,764 xd                616   530  800 PPM
lzp3o2                                        33,041,439  (fails)         23,427 xd                230   270  151 LZP

Programs that properly decompress enwik8 and don't use external dictionaries are still eligible for the Hutter Prize.

Testing not yet completed

                       Compression               Compressed size      Decompressor  Total size   Time (ns/byte)
Program                  Options                enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg Note
-------                  -------              ----------  -----------  -----------  -----------  ----- -----  --- --- ----
rdmc 0.06b                                    33,181,612                                          1394  1381      DMC  6
ESP v1.92                                     36,651,292                                           223            LZ77 16

Notes about compressors

I only test the latest supported version of a program. I attempt to find the options that select the best compression, but will not generally do an exhaustive search. If an option advertises maximum compression or memory, I don't try the alternatives. If you know of a better combination, please let me know. I will select the maximum memory setting that does not cause disk thrashing, usually about 1800 MB. If the compressor is not downloadable as a zip file then I will compress the source or executable (whichever archive is smaller) plus any other needed files (dictionaries) into a single zip archive using 7zip 4.32 -tzip -mx=9. If no executable is available I will attempt to compile in C or C++ (MinGW 3.4.2, Borland 5.5 or Digital Mars), Java 1.5.0, MASM, NASM, or gas.

1. Reported by Guillermo Gabrielli, May 16, 2006. Timed on a Celeron D325 2.53 GHz, Windows XP SP2, 256 MB RAM.
2. Decompression size and time for pkzip 2.0.4. kzip only compresses.
3. Reported by Ilia Muraviev (author of PX, TC, pimple), June 10-July 18, 2006. Timed on a P4 3.0 GHz, 1GB RAM, WinXP SP2.
4. enwik9 reported by Johan de Bock, May 19, 2006. Timed on Intel Pentium-4 2.8 GHz 512KB L2-cache, 1024MB DDR-SDRAM.
5. Compressed with paq8h (VC++ compile) and decompressed with paq-8h (Intel compile of same source code). Normally compression and decompression are the same speed.
6. ocamyd 1.65.final and LTCB 1.0 reported by Mauro Vezzosi, May 30-June 20, 2006. Timed on a 1.91 GHz AMD Athlon XP 2600+, 512 MB, WinXP Pro 2002 SP2 using timer 3.01. ocamyd 1.66.final reported Feb. 3, 2007. Times are process times.
7. Under development by Mauro Vezzosi, May 24, 2006.
8. Reported by Denis Kyznetsov (author of qazar), June 2, 2006.
9. Reported by sportman, May 24, 2006. Timed on an Intel Pentium D 830 dual core 3.0GHz, 2 x 512MB DDR2-SDRAM PC4300 533MHz memory timing 4-4-4-12 (833.000KB free), Windows XP Home SP2. CPU was at 52% so apparently only one of 2 cores was used. Decompression verified on enwik8 only (not timed, about 2.5 hours). WinRK compression options: Model size 800MB, Audio model order: 255, Bit-stream model order: 27, Use text dictionary: Enabled, Fast analyses: Disabled, Fast executable code compression: Disabled.
10. Reported by Malcolm Taylor (author of WinRK), May 24, 2006. Timed on an Athlon X2 4400+ with 2GB, running WinXP 64. Decompression not tested. Decompressor size is based on SFX stub size reported by Artyom (A.A.Z.), Sept. 2, 2007, although it was not tested this way.
11. Reported by sportman, May 25, 2006. CPU as in note 9.
12. Reported by sportman, May 30, 2006. CPU as in 9 (50% utilized).
13. xwrt 3.2 options are -2 -b255 -m250 -s -f64. ppmonstr J options are -o10 -m1650.
14. Reported by Michael A Maniscalco, June 15, 2006.
15. Reported by Jeremiah Gilbert on the Hutter group, Aug. 18, 2006. Tested under Linux on a dual Xeon 1.6 GHz(lv) (overclocked to 2.13 GHz) with 2 GB memory. Time is user+sys (real=196500 ns/byte).
16. Reported by Anthony Williams, Aug. 19-22, 2006. Timed on a 2.53 GHz Pentium 4 with 512 MB under WinXP Home SP2.
17. Tested Aug. 20, 2006 under Ubuntu Linux 2.6.15 on a 2.2 GHz Athlon-64 with 2 GB memory. Time is approximate wall time due to disk thrashing. User+sys time is 153600 ns/byte compress, 148650 decompress.
18. Reported by Dmitry Shkarin (author of durilca4linux), Aug. 22-23, 2006 for durilca4linux_1; and Oct. 16-18, 2006 for durilca4linux_2. 3 GB memory usage is RAM + swap. Tested on AMD Athlon X2 4400+, 2.22 GHz, 2 GB memory under SuSE Linux AMD64 v10.0. durilca4linux_3 reported Feb. 21, 2008 using 4 GB RAM + 1 GB swap. v2 reported Apr. 22, 2008. v3 reported May 22, 2008.
19. enwik8 confirmed by sportman, Sept. 20, 2006. Compression time 61480 ns/byte timed on a 2 x dual core (only one core active) Intel Woodcrest 2GHz with 1333MHz fsb and 4GB 667MHz CL5 memory under SiSoftware Sandra Lite 2007.SP1 (10.105). Dhrystone ALU 37,014 MIPS, Whetstone iSSE3 25,393 MFLOPS, Integer x8 iSSE4 220,008 it/s, Floating-point x4 iSSE2 119,227 it/s.
20. Reported by Giorgio Tani (author of PeaZip) on Nov. 10, 2006. Tested on a MacBook Pro, Intel T2500 Core Duo CPU (one core used), with 512 MB memory under WinXP SP2. Time is combined compression and decompression.
21. enwik9 -8 reported by sportman, Dec. 12-13, 2006. Hardware as note 19. enwik9 decompression not verified. paq8hp7 -8 enwik8 compression was reported as 16,417,650 (4 bytes longer; the size depends on the length of the input filename, which was enwik8.txt rather than enwik8). I verified enwik8 -7 and -8 decompression.
22. paq8hp8 -8 enwik9 reported by sportman, Jan. 18, 2007. paq8hp10 -8 enwik9 on Apr. 2, 2007. paq8hp11 -8 enwik9 on May 10, 2007. paq8hp12 -8 enwik8/9 on May 20, 2007. Hardware as in note 19. Decompression verified for enwik8 only.
23. 7zip 4.46a options were -m0=PPMd:mem=1630m:o=10 -sfx7xCon.sfx
24. paq8o8-intel (Intel compile of paq8o8) -1, paq8o8z-jun7 (DOS port of paq8o8) -1 reported by Rugxulo on Jun 10, 2008. Timed on an AMD64x2 TK-53 Tyler 1.6 GHz laptop with Vista Home Premium SP1. paq8o8z tested under FreeDOS.
25. paq8o8z -1 enwik8 (DJGPP compile) reported by Rugxulo on Jun 17, 2008. Tested on a 2.52 GHz P4 Northwood, no HTT, WinXP Home SP2.

I have not verified results submitted by others. Timing information, when available, may vary widely depending on the test machine used.

About the Compressors

The numbers in the headings are the compression ratios on enwik9.
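As a rough illustration of how the ranking figure and the section headings relate, the sketch below adds the compressed enwik9 size to the zipped decompressor size and divides by 10^9. The function names are hypothetical. Truncating (rather than rounding) to four digits reproduces the .1298 and .1323 headings in this section; that the headings are truncated is an inference from those two values, not something the page states.

```python
ENWIK9_BYTES = 10 ** 9  # size of the uncompressed test file


def total_size(compressed_enwik9: int, decompressor_zip: int) -> int:
    """Ranking key: compressed enwik9 plus the zipped decompressor."""
    return compressed_enwik9 + decompressor_zip


def heading_ratio(total: int) -> str:
    """Format total / 10^9 as a four-digit heading, truncating extra digits."""
    return "." + f"{total * 10_000 // ENWIK9_BYTES:04d}"


# Sizes taken from the results table above.
durilca_total = total_size(129_469_384, 339_990)
paq_total = total_size(132_045_026, 330_700)
print(durilca_total, heading_ratio(durilca_total))
print(paq_total, heading_ratio(paq_total))
```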

.1298 durilca

durilca and durilca'light 0.5 by Dmitry Shkarin (Apr. 1, 2006) are closed source, experimental command line file compressors based on ppmd/ppmonstr with filters for text, exe, and data with fixed length records (wav, bmp, etc). durilca'light is a faster version with less compression. Unfortunately both crash on enwik9. Decompression is verified on enwik8.

The -m700 option selects 700 MB of memory. (It appears to use substantially more for enwik9 according to Windows task manager.) -o12 selects PPM order 12 (optimal for enwik9 with -t0). -t0 (default) turns off text modeling, which hurts compression but is necessary to compress enwik9 (although decompression still crashes). -t2(3) turns on text preprocessing (dictionary; thus the increased decompressor size). -t2 also supports 3 additive flags (4, 8, 16) which have no effect on this data, thus -t2(31) or -t2 (default is 31) give the same compression as -t2(3).

durilca 0.5(Hutter) was released 1457Z Aug. 16, 2006. It does not use external dictionaries. When run with 1 GB memory (-m700), -o13 is optimal. With 2 GB (-m1650), -o21 is optimal. The unzipped .exe file is 86,016 bytes.

durilca4linux_1 (0825Z Aug 23 2006) is a Linux version of durilca 0.5(Hutter) which successfully compresses enwik9 and decompresses with UnDur (23,375 bytes zipped, 42,065 bytes uncompressed). All versions of durilca require memory specified by -m plus memory to read the input file into memory. In Windows, this exceeds the 2 GB process limit regardless of available RAM and swap. Thus, enwik9 compresses only under Linux with 2 GB real memory and 1 GB additional swap. The -o12 option is optimal for enwik9 (tested under 64 bit SuSE 10.0 by the author), -o24 for enwik8 (verified by me under 64 bit Ubuntu 2.6.15).

durilca4linux_2 (Oct. 16, 2006) is a closed source Linux version specialized for this benchmark. It includes a warning that use on other files may cause data loss. It requires AMD64 Linux and 3 GB of memory (2 GB for enwik8). The decompressor files (EnWiki.dur and UnDur) are contained within a 241,322 byte zip file in the rar distribution. To compress:

  ./DURILCA d EnWiki.dur
  ./DURILCA e -m1800 -o10 -t2 enwik9

To decompress:

  ./UnDur EnWiki.dur
  ./UnDur enwik9.dur

The first step extracts a compressed dictionary. It is organized in a similar manner to paq8hp2-paq8hp5 in that syntactically related words and words with the same suffix are grouped together. Results are reported by the author under SuSE Linux 10.0. I verified enwik8 only (6480 ns/b to compress on a 2.2 GHz Athlon 64 with 2 GB memory under Ubuntu Linux). enwik9 caused disk thrashing.

durilca4linux_3 (dictionary version v1) was released Feb. 21, 2008. Like version 2, it requires extraction of EnWiki.dur before compressing or decompressing, and may not work with files other than enwik8 and enwik9. As tested, requires 64-bit Linux, 4 GB RAM, and 5 GB RAM+swap.

undur3 v2 contains an improved dictionary (version v2), released Apr. 22, 2008, for DURILCA4Linux_3. The compression and decompression programs are the same. The decompression program UnDur (Linux executable) is included. To compress, download durilca4linux_3 and replace the dictionary (EnWiki.dur) with this one. The options are -m3600 (3600 MB memory), -o14 (order 14 PPM), -t2 (text model 2).

undur3 v3, released May 22, 2008, uses an improved dictionary but the same compressor and decompressor as v1 and v2. The dictionary contains 123,995 lowercase words separated by NUL bytes. Of these, 5579 words occur more than once (wasted space?). I tested options -m1500 under Ubuntu Linux with 2 GB memory. At -m1500, top reports 2157 MB virtual memory and 1894 MB real memory. -m1600 caused disk thrashing.
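A duplicate count like the one above could be reproduced along the following lines, assuming the extracted dictionary is available as a plain byte string of NUL-separated words. The sample bytes are invented for illustration, and the function name is hypothetical; the real EnWiki.dur file is compressed and would first have to be extracted.

```python
from collections import Counter


def duplicate_words(dictionary_bytes: bytes) -> dict:
    """Map each word that occurs more than once to its number of occurrences."""
    # Split on NUL and drop empty pieces (e.g. after a trailing NUL).
    words = [w for w in dictionary_bytes.split(b"\x00") if w]
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > 1}


# Invented sample data, not taken from the real dictionary.
sample = b"the\x00of\x00and\x00the\x00in\x00of\x00"
dups = duplicate_words(sample)
print(len(dups), "words occur more than once")
```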

                Compression                      Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp   Notes
-------           -------                     ----------  -----------  -----------  -----------  ----- -----   -----  
durilca'light 0.5   -m650  -o12               21,089,993  178,562,475  1,495,422 x  180,057,897   1227 (fails)  
durilca 0.5         -m700  -o12 -t0           19,227,202  162,117,578     74,292 x  162,191,870   4140 (fails)
                    -m800  -o128              19,321,003  164,298,178     74,292 x  164,372,470   7718 (fails)
                    -m700  -o12 -t2(3)        18,520,589    (fails)    1,507,312 x                3330  3940
durilca 0.5(Hutter) -m700  -o13 -t2           18,128,339    (fails)       77,295 x                5905
                    -m1650 -o21 -t2           17,958,687    (fails)       77,295 x                6140  6140
durilca4linux_1     -m700  -o13 -t2           18,128,334                  23,375 xd               5950  5880
                    -m1750 -o12 -t2           18,027,888  146,521,559     23,375 xd 146,544,934   5500  7301    18
                    -m1750 -o24 -t2           17,949,422                  23,375 xd               6190  6780
durilca4linux_2     -m1800 -o10 '-t2(11)'     17,002,831  136,536,189    241,322 xd 136,777,511   4249  4827    18
                    -m1800 -o10 -t2           16,998,300  136,596,818    241,322 xd 136,838,140   4405  4894    18
durilca4linux_3 v1  -m3600 -o14 -t2           16,356,063  129,933,145    345,957 xd 130,279,102   3649  3715    18
                    -m1200 -o32 -t2           16,348,796                                          4170  4178    18
durilca4linux_3 v2  -m3600 -o14 -t2           16,323,581  129,670,441    344,525 xd 130,014,966   3628  3639    18
                    -m1200 -o32 -t2           16,316,255                                          4148  4157    18
durilca4linux_3 v3  -m3600 -o14 -t2           16,292,414  129,469,384    339,990 xd 129,809,374   3624  3627    18
                    -m1200 -o32 -t2           16,285,285                                          4135  4138    18
                    -m1500 -o6  -t2           16,517,051  133,674,565                             3852
                    -m1500 -o7  -t2           16,418,799  132,239,495                             4006
                    -m1500 -o8  -t2           16,368,632  131,722,213                             4149
                    -m1500 -o9  -t2           16,335,259  131,549,901    339,990 xd 131,889,891   4261  4344
                    -m1500 -o10 -t2           16,316,775  131,574,739                             4405
                    -m1500 -o11 -t2           16,306,086  131,707,901                             4544
                    -m1500 -o12 -t2           16,299,411  131,807,298                             4554
                    -m1500 -o14 -t2           16,292,414  132,238,662                             4763
                    -m1500 -o16 -t2           16,289,512  132,516,825                             4879
                    -m1500 -o32 -t2           16,285,285  134,238,759                             5440

.1323 paq8hp12any

paq8hp12any is the top ranked program of the PAQ series of context mixing compressors, described below in chronological order. All can be found at this link, except as noted. paq8hp* series compressors can also be found here. All programs are free, GPL open source, command line archivers. Most take a single option controlling memory usage.

p5, p6, and p12 (Matt Mahoney, May 13, 2000) use a neural network with 256K or 4M inputs, no hidden layer and a single output to predict the next bit of input, given hashes of various contexts to select active inputs. The output is arithmetic coded. p5 uses 1 MB memory and context orders 0 to 3. p6 uses 16 MB and orders 0-5. p12 uses 16 MB, orders 1-4 and word-level orders 0-1 as an optimization for text. The programs take no options. The algorithm is described in M. Mahoney, Fast Text Compression with Neural Networks, Proc. AAAI FLAIRS, Orlando, 2000 (C) 2000, AAAI.
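The general idea of such a predictor can be sketched as follows. This is not the p5 source: the CRC32 hash, the learning rate, and the way the partial byte is folded into the context are all assumptions made for the example. Each context order activates one hashed input, the active weights are summed and passed through a logistic function to give the probability of a 1 bit, and the weights are nudged toward the observed bit. Instead of running an arithmetic coder, the sketch just totals the ideal code length.

```python
import math
import zlib


def squash(x: float) -> float:
    """Logistic function, clamped to avoid overflow."""
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class HashedBitPredictor:
    """Single-layer network: each context order activates one hashed input."""

    def __init__(self, n_inputs=1 << 18, orders=(0, 1, 2, 3), rate=0.2):
        self.weights = [0.0] * n_inputs
        self.orders = orders
        self.rate = rate  # assumed learning rate, not taken from p5

    def _active_inputs(self, data: bytes, pos: int, partial: int) -> list:
        # partial holds the bits of the current byte seen so far, with a leading 1.
        active = []
        for order in self.orders:
            context = data[max(0, pos - order):pos]
            key = bytes([order, partial]) + context
            active.append(zlib.crc32(key) % len(self.weights))
        return active

    def code_length_bits(self, data: bytes) -> float:
        """Predict and learn bit by bit; return the ideal coded size in bits."""
        total = 0.0
        for pos, byte in enumerate(data):
            partial = 1
            for shift in range(7, -1, -1):
                bit = (byte >> shift) & 1
                active = self._active_inputs(data, pos, partial)
                p1 = squash(sum(self.weights[i] for i in active))
                total -= math.log2(p1 if bit else 1.0 - p1)
                for i in active:  # online gradient step toward the observed bit
                    self.weights[i] += self.rate * (bit - p1)
                partial = (partial << 1) | bit
        return total


model = HashedBitPredictor()
sample = b"ab" * 200
bits = model.code_length_bits(sample)
print(f"{bits:.0f} bits for {len(sample) * 8} input bits")
```

On highly repetitive input like this sample, the total should fall well below the 8 bits per byte of the raw data.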

paq1 (Matt Mahoney, Jan. 6, 2001) replaces the neural network in p5, p6, p12 with a fixed weighted averaging of model outputs. Described in an unpublished report, M. Mahoney, The PAQ1 Data Compression Program, 2002.
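A minimal sketch of combining model outputs with fixed weights is shown below. It only illustrates the idea; the weights and probabilities are invented, and the way PAQ1 actually represents and combines its model outputs is described in the report cited above, not here.

```python
def fixed_weighted_average(predictions, weights):
    """Combine per-model probabilities of a 1 bit using fixed weights."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need one weight per prediction")
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


# Invented example: a trusted high-order model and a less trusted low-order one.
mixed = fixed_weighted_average([0.9, 0.6], [3.0, 1.0])
print(round(mixed, 3))
```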

paq6 (Matt Mahoney and Serge Osnach, Dec. 30, 2003) evolved as a series of improvements to paq1. It is described in M. Mahoney, Adaptive Weighing of Context Models for Lossless Data Compression, Florida Tech. Technical Report CS-2005-16, 2005. The most significant improvements are replacing the fixed model weights with adaptive linear mixing (Matt Mahoney), and SSE (secondary symbol estimation) postprocessing on the output probability, and modeling of sparse contexts (Serge Osnach). Other models were added for x86 executable code, and automatic detection of fixed length records in binary data.

paqar 4.5 (Alexander Ratushnyak, Feb. 13, 2006) is the last of a long series of improvements to paq6 by Alexander Ratushnyak (paqar: multimixer model, .exe preprocessor, other model improvements), Przemyslaw Skibinski (WRT text preprocessing), Berto Destasio (model tuning), Fabio Buffoni (speed optimizations), David A. Scott (arithmetic coder optimizations), Jason Schmidt (model improvements), and Johan de Bock (compiler optimizations). For text, the biggest improvement was from WRT (Word Reducing Transform), which replaces words with shorter codes from an external English dictionary and was added to PAsQDa 1.0 on Jan. 18, 2005. WRT is described in P. Skibiński, Sz. Grabowski, and S. Deorowicz, Revisiting dictionary-based compression, Software - Practice & Experience, 35 (15), pp. 1455-1476, December 2005. There were a great number of versions by many contributors, mostly in 2004 when the PAQ series moved to the top of most compression benchmarks and attracted interest. Prior to PAQ, the top ranked programs were generally closed source.

paq8f (Matt Mahoney, Feb. 28, 2006) evolved from paq7 (Dec. 24, 2005) as a complete rewrite of paq6/paqar. The important improvements were replacing the adaptive linear mixing of models with a neural network (coded in MMX assembler), a more memory-efficient mapping of contexts to bit histories using a cache-aligned hash table, adaptive mapping of bit histories to probabilities, and models for bmp, tiff, and jpeg images. It models text using whole-word contexts and case folding, like all versions back to p12, but lacks WRT text preprocessing. It served as a baseline for the Hutter prize. Details are in the source code comments.
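One common way to describe this kind of neural network mixing is in the logistic domain: each model's probability is mapped through stretch(p) = ln(p/(1-p)), the stretched values are combined with learned weights, and the result is mapped back with the logistic function. The sketch below follows that description; the learning rate, initial weights, and the two toy models are assumptions for the example, and the real paq8f mixer (integer arithmetic, weight sets selected by context, MMX code) is documented in its source comments.

```python
import math


def stretch(p: float) -> float:
    """Inverse of the logistic function: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def squash(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


class LogisticMixer:
    """Mix model probabilities in the stretched domain with learned weights."""

    def __init__(self, n_models: int, rate: float = 0.1, initial: float = 0.3):
        self.weights = [initial] * n_models
        self.rate = rate
        self.inputs = [0.0] * n_models
        self.output = 0.5

    def mix(self, probabilities) -> float:
        self.inputs = [stretch(p) for p in probabilities]
        dot = sum(w * x for w, x in zip(self.weights, self.inputs))
        self.output = squash(dot)
        return self.output

    def update(self, bit: int) -> None:
        # Gradient step that reduces the coding cost of the observed bit.
        error = bit - self.output
        for i, x in enumerate(self.inputs):
            self.weights[i] += self.rate * error * x


# Toy data: model 0 is usually right (0.9 on the true bit), model 1 knows nothing.
mixer = LogisticMixer(n_models=2)
for bit in [1, 0, 1, 1, 0, 0, 1, 0] * 50:
    informed = 0.9 if bit else 0.1
    mixer.mix([informed, 0.5])
    mixer.update(bit)
print([round(w, 2) for w in mixer.weights])
```

In this toy run the weight of the informative model grows, while the uninformative model (always 0.5, which stretches to 0) keeps its initial weight.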

paq8g (Przemyslaw Skibinski, Mar. 3, 2006) adds back WRT text preprocessing.

paq8h (Alexander Ratushnyak, Mar. 24, 2006) added additional contexts to the neural network mixer. It was top ranked on enwik9 (but not enwik8) when the Hutter prize was launched on Aug. 6, 2006. This is the 78th version since p5.

raq8g by Rudi Cilibrasi, released 0721Z Aug. 16, 2006, is a modification of paq8f. It adds a NestModel to model nesting of parentheses and brackets. The test below for -7 is based on a Windows compile, raq8g.exe. The test for -8 was under Linux. The unzipped Linux executable is 27,660 bytes.
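The kind of information a nesting model can supply as context is sketched below. This is a hypothetical illustration, not raq8g's NestModel: for each character it records the current nesting depth and the innermost open bracket, which a context mixing compressor could hash into a context for predicting the next symbol.

```python
PAIRS = {"(": ")", "[": "]"}
CLOSERS = set(PAIRS.values())


def nesting_contexts(text: str) -> list:
    """For each character, return (depth, innermost open bracket) seen before it."""
    stack = []
    contexts = []
    for ch in text:
        contexts.append((len(stack), stack[-1] if stack else ""))
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS and stack and PAIRS[stack[-1]] == ch:
            stack.pop()  # only a matching closer reduces the depth
    return contexts


contexts = nesting_contexts("a[[b]]")
print(contexts)
```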

paq8hp1 (source code) by Alexander Ratushnyak, 1945Z Aug. 21, 2006. It is a modification of paq8h using a custom dictionary tuned to enwik8 for the Hutter prize. Because the Hutter prize requires no external dictionaries, the dictionary is spliced into the .exe file during the build process. When run, it creates the dictionary as a temporary file. The program must be run in the current directory (not in your PATH or with an explicit path), or else it can't find this file. The unzipped paq8hp1.exe is 206,764 bytes. Decompression was verified for enwik8 (60730 ns/b for -8, 60660 ns/b for -7). enwik9 is pending.

paq8hp2 (source code) by Alexander Ratushnyak, 0233Z Aug. 28, 2006 is an improved version of paq8hp1 submitted for the Hutter prize. paq8hp2.exe size is 205,276 bytes. It differs from paq8hp1 mainly in that the 43K word dictionary for 2-3 byte codes is sorted alphabetically. The 80 most frequent words, coded as 1 byte before compression, are grouped by syntactic type (pronoun, preposition, etc).

paq8hp3 (source code) by Alexander Ratushnyak, released Aug. 29, 2006, is an improved version of paq8hp2 submitted for the Hutter prize on Sept. 3, 2006. The 80 dictionary words coded with 1 byte and 2560 words coded with 2 bytes are organized into semantically related groups or by common suffixes. The 40,960 words with 3 byte codes are sorted from the last character in reverse alphabetical order. paq8hp3.exe is 178,468 bytes unzipped. enwik9 decompression is not yet verified. For enwik8, decompression is verified, with times of 60300 ns/b for compression and 60220 ns/b for decompression.

paq8hp4 (source code) by Alexander Ratushnyak, released and submitted for the Hutter prize on Sept. 10, 2006, is an improved version of paq8hp3. The dictionary is further organized into semantically related groups among 3-byte codes. The unzipped size of paq8hp4.exe is 206,336 bytes.

paq8hp5 (source code) by Alexander Ratushnyak, released Sept. 20, 2006, is an improved version of paq8hp4, submitted for the Hutter prize on Sept. 25, 2006. The unzipped size of paq8hp5.exe is 174,616 bytes (in spite of a slightly larger dictionary). The dictionary size is optimized for enwik8; a larger dictionary would improve compression of enwik9. Decompression is verified for enwik8 only (-8 at 74640 ns/b). A Linux port of paq8hp5 is by Лъчезар Илиев Георгиев (Luchezar Georgiev), Oct 26, 2006 (mirror).

paq8hp6 (source code) by Alexander Ratushnyak, released Oct. 29, 2006, is an improved version of paq8hp5. It was submitted as a Hutter prize candidate on Nov. 6, 2006. Unzipped paq8hp6.exe size is 170,400 bytes. The -8 option was not tested on enwik9 due to disk thrashing on my 2 GB PC. Compression was about 25% finished after 9 hours.

paq8j by Bill Pettis, Nov. 13, 2006, is based on paq8f (no dictionary) with model improvements taken from paq8hp5. It is a general purpose compressor like paq8f, not specialized for text.

paq8ja.zip by Serge Osnach, Nov. 16, 2006, is an improvement of paq8j, using additional contexts based on character classifications.

paq8jb.zip by Serge Osnach, Nov. 22, 2006, adds contexts using the distance to an anchor byte (x00, space, newline, xff) combined with previous characters. The -8 test caused some minor disk thrashing at 2 GB memory under WinXP Home (82% CPU usage). Time reported is wall time.

paq8jc.zip by Serge Osnach, Nov. 28, 2006, improves the record model for better compression of some binary files, although it is slightly worse for text. Time for -8 is wall time at 72% CPU usage.

paq8hp7a by Alexander Ratushnyak, Dec. 7, 2006, was intended to supersede paq8hp6 as a Hutter prize entry, then was withdrawn on Dec. 10, 2006 with the release of paq8hp7. Unzipped executable size is 151,664 bytes. -8 for enwik9 (but not enwik8) caused disk thrashing on my computer (2 GB, WinXP).

paq8hp7 (source code) by Alexander Ratushnyak, Dec. 10, 2006, is a Hutter prize entry. Unzipped paq8hp7.exe size is 152,556 bytes.

paq8jd by Bill Pettis, Dec. 30, 2006, improves on paq8j with additional SSE (APM) stages. enwik8 -8 caused some disk thrashing at 2 GB memory.

paq8hp8 (source code) by Alexander Ratushnyak, Jan. 18, 2007, is a Hutter prize entry (replacing an incorrect version posted 2 days earlier). Unzipped size is 152,692 bytes. The dictionary is identical to paq8hp7.

paq8k is by Bill Pettis, Feb. 13, 2007.

paq8hp9 (mirror) (source code) by Alexander Ratushnyak, Feb. 20, 2007, is a Hutter prize entry. Only the -7 option works. The unzipped size of paq8hp9.exe is 112,628 bytes.

paq8hp9any (Feb. 23, 2007) by Alexander Ratushnyak is a version compatible with paq8hp9 -7 that uses an external dictionary and in which all options work. However, the zipped program is larger and -8 was not tested due to disk thrashing, so results are unchanged.

paq8l by Matt Mahoney, Mar. 8, 2007, is based on paq8jd. It adds a DMC model and minor improvements.

paq8hp10 (mirror), Mar. 26, 2007, by Alexander Ratushnyak was derived from paq8hp9 as a Hutter prize entry. The unzipped size is 103,224 bytes. Only the -7 option works.

paq8hp10any (source code), Mar. 31, 2007, by Alexander Ratushnyak is archive compatible with paq8hp10 -7 but works with other memory options. When run, paq8hp10.exe and both dictionary files should be in the current directory. This program is not a Hutter prize entry.

paq8hp11 (mirror) by Alexander Ratushnyak, Apr. 30, 2007, is a Hutter prize entry. paq8hp11.exe is 99,816 bytes. Like paq8hp10, it works only with the -7 option.

  To compress:   paq8hp11 -7 enwik8.paq8hp11 enwik8
  To decompress: paq8hp11 enwik8.paq8hp11

paq8hp11any (source code) by Alexander Ratushnyak, May 2, 2007, is a paq8hp11 variant that accepts any memory option. It was optimized for speed rather than size. It includes two dictionary files which must be present in the current directory when run, unlike paq8hp11 where the dictionary is self extracted. -8 selects 1850 MB memory. -7 produces the same archive as paq8hp11. Run speeds for -8 enwik8 are 76770+76820 ns/B.

paq8hp12 (mirror) by Alexander Ratushnyak, May 14, 2007, is a Hutter prize entry. paq8hp12.exe size is 99,696 bytes. It works only with the -7 option like paq8hp11.

paq8hp12any (source code) by Alexander Ratushnyak, May 20, 2007, is a paq8hp12 variant that accepts any memory option (like paq8hp11any). The -7 option produces an archive identical to that of paq8hp12.

paq8fthis2 by Jan Ondrus, Aug. 12, 2007, is paq8f with an improved model for compressing JPEG images. It is otherwise archive compatible with paq8f for data without JPEG images (such as enwik8 and enwik9).

paq8n by Matt Mahoney, Aug. 18, 2007, combines paq8l with the JPEG model from paq8fthis2.

paq8o and paq8osse by Andreas Morphis, Aug. 22, 2007, are paq8n with an improved model for .bmp images. The two executables produce identical archives. paq8o.exe is for Pentium MMX or higher. paq8osse.exe is for newer processors that support SSE2 instructions, such as the Pentium 4. It is about 8% faster, but uses more memory. Both are built from the same C++ source but use different (but equivalent) assembler code to implement the neural network mixer. paq8osse.exe was compiled with Intel C++, which produces slightly faster executables than the g++ used in earlier versions. The current version is paq8o ver. 2 (Aug. 24, 2007), which fixes the file name extension (was .paq8n) but does not change compression. The benchmark is based on the first version.

paq8o3 by KZ, Sept. 11, 2007, combines paq8o with an improved JPEG model from paq8fthis3 (Jan Ondrus, Sept. 8, 2007) and an improved model for grayscale PGM images from paq8i (Pavel Holoborodko, Aug. 18, 2006). Text compression is unchanged from paq8l, paq8m, paq8o, or paq8o2.

paq8o4 v1 by KZ, Sept. 15, 2007, includes a grayscale .bmp model (based on the grayscale PGM model). Text compression is unaffected. It was compiled with Intel C++. paq8o4 v2 by Matt Mahoney, Sept. 17, 2007, is a port to g++ which allows wildcards, directory traversal, and directory creation, but is 8% slower. It is archive compatible with v1.

paq8o6 by KZ, Sept. 28, 2007, is based on paq8o5 by KZ, Sept. 21, 2007 with the improved JPEG model from paq8fthis4 by Jan Ondrus, Sept. 27, 2007. paq8o5 is paq8o4 with an improved StateMap from lpaq1. The improved compression of enwik8 comes from this StateMap. Compression of enwik8 is unchanged from paq8o5 to paq8o6.

paq8o7 by KZ, Oct. 16, 2007, improves paq8o6 with improved JPEG compression and support for 4 and 8 bit BMP images. Text is not affected.

paq8o8 by KZ, Oct. 23, 2007, further improves JPEG compression over paq8o7.

paq8o8-jun7 is a DOS port of paq8o8 by Rugxulo, June 7, 2008.

paq8o10t is by KZ, June 11, 2008. Discussion.

Options select memory usage as shown in the table. Early versions took no options.

           Compression     Compressed size      Decompressor  Total size   Time (ns/byte)
Program      Options      enwik8      enwik9     size (zip)   enwik9+prog  Comp  Decomp  Mem Note
-------      -------    ----------  -----------  -----------  -----------  -----  -----  --- ----
p5                      31,255,092                   9,298 s                3421           1   6
p6                      25,377,998                   9,421 s                4190          16   6
p12                     24,714,219                   9,598 s                4160          16   6
paq1                    22,156,982                  16,436 s                7800   7790   50
paq6 v2         -8      19,589,267                  26,548 s               47624         808
paqar 4.5       -7      18,388,609                 414,164 s              118690 119010  470
paq8f           -7      18,289,559                  34,371 x               68960         854
                -8      18,075,265                  34,371 x               69170        1693
paq8g           -7      17,817,246                 804,867 s               44130         854
paq8h           -7      17,674,700  147,195,723    801,612 s  147,997,335  56511  57278  854   5
raq8g           -7      18,132,399                  33,483 x               84555  84793 1089
                -8      17,923,022                  27,660 x              337430~330000 2095  17
                -8      17,923,022                  27,660 x              196540~196000 2095  15
paq8hp1         -7      17,566,769                 205,783 x               60170  60660  748
                -8      17,397,023  142,477,977    205,783 x  142,683,760  63317        1595
paq8hp2         -7      17,390,490                 204,557 x               62000  62330  747
                -8      17,223,661  141,145,684    204,557 x  141,350,241  65323        1584
paq8hp3         -7      17,241,280                 177,477 x               61360  59690  742
                -8      17,085,021  139,905,045    177,477 x  140,082,522  63420        1586
paq8hp4         -7      17,039,173                 198,525 x              ~65000  65110  755
                -8      16,889,237  138,188,695    198,525 x  138,387,220  67956  68120 1598
paq8hp5         -7      16,898,402                 161,887 x               76300  77710  900  19
                -8      16,761,044  137,017,311    161,887 x  137,179,198 ~85153  75162 1787
paq8hp6         -7      16,731,800  138,828,889    166,715 x  138,995,604  74953  73707  941
                -8      16,568,451  135,281,289    166,715 x  135,448,004  60865        1807  21
paq8j           -7      18,208,284                  39,366 s              138030 138260  959
                -8      17,991,628                  39,366 s              138990 136500 1896
paq8ja          -7      18,184,224                  39,781 s              148560 143200  993
                -8      17,968,233                  39,781 s              154700 153990 1965
paq8jb          -7      18,180,081                  39,982 s              148570 148200 1009
                -8      17,964,363                  39,982 s              188590 190190 1999
paq8jc          -7      18,185,705                  40,064 s              150910 152080 1017
                -8      17,970,943                  40,064 s              224410 234900 2015
paq8hp7a        -7      16,592,672  137,441,743    150,678 x  137,592,421  79795         940
                -8      16,431,239                 150,678 x               76940  77600 1790
paq8hp7         -7      16,579,500                 151,633 x               79620  79660  940
                -8      16,417,646  133,835,408    151,633 x  133,987,041  66074        1850  21
paq8jd          -7      18,158,159                  40,460 s              157340 156350 1030
                -8      17,943,042                  40,460 s              406730        2028
paq8hp8         -7      16,528,353                 151,711 x               79580  79970  940
                -8      16,372,960  133,271,398    151,711 x  133,423,109  64639        1849  22
paq8k           -8      18,239,915                  41,881 s              457150        1463
paq8hp9         -7      16,516,789  136,676,674    111,653 x  136,788,327  84529  85957  940
paq8l           -6      18,518,485                  35,955 x              133910         435
                -7      18,168,563                  35,955 x              134770         837
                -8      17,916,450                  35,955 x              136000 136390 1643
paq8hp10        -7      16,490,947                 102,256 x               86720  88890  940
paq8hp10any     -8      16,335,197  132,979,531    333,925 x  133,313,456  55639        1849  22
paq8hp11        -7      16,459,515                  98,851 x              129540 128530  947
paq8hp11any     -8      16,304,862  132,757,799    327,608 s  133,085,407  57503        1850  22
paq8hp12        -7      16,381,959                  98,745 x              130820 131480  936
paq8hp12any     -7      16,381,959                 330,700 x               78860  76190  941
                -8      16,230,028  132,045,026    330,700 x  132,375,726  56993        1850  22
paq8fthis2      -8      18,075,265                  34,846 x               69100  69310 1693
paq8n           -8      17,916,420                  37,402 x              134880 135480 1643
paq8o           -8      17,916,451                  42,389 s              135850 135260 1643
paq8osse        -8      17,916,451                  42,290 s              125260 124570 1778
paq8o3          -8      17,916,450                  43,745 s              134580 134530 1636
paq8o4 v1       -8      17,916,450                  43,876 s              126780 126560 1636
paq8o6          -8      17,904,721                  44,883 s              139530 139520 1712
paq8o7          -8      17,904,756                  45,979 s              139140 138530 1574
paq8o8          -8      17,904,756                  46,381 s              139370 139150 1574
paq8o8-intel    -1      22,260,679                  46,381 s               24687          37  24
paq8o8z-jun7    -1      22,260,679                  49,085 s               25919          37  24
                -1      22,260,680                                         29639          37  25
paq8o10t        -8      17,772,821                  50,865 s              144250 143720 1591

paq8hp1 through paq8hp12 can be used as a preprocessor to other compressors by compressing with option -0. In the following tests on ppmonstr, options were tuned for the best possible compression of enwik8 with 2 GB memory (1.65 GB available under WinXP). The xml-wrt 2.0 options are -l0 -w -s -c -b255 -m100 -e2300 (level 0, turn off word containers, turn off space modeling, turn off containers, 255 MB buffer for dictionary, 100 MB buffer, 2300 word dictionary). The xml-wrt 3.0 options are -l0 -b255 -m255 -3 -s -e7000 (-3 = optimize for PPM).

xml-wrt prepends the dictionary to its output. To make the comparison fair, the compressed size of the dictionary must be added. This is done in two ways: first by compressing the preprocessed text and the dictionary separately and adding the compressed sizes (the "total" column), and second by prepending the dictionary to the preprocessed text before compression (the "dict+enwik8" column). The first method compresses about 1-2 KB smaller.

The uncompressed size of each dictionary for paq8hp1 through paq8hp4 is 398,210 bytes. They contain identical words, but in different order. The first two dictionaries are identical. They compress smaller because they are sorted alphabetically. The dictionary for paq8hp5 is 411,681 bytes. It contains all of the words in the first 4 dictionaries plus 1280 new words (44,880 total).

Preprocessor    Compressor                 enwik8     dict      total    dict+enwik8
------------    ----------               ----------  -------  ----------  ---------
paq8hp1 -0    | ppmonstr J -m1650 -o64   18,322,077   81,190  18,403,267  18,403,991
paq8hp2 -0    | ppmonstr J -m1650 -o64   18,266,424   81,190  18,347,614  18,349,587
paq8hp3 -0    | ppmonstr J -m1650 -o64   18,197,797  107,583  18,305,380  18,306,690
paq8hp4 -0    | ppmonstr J -m1650 -o64   18,170,944  107,590  18,278,534  18,280,098
paq8hp5 -0    | ppmonstr J -m1650 -o64   18,154,921  111,935  18,266,856  18,267,556
xml-wrt 2.0   | ppmonstr J -m1650 -o64   18,625,624
xml-wrt 3.0   | ppmonstr J -m1650 -o64   18,494,374
 (none)         ppmonstr J -m1650 -o16   19,062,555
                ppmonstr J -m1650 -o32   19,084,964
                ppmonstr J -m1650 -o64   19,098,634

The transform done by paq8hp1 through paq8hp5 is based on WRT by Przemyslaw Skibinski, which first appeared in PAsQDa and paqar, and later in paq8g and xml-wrt. The steps are as follows:

WRT has additional capabilities depending on input, such as skipping encoding if little or no text is detected. The dictionary format is one word per line (linefeed only) with a 13 line header.
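
As a rough illustration of the general idea behind a word-replacing transform (this is not the actual WRT or paq8hp code), the Python sketch below replaces dictionary words with single-byte codes and then reverses the mapping. The function names, the three-word dictionary, and the code assignment (byte 0x80 + index, which assumes pure ASCII input) are all hypothetical choices made for the example.

```python
import re

def build_codes(words):
    # Hypothetical code assignment: the i-th dictionary word gets the byte 0x80 + i.
    if len(words) > 128:
        raise ValueError("this toy scheme supports at most 128 words")
    return {w: bytes([0x80 + i]) for i, w in enumerate(words)}

def encode(text, words):
    codes = build_codes(words)
    out = bytearray()
    # Split the input into runs of letters and runs of everything else.
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if token in codes:
            out += codes[token]            # dictionary word: emit its 1-byte code
        else:
            out += token.encode("ascii")   # ASCII only, so bytes >= 0x80 stay free
    return bytes(out)

def decode(data, words):
    # Bytes >= 0x80 are dictionary codes; everything else is literal ASCII.
    return "".join(words[b - 0x80] if b >= 0x80 else chr(b) for b in data)

toy_words = ["the", "of", "and"]
sample = "the history of the world and the sea"
packed = encode(sample, toy_words)
restored = decode(packed, toy_words)
```

The output of such a transform is then passed to the real compressor, which sees fewer, more regular symbols. A real dictionary transform also has to handle capitalization, escape bytes for non-ASCII input, and multi-byte codes for less frequent words, none of which is attempted here.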

.1448 drt|lpaq9i

lpaq versions 1 through 8 may be downloaded here. lpaq9* can be downloaded here.

lpaq1 is a free, open source (GPL) file compressor by Matt Mahoney, July 24, 2007. It uses context mixing. It is a "lite" version of paq8l, about 35 times faster at the cost of about 10% in compression. The "9" option selects maximum memory. The options range from 0 (6 MB) to 9 (1.5 GB). Memory usage is 3 + 3*2^N MB, N = 0..9.
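
The memory formula can be checked with a few lines of Python, reading it as 3 + 3*2^N MB, which reproduces the stated endpoints of 6 MB for option 0 and about 1.5 GB for option 9. The function name is only for illustration.

```python
def lpaq1_memory_mb(n):
    """Memory in MB for lpaq1 option n (0..9), using 3 + 3 * 2**n."""
    if not 0 <= n <= 9:
        raise ValueError("option must be in 0..9")
    return 3 + 3 * 2 ** n

# Memory for every option, lowest to highest.
memory_table = {n: lpaq1_memory_mb(n) for n in range(10)}
```

Option 9 gives 1539 MB, which agrees with the Mem column for lpaq1 in the results table.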

The compressor mixes 7 contexts: orders 1, 2, 3, 4, 6, a unigram word context (consecutive letters, case insensitive), and a matched bit context. The contexts (except the matched bit) are mapped to nonstationary bit histories using nibble-aligned hash tables, then mapped to bit prediction probabilities using stationary adaptive tables with bit counts to control adaptation rate. The matched bit context maps the predicted bit (based on a context match), match length and order-1 context (or order 0 if no match) to a bit prediction. The probabilities are combined in the logistic domain (log(p/(1-p))) using a single layer neural network selected by a small context (3 high bits of last byte + context order), then passed through 2 SSE stages (orders 0 and 1) and arithmetic coded. Except for one model for ASCII text, there are no specialized models for binary data, .exe, .bmp, .jpeg, etc.
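
A minimal floating-point sketch of mixing in the logistic domain is shown below. It is not lpaq1's code: the class name, the initial weights, and the learning rate are assumptions, and the real program uses integer arithmetic and selects one of several weight sets by a small context. The sketch only shows the core idea: stretch each model's probability with log(p/(1-p)), take a weighted sum, squash the result back to a probability, and nudge the weights toward the bit that actually occurred.

```python
import math

def stretch(p):
    # Map a probability to the logistic domain.
    return math.log(p / (1.0 - p))

def squash(x):
    # Inverse of stretch: map a logistic-domain value back to a probability.
    return 1.0 / (1.0 + math.exp(-x))

class Mixer:
    def __init__(self, n_inputs, learning_rate=0.02):
        self.weights = [0.3] * n_inputs   # hypothetical initial weights
        self.lr = learning_rate           # hypothetical learning rate
        self.inputs = [0.0] * n_inputs
        self.p = 0.5

    def predict(self, probs):
        # Clamp inputs so stretch() never sees exactly 0 or 1.
        self.inputs = [stretch(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.weights, self.inputs)))
        return self.p

    def update(self, bit):
        # Gradient step that reduces coding cost: error times stretched input.
        err = bit - self.p
        self.weights = [w + self.lr * err * x
                        for w, x in zip(self.weights, self.inputs)]

mixer = Mixer(2)
first = mixer.predict([0.9, 0.4])
for _ in range(200):
    mixer.predict([0.9, 0.4])
    mixer.update(1)               # the bit keeps turning out to be 1
last = mixer.predict([0.9, 0.4])
```

After repeated updates with bit 1, the mixer trusts the first model (which predicted 0.9) more and the mixed probability rises above its starting value.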

lpaq2 by Alexander Ratushnyak, Sept. 20, 2007, contains some speed optimizations.

lprepaq 1.2 by Christian Schnaader, Sept. 29, 2007, is lpaq1 combined with precomp as a preprocessor. precomp compresses JPEG files and also expands data segments compressed with zlib, often making them more compressible. This preprocessing has no effect on text files.

lpaq3 and elpaq3 by Alexander Ratushnyak, Sept. 29, 2007, are two versions built from the same source code. When compiled with -DWIKI, the result is elpaq3, which is tuned for large text files. The normal compile produces lpaq3.

lpaq3a by Alexander Ratushnyak, Sept. 30, 2007, improves compression on some files over lpaq3 (but not enwik8/9). The archive also contains lpaq3e.exe, which is an archive-compatible Intel compile of elpaq3.exe.

lpaq4 and lpaq4e (mirror) are by Alexander Ratushnyak, Oct. 1, 2007. lpaq4e is tuned for large text files.

lpaq5 and lpaq5e are by Alexander Ratushnyak, Oct. 16, 2007. Option 9 selects 1542 MB memory. lpaq5e is tuned for large text files. It includes separate programs for compression only (lpaq5e-c.exe) and decompression only (lpaq5e-d.exe). Tests were done with these programs, rather than the version that does both (lpaq5e.exe).

lpaq6 and lpaq6e are by Alexander Ratushnyak, Oct. 22, 2007. Option 9 selects 1542 MB memory. lpaq6e is tuned for large text files. lpaq6 includes an E8E9 transform for compressing x86 executables.

lpaq7 and lpaq7e (mirror) are by Alexander Ratushnyak, Oct. 31, 2007.

lpaq8 and lpaq8e are by Alexander Ratushnyak, Dec. 10, 2007. The executables are packed with upack. zip -9 would make them larger.

lpaq1a by Matt Mahoney, Dec. 21, 2007, uses the same model as lpaq1 but replaces the arithmetic coder with the asymmetric binary coder from fpaqb.

lpq1 by Matt Mahoney, Dec. 23, 2007, is an archiver (not a file compressor) based on lpaq1 option 7.

drt|lpaq9e (mirror) is by Alexander Ratushnyak, Feb. 20, 2008. It is specialized for English text. It includes a separate program drt.exe (without source code) which performs a dictionary transform prior to compression with lpaq9e. The option 9 is for lpaq9e which selects maximum memory. The program size is computed by adding lpaq9e.exe, drt.exe, and the compressed dictionary, which must be uncompressed with lpaq9e before running. The size is smaller without a zip archive. Decompression consists of uncompressing the dictionary with lpaq9e, uncompressing the transformed file with lpaq9e, and reversing the transform with drt. Run times are for the sum of all three operations (1+62+2943, 1+2929+45 sec).
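
The per-stage times convert to the ns/byte figures in the table as sketched below, assuming the times in parentheses are for enwik9 (10^9 bytes), which is consistent with the table. The function name is illustrative.

```python
def ns_per_byte(stage_seconds, n_bytes=10**9):
    """Total run time of all stages, expressed in nanoseconds per input byte."""
    return sum(stage_seconds) * 1e9 / n_bytes

# drt|lpaq9e stage times in seconds, as listed in the text.
comp = ns_per_byte([1, 62, 2943])
decomp = ns_per_byte([1, 2929, 45])
```

For a 10^9 byte input, the total in seconds and the rate in ns/byte are numerically equal, so the sums 3006 and 2975 appear directly in the Comp and Deco columns.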

lpaq9f by Alexander Ratushnyak, Apr. 27, 2008, works like lpaq9e. Run times are (2+55+2801, 2+2819+38 sec). drt uses 8 MB for compression and 4 MB for decompression.

lpaq9g by Alexander Ratushnyak, May 23, 2008, works like lpaq9e. Run times are (2+51+2691, 2+2682+38 sec).

lpaq9h by Alexander Ratushnyak, June 3, 2008, works like lpaq9e. Run times are (2+53+2530, 2+2529+44 sec).

lpaq9i by Alexander Ratushnyak, June 13, 2008, works like lpaq9e. Run times are (2+59+2425, 2+2453+46 sec). drt.exe and the dictionary file (tmpdict0.dic) are unchanged in all versions starting with lpaq9f.

Prog       Opt     enwik8      enwik9         prog       Total       Comp  Deco Mem  Alg
----       ---   ----------  -----------      ----     -----------   ----  ---- ---- ---
lpaq1       9    19,755,948  164,508,919      6,676 x  164,515,595   3646  3594 1539 CM
lpaq2       9    19,755,471  164,496,295      6,888 x  164,503,183   3260  3354 1539 CM
lprepaq 1.2 9    19,755,989  164,509,300    189,891 x  164,699,191   8696  7888 1582 CM
lpaq3       9    19,580,276  165,600,121      7,514 x  165,607,635   3695  3735 1542 CM
elpaq3      9    19,392,604  160,081,507      7,377 x  160,088,884   3411  3454 1542 CM
lpaq3a      9    19,585,951  165,661,890     12,004 s  165,673,894   4177  4163 1542 CM
lpaq3e      9    19,392,604  160,081,507     12,004 s  160,093,511   3967  3932 1542 CM
lpaq4       9    19,583,905  165,603,612      7,117 x  165,610,729   3693  3697 1542 CM
lpaq4e      9    19,358,662  159,675,213      6,990 x  159,682,203   3383  3422 1542 CM
lpaq5       9    19,455,395  161,410,276      8,382 x  161,418,658   3614  3630 1542 CM
lpaq5e      9    19,078,767  156,194,860      7,841 xd 156,202,701   3428  3605 1542 CM
lpaq6       9    19,562,861  165,224,012      8,848 x  165,232,860   3586  3624 1542 CM
lpaq6e      9    19,054,076  155,943,020      8,866 x  155,951,886   3420  3478 1542 CM
lpaq7       9    19,557,894  162,359,435      9,078 x  163,368,513   3922  3850 1542 CM
lpaq7e      9    19,039,516  155,840,757      8,570 x  155,849,327   3477  3490 1542 CM
lpaq8       9    19,523,803  161,987,713      9,676 x  161,997,389   3682  3718 1542 CM
lpaq8e      9    18,982,007  155,232,477      8,888 x  155,241,365   3424  3475 1542 CM
lpaq1a      9    19,759,778  164,547,926      8,558 x  164,556,484   3462  3423 1540 CM
lpq1             19,888,399  168,467,267      9,151 x  168,476,408   3389  3402  387 CM
drt|lpaq9e  9    18,151,024  145,628,635    110,844 x  145,739,479   3006  2975 1542 CM
drt|lpaq9f  9    18,079,247  144,877,844    110,864 x  144,988,708   2858  2859 1542 CM
drt|lpaq9g  9    18,069,107  144,838,636    110,318 x  144,948,954   2744  2722 1542 CM
drt|lpaq9h  9    18,067,711  144,763,248    110,376 x  144,873,624   2585  2575 1542 CM
drt|lpaq9i  9    18,065,347  144,752,858    110,149 x  144,863,007   2486  2501 1542 CM

drt may be combined with other compressors to improve compression. The following were obtained using drt and tmpdict0.dic (from lpaq9i) with ppmonstr J (PPM). Option -m1650 selects 1650 MB memory. -r1 partially rebuilds the model when memory is exhausted. -o selects the PPM model order. Compression time is for ppmonstr only. Mem8 is actual memory used to compress enwik8.drt. enwik9.drt always uses 1650 MB. As a separate compressor, the compressor size would be 147,915 for a zip file containing drt.exe, ppmonstr.exe, and tmpdict0.pmm (tmpdict0.dic compressed with ppmonstr -m1650 -r1 -o64). Total size, using the best enwik9 result (-o11), would be 148,047,289.

    Compressors          options         enwik8    enwik9       Comp Mem8
-------------------  ----------------  ----------  -----------  ---- ----
drt 9i | ppmonstr J  -m1650 -r1 -o10   18,185,633  147,936,682  2509  825
                     -m1650 -r1 -o11   18,166,961  147,899,374  2634  895
                     -m1650 -r1 -o12   18,152,982  147,907,628  2661  953
                     -m1650 -r1 -o16   18,142,625  148,306,179  2888 1109
                     -m1650 -r1 -o32   18,124,722  149,857,650  3361 1371
                     -m1650 -r1 -o64   18,122,785  151,343,426  3870 1554
                     -m1650 -r1 -o128  18,130,333                    1650
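
The quoted total of 148,047,289 can be reproduced from the table by taking the smallest enwik9 size over the tested orders and adding the 147,915 byte program size. The variable names below are illustrative; the numbers are copied from the table and text above.

```python
# enwik9 compressed sizes for drt 9i | ppmonstr J, keyed by PPM order (-o).
enwik9_by_order = {
    10: 147_936_682,
    11: 147_899_374,
    12: 147_907_628,
    16: 148_306_179,
    32: 149_857_650,
    64: 151_343_426,
}
program_zip = 147_915  # zip of drt.exe, ppmonstr.exe, and tmpdict0.pmm

# Pick the order with the smallest compressed size, then add the program size.
best_order = min(enwik9_by_order, key=enwik9_by_order.get)
total = enwik9_by_order[best_order] + program_zip
```

Note that the best order differs by file size: enwik8 compresses best at -o64, while enwik9 is smallest at -o11, where the 1650 MB memory limit matters more.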

.1489 xwrt | ppmonstr

xml-wrt 2.0 and higher and xwrt 3.2 can be used as either a standalone compressor or as a preprocessor to other compressors. The table below shows the best known settings for enwik9 and enwik8 for xml-wrt 3.0 and 2.0 as a preprocessor to ppmonstr var. J, the best known combination for which xml-wrt improves compression. xml-wrt 1.0 is a preprocessor only. See also xml-wrt and xwrt as a standalone compressor.


                                                                         Compressed size      Decompressor  Total size   Time (ns/byte)
Program/options                                                         enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg
-------------------------------------------------------------------   ----------  -----------  -----------  -----------  ----- -----  --- ---
xml-wrt 3.0 -l0 -b255 -m255 -3 -s -e20000    | ppmonstr J -m1650 -o10 18,592,499  150,004,636     82,466 sx 150,087,102   3067  2708 1650 PPM
xml-wrt 3.0 -l0 -b255 -m255 -3 -s -e7000     | ppmonstr J -m1650 -o64 18,494,374                  82,466 sx               3500  3340 1650 PPM
xml-wrt 2.0 -l0 -w -s -c -b255 -m100 -e10000 | ppmonstr J -m1700 -o10 18,794,295  150,651,873     67,309 sx 150,719,182   2715 ~2650 1700 PPM
xml-wrt 2.0 -l0 -w -s -c -b255 -m100 -e2300  | ppmonstr J -m1650 -o64 18,625,624                  67,309 sx               3550  3360 1650 PPM
xml-wrt 2.0 -l0 -w -s -c -b255 -m100 -e10000 | ppmonstr J -m800 -o8   18,863,790  154,223,582     67,309 sx 154,290,891   2820        800 PPM
xml-wrt 1.0 -f800                            | ppmonstr J -m800 -o8   19,043,178  154,749,585     56,837 sx 154,806,422   2702 ~2700  800 PPM

xml-wrt 1.0 (XML Word Reducing Transform) is a free command line single file preprocessor with source code by Przemyslaw Skibinski, May 10, 2006. It is not intended to compress files by itself (although it does somewhat). Rather, it is intended to improve the compressibility of text and XML files by replacing common words and XML substrings with shorter symbols. (So it is actually LZW with a static dictionary prepended to the output). It improves compression for most programs except for those that already have English text models such as paq8h. Some additional results are shown below for combinations with some other compressors.

                     Compression                      Compressed size      Decompressor  Total size   Time (ns/byte)
Program                Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Notes
-------                -------                     ----------  -----------  -----------  -----------  ----- -----  -----
xml-wrt 1.0|ppmonstr J -f1800 | -m800 -o10         18,965,658  155,066,074     56,837 sx 155,122,911   2905  2809
xml-wrt 1.0|slim23d    -f1800 | -m700 -o12         19,163,987  156,734,571     69,453 x  156,804,024   4702  4717
xml-wrt 1.0|ppmd J1    -f1800 | -m256 -o8 -r1      21,128,019  178,154,529     25,917 s  178,180,446    717   722

The following table shows the compressed size (without decompressor except SFX) of enwik8 before and after the XML-WRT transform with option -f180 for several compressors. A ratio less than 1 means that XML-WRT improves compression.


Program           Options                       enwik8   enwik8.xwrt  Ratio   Alg
-------           -------                    -----------  ----------  ------  ---
paq8h             -7                          17,674,700  18,341,959  1.0378  CM
ppmonstr J        -o10 -m800                  19,338,065  18,886,224  0.9766  PPM
slim23d           -m700 -o10                  19,264,094  18,938,602  0.9830  PPM
WinUDA 2.91       mode 3 (194 MB)             20,332,366  20,859,165  1.0259  CM
ppmd J1           -o10 -m256 -r1              21,388,296  20,945,220  0.9793  PPM
uhbc 1.0          -m3 -b100m                  20,930,838  21,171,204  1.0115  BWT
M03exp            32 MB                       21,948,192  21,583,059  0.9834  BWT
sbc               -ad -m3 -b63                22,470,539  22,216,425  0.9887  BWT
WinRAR 3.60b3     -mc7:128t+ -sfxWinCon.sfx   22,713,569  22,457,785  0.9887  PPM
PX 1.0                                        24,971,871  22,818,070  0.9137  CM
uharc 0.6b        -mx -md32768                23,911,123  22,915,299  0.9583  PPM
chile 0.3d-1      -b=40000                    23,408,335  22,884,519  0.9776  BWT
cabarc 1.00.0601  -m lzx:21                   28,465,607  25,739,214  0.9042  LZ77
WinACE            -sfx -m5                    30,919,182  27,112,651  0.8769
bzip2 1.0.3                                   29,008,758  27,339,845  0.9425  BWT
gzip 1.3.5        -9                          36,445,248  30,403,738  0.8342  LZ77
pkzip 2.0.4                                   36,934,712  30,729,525  0.8432  LZ77
thor 0.9a         ex                          41,670,916  32,586,444  0.7820
compress 4.3d                                 45,763,941  38,485,494  0.8409  LZW
Original size                                100,000,000  52,174,989  0.5217
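
The Ratio column is the size after the transform divided by the size before it, rounded to four places. A short check on three rows, with sizes copied from the table (the dictionary layout and names are just for the example):

```python
# (enwik8 size, enwik8.xwrt size) for a few compressors from the table.
sizes = {
    "paq8h": (17_674_700, 18_341_959),
    "gzip 1.3.5": (36_445_248, 30_403_738),
    "Original size": (100_000_000, 52_174_989),
}

# Ratio below 1 means XML-WRT preprocessing made the final output smaller.
ratios = {name: round(after / before, 4)
          for name, (before, after) in sizes.items()}
helped = [name for name, r in ratios.items() if r < 1]
```

paq8h, which already models English text, gets slightly worse, while gzip improves by about 17%.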

The -f option (default -f6) selects the minimum frequency a word must have to be added to the dictionary. The optimal setting depends on the input size. When used with ppmd or ppmonstr (the best compressors improved by XML-WRT), the optimal settings are about -f180 for enwik8 and -f1800 for enwik9, which results in a dictionary of 7697 words for enwik8 and 6657 words for enwik9. The following table shows the effect of the -f and -o options for ppmonstr -m800 enwik9. The smallest size in the table is 154,749,585 bytes, at -f800 -o8.

 -f       -o7          -o8          -o9          -o10        -o11         -o12         -o16         -o32
 ---  -----------  -----------  -----------  -----------  -----------  -----------  -----------  -----------
 100                                                                   155,908,621
 200                                                                   155,775,164
 300                                                                   155,653,815
 500               154,884,542               155,367,681  155,465,355  155,547,660
 600               154,787,455                                         155,497,645
 800               154,749,585
1000  154,909,136  154,794,501  154,951,751  155,122,278  155,306,526  155,409,926  155,948,066  157,901,320
1500  155,092,513  154,895,455  154,999,654  155,073,186  155,306,526  155,301,322
1800  155,191,178  154,924,936  155,036,534  155,066,074  155,366,281  155,297,828
2000               154,998,528                                         155,296,112
3000                                                                   155,379,959

The following table shows that the optimal setting for -f is lower for smaller files (with ppmd):

              Compression          Compressed size      Decompressor  Total size   Time (ns/byte)
Program         Options           enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  
-------         -------         ----------  -----------  -----------  -----------  ----- -----  
  xml-wrt 1.0   -f1800         (70,826,140)(532,089,443)   (14,818 s)(532,104,261)  (115) (103)
+ ppmd J        -m256 -o8 -r1   21,128,019  178,154,529     41,653 sx 178,196,182    712   723
  xml-wrt 1.0   -f180          (52,174,989)(468,964,104)   (14,818 s)(468,978,922)  (113) (103)
+ ppmd J        -m256 -o8 -r1   20,910,527  178,215,315     41,653 sx 178,256,968    690   699
ppmd J          -m256 -o10 -r1  21,388,296  183,964,915     26,835 x  183,991,750    880   895

The default values of -s (disable spaces model) and -t (disable try smaller word) appear to work best on this data.

xml-wrt -f1800 enwik9 | ppmonstr -m800 -o12
-------------------------------------------
(default)   154,924,936
-s          155,040,558
-t          155,421,035
-s -t       155,542,575

xml-wrt 2.0 released June 14, 2006 (updated June 19, 2006) has additional transform options, and also includes LZ77 (zlib) and LZMA (LZ with arithmetic coding) compression. When used as a preprocessor, this compression is turned off. enwik9 was compressed using the options:

  xml-wrt -l0 -w -s -c -b255 -m100 -e10000 enwik9
  ppmonstr e -o8 -m800 enwik9.xwrt

The option -l0 turns off compression. -w turns off word containers. -s turns off space modeling (this hurts compression in version 1.0 but helps in 2.0). -c turns off word and number containers (independent of -w and -n; -n hurts compression). -b255 sets memory for the dictionary to 255 MB, the maximum. -m100 sets the memory buffer to 100 MB, which is below the maximum (255 MB) because larger values hurt compression. -e10000 sets the dictionary size to 10000 words. (The dictionary size can also be controlled with -f as in version 1.0, but using -e is less dependent on input size, so it helps with enwik8.) Additional tests showing the effects of -e, -m, and -o:

xml-wrt 2.0 options                ppmonstr J     enwik9
--------------------------------   ----------   -----------
-l0 -w -s -c -b255 -m100 -e10000 | -m800 -o8    154,223,582
-l0 -w -s -c -b255 -m100 -e8000  | -m800 -o8    154,234,621  (smaller -e)
-l0 -w -s -c -b255 -m100 -e12000 | -m800 -o8    154,239,769  (larger -e)
-l0 -w -s -c -b255 -m50  -e10000 | -m800 -o8    154,259,117  (smaller -m)
-l0 -w -s -c -b255 -m100 -e10000 | -m800 -o7    154,322,272  (smaller -o)
-l0 -w -s -c -b255 -m150 -e10000 | -m800 -o8    154,426,554  (larger -m)
-l0 -w -s -c -b255 -m100 -e10000 | -m800 -o9    154,445,811  (larger -o)

The optimal values of -w -c -s -n (turn off number containers) and -t (turn off try shorter words) were determined on enwik7 and enwik8 but not tested on enwik9.

A bug fix for LZMA compression, released June 19, 2006, does not change any values for the June 14, 2006 version (using the -l0 option). However, the compressed source code increases from 25,290 bytes to 25,354 bytes. The June 14 version is no longer published. The URL is unchanged.

xml-wrt 3.0 (Sept. 14, 2006) option -3 means to optimize the default settings for PPM compressors. Version 3.0 also has a FastPAQ8 compressor for standalone compression which was tested separately.

xwrt 3.2 (see below) with ppmonstr J has the following results.

xwrt 3.2 options        ppmonstr J opt    enwik8      enwik9        program size      total        Comp    Decomp   Mem
----------------------  --------------  ----------  -----------  -----------------  -----------  --------  ------- ----
-2 -b255 -m255 -s -f64   -o10 -m1650    18,456,706  148,915,761  52,569s + 26,835x  148,995,165  475+2512  43+2503 1650
-2 -b255 -m255 -s -f64   -o64 -m1650    18,397,126                                               210+2810  50+2884 1527

ppmonstr option -o64 is optimal for enwik8, but -o10 is optimal for enwik9. -m1650 selects 1650 MB memory. xwrt option -2 optimizes for PPM. -b255 selects buffer size 255 MB for building the dictionary. -m255 selects 255 MB memory buffer. -s turns off space modeling. -f64 sets minimum word frequency for the dictionary to 64. Program size and times are xwrt + ppmonstr. Memory usage is 512 MB for xwrt, 1650 MB for ppmonstr.

.1512 xwrt

xml-wrt 2.0 is a free command line file compressor with source available, by Przemyslaw Skibinski, June 19, 2006. It uses LZMA (LZ77 + arithmetic coding) with preprocessing for modeling text, XML tags, dates, and numbers. It may also be used as a preprocessor for input to other compressors. Version 1.0 was strictly a preprocessor without built-in compression.

The -l6 option selects maximum LZMA compression. -b255 selects maximum buffer size of 255 MB for building a dynamic dictionary. -m255 selects maximum memory. -s turns off spaces modeling. -f8 sets the minimum word frequency for dictionary inclusion to 8 (default is 6).

xml-wrt 3.0 (Sept. 14, 2006) includes a stripped-down version of PAQ8 (-l11 option) in addition to LZMA compression.

xwrt 3.2 (Oct. 29, 2007) is a dictionary preprocessor frontend to LZMA, PPMVC and lpaq6 as well as a standalone preprocessor. Option -l14 selects lpaq6 option 9 (1542 MB). -b255 selects 255 MB memory (maximum) for building the dictionary. -m96 selects 96 MB buffer during compression. (Higher values cause an out of memory error.) -s turns off space modeling. -e40000 limits the dictionary size to 40000 words. -f200 limits the dictionary to words that occur at least 200 times.

                Compression                      Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg Note
-------           -------                     ----------  -----------  -----------  -----------  ----- -----  --- --- ----
xml-wrt 2.0  -l6 -b255 -m255 -s -f8           23,199,202  196,914,328     25,354 s  196,939,682    905    70  525 LZ77
xml-wrt 3.0  -l11 -b255 -m255 -f24            19,663,305  165,274,422     40,447 s  165,314,869   4398  4317  416 CM
xwrt 3.2     -l14 -b255 -m96 -s -e40000 -f200 18,679,742  151,171,364     52,569 s  151,223,933   2537  2328 1691 CM

.1529 nanozip

nanozip 0.01a is a free, experimental, closed source GUI and command line archiver by Sami Runsas, July 14, 2008. For these tests, the command line version (smaller executable) was used. It compresses using several algorithms (fastest to best): LZP (options -cf and -cF), LZ77 (-cd, -cD), BWT (-co, -cO, uses 5N block size) and CM (-cc). The uppercase options (-cF, -cD, -cO) compress better but slower than the corresponding lowercase options and may use more memory. The default compression mode is -co (fast BWT). -m1500m selects 1500 MB memory, although the reported memory usage may differ, and the actual memory usage (Cmem, Dmem, in MB) measured with Task Manager is usually lower than reported. The program will use less memory depending on available physical memory when run. -forcemem was used to override this. For all tests, -nm was used to turn off checksums and not store timestamps or file permissions. For -cO, the program uses an LZ77 variant (called LZT) instead of BWT for binary files. -txt is an optimization for text files with -co or -cO.

Program     Options            enwik8      enwik9     zip size      Total     Comp  Deco  Cmem Dmem (reported) Alg
--------  -----------        ----------  -----------  ---------  -----------  ----  ----  ---- ---- ---- ----  ---
nz 0.01a  -cf                46,381,713                                         24    24    96       404  404  LZP
          -cf -m1500m        46,381,713  417,351,980  266,797 x  417,618,777    26    31   975  978 1476 1476  LZP
          -cF                40,733,125                                         62    43   155       404  404  LZP
          -cF -m1500m        40,733,125  359,192,720             359,459,517    63    40  1040 1045 1476 1476  LZP
          -cd                33,241,150                                        127    28    89       422  402  LZ77
          -cd -m1500m        33,001,952  292,180,617             292,447,414   156    28   768  687 1546 1474  LZ77
          -cD                29,384,997                                        288    27   282       466  258  LZ77
          -cD -m1500m        29,253,158  258,513,190             258,779,987   323    31  1020  693 1314  994  LZ77
          -co                21,838,721                                        391   186   333       431  336  BWT
          -co -m1500m        20,503,629  176,470,974             176,737,771   448   221  1667 1160 1810 1294  BWT
          -co -m1500m -txt   20,503,629  170,711,387             170,978,184   336   234  1074 1120 1471 1463  BWT
          -cO                21,623,801                                        465   247   333       431  266  BWT
          -cO -m1500m        20,306,489  174,770,662             175,037,459   511   269  1378 1135 1810 1294  BWT
          -cO -m1500m -txt   20,306,489  169,092,652             169,359,449   393   280  1074 1274 1471 1463  BWT
          -cO -m1670m -txt   20,306,489  167,509,921             167,776,718   403   284  1170 1325 1633 1625  BWT
          -cc                18,994,349                                       2975  2910   360       436  435  CM
          -cc -m1500m        18,723,413  152,654,332             152,921,129  3147  3091  1556 1556 1524 1523  CM

.1563 WinRK

WinRK 3.0.3 is a commercial GUI archiver by Malcolm Taylor (Mar. 6, 2006). It is top ranked on some benchmarks. Unfortunately it is not available for free download (as of May 16, 2006). The "free trial" expires as soon as you install it. (Update, Sept. 11, 2006: versions 3.0.2 and 3.0.3 are no longer available for download. They appear to have been withdrawn last month). WinRK in PWCM mode (Paq Weighted Context Modeling) is based on the paq7/8 algorithm with text dictionary preprocessing and specialized models for wav, bmp, and exe files. Version 3.0.2 was based on the earlier paq6 algorithm, which uses adaptive linear model mixing rather than a neural network that mixes bitwise predictions from models in the logistic (log p/(1-p)) domain. The +td and -td options turn English dictionary preprocessing on or off respectively. 800MB selects the memory limit. When not specified, PWCM appears to allocate all available memory except for 8 MB.

                Compression                      Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Alg  Notes
-------           -------                     ----------  -----------  -----------  -----------  ----- -----  ---  -----
WinRK 3.03        PWCM (800MB +td)            18,612,453  156,291,924  3,017,362 x  159,309,286  68555        CM   10
WinRK 3.03        PWCM                        18,612,551  156,349,910  3,017,362 x  159,367,272 102973 ~90000 CM    9
WinRK 3.03        FPW1 (800MB +td)            19,035,564                                         24950             10
WinRK 3.03        PWCM (800MB -td)            19,060,620                                         88310        CM   10
WinRK 3.03        Efficient                   21,157,165                                          5380        PPM  10
WinRK 3.03        Normal (PPMd)               22,322,981                                           620        PPM  10
WinRK 3.03        PWCM (800MB +td)            18,612,453  156,291,924     99,665 xd 156,391,589  68555        CM   10
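
The difference between linear mixing and mixing in the logistic domain can be illustrated with a generic paq7/8 style mixer. This is only a sketch: WinRK is closed source, and the class name, learning rate, and initial weights below are assumptions. Each model's probability p is first stretched to log(p/(1-p)), the stretched values are combined by a weighted sum, and the sum is squashed back to a probability. The weights are then adjusted in proportion to the prediction error.

```python
import math

def stretch(p):
    """Map a probability to the logistic domain: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Inverse of stretch: map a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    """Single-layer mixer over n model predictions (hypothetical sketch)."""

    def __init__(self, n, lr=0.02, init_weight=0.3):
        self.w = [init_weight] * n   # one weight per input model
        self.lr = lr                 # learning rate (assumed value)
        self.x = [0.0] * n           # stretched inputs from the last mix()
        self.p = 0.5                 # last mixed prediction

    def mix(self, probs):
        """Combine the models' probabilities that the next bit is 1."""
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit):
        """Move the weights to reduce coding cost for the observed bit."""
        err = bit - self.p
        self.w = [w + self.lr * err * x for w, x in zip(self.w, self.x)]
```

Usage: call mix() with one probability per model before coding each bit, then update() with the bit actually seen. A model that predicts well gains weight and a model that predicts badly loses it.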

RK and RKC are predecessors of WinRK so I don't plan to test them.

.1570 ppmonstr, ppmd

ppmd and ppmonstr var. J are free command line file compressors by Dmitry Shkarin, Feb. 16, 2006. ppmonstr is a slower, experimental version of ppmd with better compression. Source code is available for ppmd but not ppmonstr. They both use PPMII (PPM with information inheritance). The -m256 option selects 256 MB memory (maximum for ppmd). The -o10 option selects PPM order 10. (Higher orders use up memory faster which hurts compression). When ppmd runs out of memory, it discards the model and starts over. The -r1 option (default in ppmonstr) tells ppmd to back up and partially rebuild the model before resuming compression. The default options for ppmd are -m10 -o4 -r0 which are designed for reasonably good compression with high speed and low memory usage (see table below).

              Compression          Compressed size      Decompressor  Total size   Time (ns/byte)
Program         Options           enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  
-------         -------         ----------  -----------  -----------  -----------  ----- -----  
ppmonstr J      -m1700 -o16     19,055,092  157,007,383     42,019 x  157,049,402   3574 ~3600
ppmonstr J      -m800 -o16      19,230,657  161,496,685     42,019 x  161,538,704   3783 ~3800
ppmd J          -m256 -o10 -r1  21,388,296  183,964,915     26,835 x  183,991,750    880   895
ppmd J          -m10 -o4 -r0    26,275,353  236,509,791     26,835 x  236,536,626    194   206

ppmd was updated to J1 on May 10, 2006 to fix a bug. Compression benchmarks are unchanged except the size of the compressor (11,099 bytes as zipped source code). ppmonstr is unchanged.

.1598 slim

slim 23d is a free, closed source command line archiver by Serge Voskoboynikov, Sept 21, 2004. It uses a PPMII core (ppmd/ppmonstr) by Dmitry Shkarin with filters for special file types including text. The -m700 option selects 700 MB of memory. (I found -m800 causes disk thrashing at 1 GB). The -o10 option selects order 10 PPM. (-o12 and -o16 caused slim to fail on enwik9, creating an empty archive and exiting after about 60% completion with 1 GB. Smaller files were OK. There was no error with 2 GB).

As with other PPM compressors (ppmd, ppmonstr), using a higher order improves compression but consumes memory faster. For enwik8, -o32 is optimal with 700MB available, but lower orders are better for enwik9.

              Compression          Compressed size      Decompressor  Total size   Time (ns/byte)
Program         Options           enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  
-------         -------         ----------  -----------  -----------  -----------  ----- -----  
slim23d         -m1700 -o12     19,077,276  159,772,839     69,453 x  159,842,292   5232 ~5400
slim23d         -m700 -o32      19,226,339  (failed)        69,453 x                6530  6770
slim23d         -m700 -o10      19,264,094  162,529,098     69,453 x  162,598,551   5175  5360

.1640 bbb

bbb ver. 1 is a free, open source (GPL) command line file compressor by Matt Mahoney, Aug. 31, 2006. It uses a memory efficient BWT allowing blocks up to 80% of available memory. The transformed data is compressed with an order 0 PAQ like model: the previous bits of the current byte are mapped first to a bit history, then through a 6 level probability correcting adaptive chain before bitwise arithmetic coding.

The m1000 command selects 1000 MB block size. Thus, enwik9 is suffix sorted in one block. This is accomplished by sorting 16 smaller blocks, writing the pointers to 4 GB of temporary files, and merging them. The inverse transform is done in memory without building a linked list. Rather, the next position is found by looking up the approximate location in an index of size n/16 and finding the exact location by linear search.
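
For reference, here is a minimal Python sketch of a textbook BWT and its inverse. It is not bbb's algorithm: it sorts all rotations naively, and the inverse builds a full next-row table, which is the structure bbb avoids by using a sparse index and a linear search. The function names and the use of a primary index (the row holding the original string) are assumptions for illustration.

```python
def bwt(data: bytes):
    """Return (last column, primary index) of the sorted rotations of data."""
    n = len(data)
    # Sort rotation start positions by the rotated string (naive, small inputs only).
    rot = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last = bytes(data[(i - 1) % n] for i in rot)
    return last, rot.index(0)

def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Rebuild the original string from the last column and primary index."""
    n = len(last)
    # A stable sort of positions by byte value pairs each row of the first
    # column with the row of the last column holding the same occurrence.
    nxt = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = nxt[row]
        out.append(last[row])
    return bytes(out)
```

The table nxt holds one entry per input byte. Replacing it with an index of size n/16 plus a linear search, as described above, trades time for memory.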

bbb.exe Win32 executable compiled with MinGW g++ 3.4.2 and UPX 1.24w.

  g++ -Wall -O2 -Os -march=pentiumpro -fomit-frame-pointer -s -o bbb.exe
  upx bbb.exe

bbb Linux executable, supplied by Phil Carmody (Aug. 31, 2006). Compiled with g++-4.1 -Wall -O2 -o bbb bbb.cpp; strip bbb

bbb has a faster mode for both compression and decompression that does a "normal" BWT using 5x blocksize in memory. Output format is the same for fast and slow mode for both compression and decompression. A file compressed in fast mode can be decompressed in slow mode on another computer with less memory, and vice versa. The mode has no effect on the compressed file contents.

Recommended usage for best compression: For files smaller than 20% of available memory, use fast mode and one block. For example, if you have 1 GB memory (800 MB available under Windows) and foo is 100 MB:

  bbb cfm100 foo foo.bbb  (c = compress, f = fast, m100 = 100 MB blocks)
  bbb df foo.bbb foo.out  (d = decompress, f = fast)
If the file is 20% to 80% of available memory, use one block in slow mode. If foo is 500 MB:
  bbb cm500 foo foo.bbb
  bbb d foo.bbb foo.out
If the file is over 80% of memory, use 80% of memory as the block size in slow mode. If foo is 1 GB:
  bbb cm640 foo foo.bbb
  bbb d foo.bbb foo.out
The model requires about an additional 6 MB that should be subtracted from available memory.
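
The rules above can be written as a small helper. This is a sketch, not part of bbb: the function name and the model_mb parameter are hypothetical, sizes are in whole MB, and it returns only the first argument to bbb for compression.

```python
import math

def bbb_compress_args(file_mb, avail_mb, model_mb=0):
    """Pick the bbb compress command (mode and block size) for a file.

    file_mb  -- input size in MB
    avail_mb -- memory available to bbb in MB
    model_mb -- memory to reserve for the model (the text suggests about 6)
    """
    usable = avail_mb - model_mb
    one_block = max(1, math.ceil(file_mb))   # block large enough for the whole file
    if file_mb * 5 < usable:                 # under 20% of memory: fast mode, one block
        return f"cfm{one_block}"
    if file_mb * 5 <= usable * 4:            # 20% to 80% of memory: slow mode, one block
        return f"cm{one_block}"
    return f"cm{usable * 4 // 5}"            # over 80%: slow mode, block = 80% of memory
```

With 800 MB available and model_mb left at 0, this gives cfm100, cm500, and cm640 for the three examples above.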

bbb results by block size are shown below. Gain is the compression improvement obtained by using a larger block size. Gain(blocksize) is defined as C(blocksize/10)/C(blocksize) - 1 where C(x) means the compressed size of enwik9 with block size x. Compression times are fast modes for block sizes 10 through 10^8 and slow mode for 10^9 on a 2.2 GHz Athlon-64 with 2 GB memory under WinXP Home SP2.

Block   enwik8      enwik9     Gain  Comp ns/b
----  ----------  -----------  ----  ----
10^1  66,414,034  646,449,572        4359
10^2  56,241,619  542,912,447  .191  2169
10^3  45,500,201  435,597,745  .246  1907
10^4  37,006,646  343,663,203  .267  1802
10^5  30,946,413  275,172,983  .249  1838
10^6  26,661,555  233,555,297  .178  2095
10^7  23,460,457  204,355,672  .142  2499
10^8  20,847,290  182,162,626  .122  3106
10^9  20,847,290  164,032,650  .110  4524
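
As a check on the definition, the Gain column can be recomputed from the enwik9 column. A short Python sketch follows. The dictionary and function names are hypothetical, and the keys are the exponent k of the block size 10^k. The recomputed values agree with the table to within about 0.001.

```python
# Compressed size of enwik9 by block size 10^k, copied from the table above.
ENWIK9_SIZE = {
    1: 646_449_572,
    2: 542_912_447,
    3: 435_597_745,
    4: 343_663_203,
    5: 275_172_983,
    6: 233_555_297,
    7: 204_355_672,
    8: 182_162_626,
    9: 164_032_650,
}

def gain(k, sizes=ENWIK9_SIZE):
    """Gain(10^k) = C(10^(k-1)) / C(10^k) - 1."""
    return sizes[k - 1] / sizes[k] - 1

if __name__ == "__main__":
    for k in range(2, 10):
        print(f"10^{k}  {gain(k):.3f}")
```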

.1651 paq9a

paq9a is a free, open source, command line archiver by Matt Mahoney, Dec. 31, 2007. It is a context mixing compressor with an LZP preprocessor to improve speed for highly redundant files. Matches to a context length of 12 or more are coded as 1 bit, and literals as 9 bits. Context mixing differs from paq8 in that it uses a chain of 2-input mixers rather than one mixer with many inputs. It mixes sparse order-1 contexts with gaps of 3, 2, 1, 0, then orders 2 through 6, then text word orders 0 and 1. Option -9 selects maximum memory.
        Compression     Compressed size      Decompressor  Total size   Time (ns/byte)
Program  Options      enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg
-------  -------    ----------  -----------  -----------  -----------  ----- -----  --- ---
paq9a    -9         19,974,112  165,193,368     13,749 s  165,207,117   3997  4021 1585 CM
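
The idea behind LZP preprocessing can be sketched in a few lines of Python. This is a simplified illustration, not paq9a's code: it predicts one byte at a time from a fixed-length context held in a dictionary, whereas paq9a requires a match of length 12 or more. The function names and the default order of 3 are assumptions. Each byte becomes either a match flag alone or a miss flag plus a literal, so before any further modeling a match costs 1 bit and a literal 9 bits.

```python
def lzp_encode(data: bytes, order: int = 3):
    """Return a list of (flag, literal) tokens; flag 1 means 'predicted'."""
    table = {}    # context -> byte that last followed it
    tokens = []
    for i, b in enumerate(data):
        ctx = data[i - order:i] if i >= order else None
        if ctx is not None and table.get(ctx) == b:
            tokens.append((1, None))      # prediction correct: flag only
        else:
            tokens.append((0, b))         # miss: flag plus literal byte
        if ctx is not None:
            table[ctx] = b
    return tokens

def lzp_decode(tokens, order: int = 3) -> bytes:
    """Invert lzp_encode by rebuilding the same prediction table."""
    table = {}
    out = bytearray()
    for flag, literal in tokens:
        ctx = bytes(out[-order:]) if len(out) >= order else None
        b = table[ctx] if flag else literal
        out.append(b)
        if ctx is not None:
            table[ctx] = b
    return bytes(out)

def raw_cost_bits(tokens) -> int:
    """Cost with 1 bit per match and 9 bits per literal."""
    return sum(1 if flag else 9 for flag, _ in tokens)
```

Highly redundant input turns into long runs of match flags, which is why the preprocessor improves speed: the slower context mixing model has far fewer symbols to code.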

.1662 uda

uda 0.300 is a free, experimental file compressor by dwing, July 16, 2006. It is a modification of PAQ8H with optimizations for speed. It takes no options. The decompressor size is for uda.exe, since this is smaller than the corresponding zip file.

.1664 nanozipltcb

nanozipltcb is a free file compressor by Sami Runsas, July 25, 2008. It uses BWT. It takes no options. It is a customized version of nanozip, similar to -cO -txt -m1700m, but tuned to this benchmark. Files compressed with nanozipltcb are not compatible with nanozip.

             Compression            Compressed size      Decompressor  Total size   Time (ns/byte)
Program        Options             enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg
-------       ----------------   ----------  -----------  -----------  -----------  ----- -----  --- ---
nanozip 0.01  -cO -m1670m -txt   20,306,489  167,509,921    266,797 x   167,776,718   403   284 1325 BWT
nanozipltcb                      20,494,670  166,251,135    239,124 x   166,490,259   348   185 1729 BWT

.1727 cmm4

cmm1 is a free, open source (GPL) file compressor by Christopher Mattern, Sept. 18, 2007. It uses context mixing with LZP preprocessing.

cmm2 was released Dec. 10, 2007 without source code.

cmm2 080113 was released Jan. 13, 2008 without source code.

cmm3 080207 (test release) was released Feb. 7, 2008 without source code.

cmm4 v0.0 (test release) was released Mar. 14, 2008 without source code.

cmm4 v0.1e was released Apr. 20, 2008 without source code. It takes a 2 digit option "wm" (e.g. 96 meaning w=9, m=6). Memory usage is 2^w MB for a sliding window, and 12*2^m MB for a context mixing model (order 1,2,3,4,6). On my machine m=7 caused disk thrashing.
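
A small Python sketch of the memory arithmetic follows. It reads w and m as exponents (2 to the power w MB for the window and 12 times 2 to the power m MB for the model), which is an assumption on my part, although it is consistent with the 1321 MB measured for option 96 in the table below. The function name is hypothetical.

```python
def cmm4_memory_mb(option: str):
    """Return (window MB, model MB, total MB) for a 2 digit cmm4 option "wm"."""
    if len(option) != 2 or not option.isdigit():
        raise ValueError("option must be two digits, e.g. '96'")
    w, m = int(option[0]), int(option[1])
    window = 2 ** w        # sliding window
    model = 12 * 2 ** m    # context mixing model
    return window, model, window + model
```

For option 96 this gives 512 MB + 768 MB = 1280 MB. For m=7 the model alone would need 1536 MB.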

Description by the author: CMM4 0.1e is a variable order context mixing coder, it predicts using the four "highest" (ranking: 643210) models in each bit coding step and, in addition, the match model input. Orders 0 and 1 are implemented using a table lookup, all higher orders use nibble based hashing. Matches are found using order 4 and 6 LZP, the pointers and a quick exclusion hash are stored within the model's hashing tables. The mixer joins the 4 (or 5 in presence of a match model) predictions and outputs them to a SSE stage. A mixer (similar to (L)PAQ) is selected based on the last byte's 4 MSBs and on the coding order. The SSE context is made of an order 0 context and a quantized combination of the previous symbol rank, the match length and partially matched symbol. This results in a notable compression increase on redundant data. The model's counters are quantized using the PAQ's state machine since CMM4 (will be replaced). Despite the use of hashing most data structures are tuned to never cross a cache line per nibble (the models) or octet (the mixer) (only SSE does). The core compression performance is equivalent to LPAQ1/2, while being faster. In addition there's a filter framework, which currently implements an x86 transform and will be extended.

Compression           Compressed size      Decompressor  Total size   Time (ns/byte)
Program      Opt     enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg
-------      ---   ----------  -----------  -----------  -----------  ----- -----  --- ---
cmm1               23,495,627  207,266,867     18,785 x  207,285,652   1165  1198   50 CM
cmm2               23,477,008  208,268,161     17,901 x  208,286,062   1756  1849   32 CM
cmm2 080113        22,303,128  191,477,052     18,263 x  191,495,315   2180  2127  329 CM
cmm3 080207        21,212,766  179,633,451     18,700 x  179,652,151   2328 ~2609  395 CM
cmm4 v0.0          21,459,665  186,395,591     18,042 x  186,413,633   1807  1849  116 CM
cmm4 v0.1e   96    20,569,034  172,669,955     31,314 x  172,701,269   2052  2056 1321 CM

.1741 ccm

ccm 1.03a is one of 3 versions of a free file compressor by Christian Martelock, Feb. 11, 2007. It uses context mixing. The 3 versions are ccm (fastest, uses 17 MB memory), ccm_high (slower but better compression), and ccm_extra (best compression, uses 100 MB memory). The programs take no options.

ccm 1.1.1a (Feb. 23, 2007) has only one version.

ccm 1.1.2a (Mar. 2, 2007) includes a ccm_low version using less memory, which was not tested.

ccm 1.20a (Mar. 21, 2007) has only one version.

ccm 1.20d (Apr. 8, 2007) has two versions: ccm using 99MB memory and ccmx using 210 MB for better compression. Only ccmx was tested.

ccm 1.21 (mirror) (Apr. 22, 2007) includes an option to select memory usage. 7 selects maximum memory, 1300 MB. Only the high compression version (ccmx) was tested.

ccm 1.30 (mirror) was released Jan. 7, 2008. Only ccmx 7 (high compression version, maximum memory) was tested.

Compression           Compressed size      Decompressor  Total size   Time (ns/byte)
Program              enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg
-------            ----------  -----------  -----------  -----------  ----- -----  --- ---
ccm       1.0.3a   27,667,346  240,296,736      7,217 x  240,303,953    676   679   17 CM
ccm_high  1.0.3a   25,412,726  221,177,776      7,229 x  221,185,005   1119  1171   17 CM
ccm_extra 1.0.3a   24,027,805  207,273,926      7,230 x  207,281,156   1341  1353  100 CM
ccm       1.1.1a   22,824,629  197,271,467      9,019 x  197,280,486   1247  1252   82 CM
ccm       1.1.2a   22,675,768  195,965,427      8,502 x  195,973,929   1161  1183   83 CM
ccm       1.20a    21,350,295  182,784,655     13,346 x  182,798,001   1794  1801  210 CM
ccmx      1.20d    21,310,303  182,379,461     13,468 x  182,392,929   1383  1485  210 CM
ccmx 7    1.21     20,819,656  174,161,536     21,139 x  174,182,675   1521  1493 1324 CM
ccmx 7    1.30     20,857,925  174,142,092     15,014 x  174,157,106   1313  1338 1332 CM

.1749 epmopt | epm

epmopt + epm r9 is an experimental, closed source command line optimizer and file compressor by Serge Osnach, Oct. 16, 2003. It was intended for enc r16, but development on that project has stopped at enc r15, according to the web page (in Russian). The program has two parts: epm, a PPM compressor with text preprocessing, and epmopt, which attempts to optimize the parameters to epm by compressing repeatedly and varying the options one at a time until there is no more improvement. The input to epmopt may differ from the input to epm, and epmopt supports optimization on sets of files matching patterns in specified sets of directories. The options to epm are memory limit, PPM order, and 20 undocumented options each specified by a single digit. The exact same options must be passed to the decompressor. In the results, I added 27 bytes to the compressed file sizes to account for this information. enwik9 was compressed and decompressed as follows:

  epmopt -m800 -n20 --fixedorder:12 enwik6 .
  epm c01286014321245957352513 enwik9 enwik9.epm -m800
  epm d01286014321245957352513 enwik9.epm enwik9.tmp -m800
The optimization data was enwik6, the first 10^6 bytes of the input file. epmopt compressed this about 100 times in 368 seconds with different options, making 35 passes through the list of 20 undocumented parameters, adjusting each one up or down one at a time. The fixed parameters were -m800 (800 MB memory limit) and PPM order 12 (--fixedorder:12, also the first 3 digits of the parameter string). Allowing epmopt to set the PPM order on a smaller training file will cause it to choose too large a value, hurting compression. I only tested orders 10, 12, and 20 on enwik8, and 12 gave the best compression. The -n20 option tells epmopt to tune all 20 parameters. The parameter string is written to the file enc.ini. The -m800 option need not be the same for epmopt and epm but must be the same for epm during compression and decompression.
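
The one-at-a-time search described above can be sketched in Python as follows. This is not epmopt's code, which is closed source. The function name, the digit range 0 to 9, and the pass limit are assumptions. In real use, cost() would run the compressor on the training file with the given parameters and return the compressed size.

```python
def tune_digits(cost, params, max_passes=100):
    """Coordinate search over single digit parameters.

    cost   -- function taking a list of digits and returning a size to minimize
    params -- starting list of digits (each 0..9)
    Returns (best parameters, best cost).
    """
    params = list(params)
    best = cost(params)
    for _ in range(max_passes):
        improved = False
        for i in range(len(params)):
            for delta in (1, -1):            # try one step up, then one step down
                trial = params.copy()
                trial[i] += delta
                if not 0 <= trial[i] <= 9:
                    continue
                c = cost(trial)
                if c < best:                 # keep the change only if it helps
                    best, params, improved = c, trial, True
        if not improved:                     # a full pass with no gain: stop
            break
    return params, best
```

A search like this finds a local optimum only, and the result depends on the training data, which is why the PPM order was fixed by hand rather than tuned on the small file.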

Warning: epm failed to decompress correctly on enwik7 (first 10^7 bytes). In the output, some linefeeds were changed to spaces. This happened with all parameter combinations I tested including defaults: epm c enwik7 enwik7.epm. Decompression was bit-exact for enwik5, enwik6, enwik8 and enwik9.

.1749 WinUDA

WinUDA 0.291 is a free, closed source GUI archiver by dwing, July 4, 2005. It uses context mixing and is derived from paq6. Mode 3 is the slowest (about 3x slower than mode 0) and uses the most memory, 194 MB.

.1755 dark

dark v0.51 is a free, closed source archiver by Malyshev Dmitry Alexandrovich, Jan. 2, 2007. It uses BWT + distance coding without preprocessors. The -b333m option selects 333 MB blocks. -f (-f0 in 0.40 and 0.46, not supported in 0.32) forces no segmentation. Memory usage is 5 times the block size for compression (6x prior to v0.46).

opendark ver. A is an open source version of dark. The supplied Windows dark.exe crashed when decompressing enwik9 (size is 177,675,818). opendark does not support the -f option.

                             Compression      Compressed size      Decompressor  Total size   Time (ns/byte)
Program                        Options       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg 
-------                        -------     ----------  -----------  -----------  -----------  ----- -----  --- --- 
dark 0.32b  July  9, 2006      -b128m      21,414,479  185,844,554     31,076 x  185,875,590    481   407  790 BWT
dark 0.40b  Aug. 14, 2006      -b128mf0    21,243,259  184,271,115     34,688 x  184,305,803    471   316  790 BWT
dark 0.46   Aug. 23, 2006      -b160mf0    21,231,325  181,904,374     40,780 x  181,945,154    488   404  813 BWT
                               -b333mf0    21,231,325  175,955,412     40,780 x  175,996,192    432   425 1692 BWT
opendark A  Nov. 14, 2006      -b333m      21,432,727    (fails)       10,089 s                 450   390 1692 BWT
dark 0.51   Jan.  2, 2007      -b333mf     21,169,819  175,471,417     34,797 x  175,506,214    533   453 1692 BWT

.1760 FreeArc

FreeArc 0.36 is a free, open source archiver by Bulat Ziganshin, Feb. 21, 2007. It incorporates 7 compression libraries - PPMd, GRZipII, LZMA (7zip), plus BCJ (7zip), REP (rzip-like), dynamic dictionary and LZP preprocessors. The option -m9 selects maximum compression (dict + LZP + PPMd for text files, REP+LZMA for binary). -lc1600000000 limits memory to 1.6 GB (same as -lc1600m). There is an option to use ppmonstr as an external compressor, which was not included in the test.

FreeArc 0.40 pre-4 is a free, open source archiver by Bulat Ziganshin, Dec. 16, 2007. It compresses using ppmd, GRZipII, and LZMA along with multimedia filters, a dictionary preprocessor and a REP preprocessor for removing repeating strings. It has Windows and Linux versions and an optional GUI.

ppmd generally gives the best compression for text. It will also call ppmonstr as an external program, but this mode was not tested, even though it compresses better.

For this test, the Windows command line version was tested. The option -mppmd:1012m:o13:r1 is equivalent to ppmd -m1012 -o13 -r1, selecting 1012 MB memory, order 13, and partial reinitialization of the model when memory is exhausted. Note that ppmd normally allows only up to -m256. This program was tested with 2 GB memory but values higher than -m1012 caused the program to crash during compression.

                             Compression      Compressed size      Decompressor  Total size   Time (ns/byte)
Program                        Options       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg 
-------                        -------     ----------  -----------  -----------  -----------  ----- -----  --- --- 
FreeArc 0.36        -m9 -lc1600000000      21,153,231  184,498,111    372,457 s  184,870,568    665   517 1600 PPM
FreeArc 0.40 pre-4  -mppmd:1012m:o13:r1    20,931,605  175,254,732    748,202 x  176,002,934   1175  1216 1046 PPM

.1778 hook

hook v0.2 is a free, open source (GPL) command line file compressor by Nania Francesco Antonio, Jan. 8, 2007. It uses DMC: a state machine in which each state represents a bitwise context. Each state has 2 outgoing transitions corresponding to next bits 0 and 1, and a count n0 or n1 associated with each transition. Bit y (0 or 1) is compressed by arithmetic coding with probability ny/(n0+n1) (where ny is n0 or n1 according to y), and then ny is incremented.

After each input bit, the next state represents a context obtained by appending that bit on the right and possibly dropping bits on the left. States are cloned (copied) whenever the incoming and outgoing counts exceed certain limits. This has the effect of creating a new context in which no bits are dropped. In the example below, the state representing context 110 (dropping 2 bits from the previous context) is cloned by creating a new state 11110 because the incoming 0 transition count (ny for y=0) from state 1111 exceeded a limit. The new context is longer because it does not drop any bits. This transition is moved to point to the new state. Other incoming transitions (not shown) remain pointing to the original state. The outgoing transitions are copied. The counts of the original state are distributed to the new state in proportion to the moved transition's contribution to those counts, which is w = ny/(n0+n1).

                n0 ----> 1100           n0*(1-w) ----> 1100
         ny       /                             /     /
   1111 -----> 110               1111        110     /
        (y=0)     \                 |           \   /
                n1 ----> 1101       |   n1*(1-w) ----> 1101
                                    |             /    /
                                    |     n0*w   /    /
                                    | ny        /    /
                                    +----> 11110    /
                                                \  /
                                          n1*w   --

        Before cloning            After cloning 110 to 11110
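The cloning step pictured above can be sketched as follows. This is a minimal Python sketch, not hook's code: the Node class, the clone function, and the clamping of w to at most 1 are assumptions made for the example.

```python
class Node:
    """One DMC state: two outgoing counts and two outgoing transitions."""

    def __init__(self, context, n0=0.1, n1=0.1):
        self.context = context      # label only, used for readability
        self.n = [n0, n1]           # counts for next bit 0 and 1
        self.next = [None, None]    # successor states for next bit 0 and 1


def clone(source, y, new_context):
    """Move source's y-transition to a fresh copy of its current target.

    The target's counts are split between the copy and the original in
    proportion to w = ny / (n0 + n1), where ny is the count on the moved
    transition and n0, n1 are the target's outgoing counts.
    """
    target = source.next[y]
    ny = source.n[y]
    total = target.n[0] + target.n[1]
    w = min(1.0, ny / total)        # clamp is an assumption of this sketch

    new = Node(new_context, target.n[0] * w, target.n[1] * w)
    new.next = list(target.next)    # outgoing transitions are copied
    target.n[0] *= 1.0 - w          # the original keeps the remainder
    target.n[1] *= 1.0 - w
    source.next[y] = new            # only this one incoming edge moves
    return new
```

Note that the total count is conserved: whatever the new state receives is removed from the original, so other incoming transitions still see statistics that reflect only their own share.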

Normally, the initial set of contexts begins on byte boundaries. The cloning mechanism ensures that new contexts also have this property.

In hook v0.2, the counts are 32 bit floating point numbers initialized to 0.1. The initial state machine has 256*255 states representing bytewise order 1 contexts with uniform statistics. When memory is exhausted, the model is discarded and the state machine is reinitialized. A new state is cloned when ny > limit and n0+n1-ny > length, where limit and length are parameters. The optimal parameters for enwik8 and enwik9 are "c 7 2 6": c means compress, 7 selects the maximum of 1 GB memory (64M states at 16 bytes each; the minimum is 8 MB), 2 is the limit (range 1 to 7), and 6 selects a length of 32 (possible values are 1, 2, 3, 4, 8, 16, 32, 64). Larger lengths are better for large files because they conserve memory at the expense of compression.
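To make the figure of 256*255 states concrete, here is one way such an initial machine could be laid out, together with the cloning test. It is a sketch under stated assumptions: the heap-style node numbering, the most-significant-bit-first order, and the function names are choices made for this example, and hook's actual layout is not described here.

```python
NODES_PER_BYTE = 255  # internal nodes of a binary tree over 8 bits


def state_id(prev_byte, node):
    """Index of tree node `node` (1..255) in the tree for context prev_byte."""
    return prev_byte * NODES_PER_BYTE + (node - 1)


def build_order1_machine():
    """Return next_state[s][y] for a bytewise order 1 machine.

    Each of the 256 previous-byte contexts owns a binary tree of 255
    nodes (node 1 is the root, children of node i are 2i and 2i+1).
    After the 8th bit the machine jumps to the root of the tree that
    belongs to the byte just completed.
    """
    nxt = [[0, 0] for _ in range(256 * NODES_PER_BYTE)]
    for c in range(256):
        for node in range(1, 256):
            for y in (0, 1):
                child = 2 * node + y
                if child < 256:
                    nxt[state_id(c, node)][y] = state_id(c, child)
                else:
                    # child - 256 is the value of the completed byte
                    nxt[state_id(c, node)][y] = state_id(child - 256, 1)
    return nxt


def should_clone(ny, n0, n1, limit, length):
    """Cloning test as stated in the text: ny > limit and n0+n1-ny > length."""
    return ny > limit and (n0 + n1 - ny) > length


# Memory arithmetic from the text: 64M states at 16 bytes each is 1 GB.
BYTES_AT_MAX_SETTING = 64 * 2**20 * 16
```

Feeding the eight bits of a byte into the machine from any tree root ends at the root of that byte's tree, which is what keeps every context aligned to a byte boundary.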

hook v0.3 (Jan. 11, 2007) allows up to 1.8 GB memory (first option = 9) and uses double precision predictions in the 32 bit arithmetic coder.

hook v0.3a (Jan. 12, 2007) initializes the counts to 0.125 (instead of 0.1) and uses 24 bit precision in the arithmetic coder (instead of 32 bit).

hook v0.4 (Jan. 15, 2007) initializes counts to 0.1. Argument 2 selects length 3 (not 2).

hook v0.5b (Jan. 22, 2007) adds an LZP preprocessor. If the next byte to be coded is the same as the byte that occurred in the last matching 3 byte context, then this is indicated by coding a flag bit in an order 3 model (32 MB memory), and a match length coded by DMC with a fixed size of 128 MB. If there is no match, then the literal byte is coded by another variable sized DMC model. The parameters "c 1600000000 2 64 1 6" select compression (c), 1.6 GB for the DMC literal model (1600000000), a limit of 2 (minimum count for the cloned state), length of 64 (minimum remaining count for the state to be cloned), LZP selected (1), and a minimum match length of 6.
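The LZP idea can be sketched as follows. This is a simplified Python illustration, not hook's implementation: it emits a list of tokens instead of flag bits and lengths modeled by DMC, it uses a dictionary in place of a fixed-size hash table, and it records a context only at the start of each token. All names are hypothetical.

```python
def lzp_encode(data, min_len=6, order=3):
    """Turn bytes into ("match", length) and ("lit", byte) tokens.

    The last position that followed each `order`-byte context is
    remembered. If the bytes at that position predict at least
    `min_len` upcoming bytes, only the length is emitted.
    """
    table = {}   # context -> position that followed its last occurrence
    out = []
    i = 0
    while i < len(data):
        p = None
        if i >= order:
            ctx = bytes(data[i - order:i])
            p = table.get(ctx)
            table[ctx] = i
        length = 0
        if p is not None:
            while i + length < len(data) and data[p + length] == data[i + length]:
                length += 1
        if length >= min_len:
            out.append(("match", length))
            i += length
        else:
            out.append(("lit", data[i]))
            i += 1
    return out


def lzp_decode(tokens, order=3):
    """Invert lzp_encode by rebuilding the same table while decoding."""
    table = {}
    out = bytearray()
    for kind, value in tokens:
        i = len(out)
        p = None
        if i >= order:
            ctx = bytes(out[i - order:i])
            p = table.get(ctx)
            table[ctx] = i
        if kind == "match":
            for k in range(value):
                out.append(out[p + k])   # may overlap the bytes being written
        else:
            out.append(value)
    return bytes(out)
```

No match offset is ever transmitted: the decoder derives the predicted position from the same context table, which is what distinguishes LZP from LZ77.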

hook v0.6 (Feb. 7, 2007) removes the "length" parameter (effectively infinite). The arguments "c 1600 4 1 6" mean to compress (c), use 1600 MB memory, set the "limit" parameter to 4, turn on LZP preprocessing (1) with a minimum match length of 6. The "limit" parameter is the minimum count for an outbound DMC state transition to clone the state. Limit was tuned on enwik8.

hook v0.6b (Feb. 8, 2007) includes support for files up to 2^64 bytes (compiled by Ilia Muraviev; earlier versions were compiled with MinGW g++ 3.4.5 by Matt Mahoney). "limit" was tuned on both enwik8 and enwik9. Higher values conserve memory at the expense of compression on smaller files.

hook v0.6c (Feb. 14, 2007) stores the input filename in the compressed file and uses it during decompression.

hook v0.7 (Mar. 10, 2007) uses 375 MB more memory than advertised so it was tested with a lower option.

hook v0.7b (Mar. 12, 2007) reduces the excess memory to 94 MB.

hook v0.8 was released Mar. 17, 2007. Some additional results on enwik9 for different values of the "limit" parameter (the third argument), where higher values decrease the rate at which the state machine fills up and is flushed:

hook08 params    enwik9
------------  -----------
c 1700 1 1 6  183,175,857
c 1700 2 1 6  181,578,888
c 1700 3 1 6  181,220,553
c 1700 4 1 6  181,268,867
c 1700 5 1 6  181,197,310
c 1700 6 1 6  181,567,697
c 1700 7 1 6  181,813,763
c 1700 8 1 6  182,360,391

hook v0.8b (Mar. 18, 2007) has some LZP improvements.

hook v0.8c (Mar. 19, 2007) is a minor bug fix. Compressed sizes are 1 byte larger than v0.8b.

hook v0.8d was released Mar. 21, 2007.

hook v0.8e was released Mar. 27, 2007.

hook v0.9 (Apr. 6, 2007) is closed source. It requires a processor that supports SSE instructions. It has some speed improvements and an E8/E9 filter for improved compression of .exe files. Memory usage is the second argument + 60 MB.
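For readers unfamiliar with E8/E9 filtering, the general technique can be sketched as below. hook v0.9 is closed source, so this is not its filter: it is a generic Python illustration that assumes the common form in which the 32 bit relative operand of each x86 CALL (E8) or JMP (E9) opcode is replaced by an absolute address, here computed by adding the position of the opcode. Real filters differ in details such as range checks and the exact base offset.

```python
def e8e9_transform(data, encode=True):
    """Convert relative CALL/JMP operands to absolute ones, or back.

    Repeated calls to the same function then produce identical byte
    strings, which a compressor can model more easily.
    """
    buf = bytearray(data)
    i = 0
    while i + 5 <= len(buf):
        if buf[i] in (0xE8, 0xE9):
            value = int.from_bytes(buf[i + 1:i + 5], "little")
            if encode:
                value = (value + i) & 0xFFFFFFFF   # relative -> absolute
            else:
                value = (value - i) & 0xFFFFFFFF   # absolute -> relative
            buf[i + 1:i + 5] = value.to_bytes(4, "little")
            i += 5   # skip the operand so both directions scan identically
        else:
            i += 1
    return bytes(buf)
```

The opcode bytes themselves are never modified and the operands are skipped during scanning, so the decoder visits exactly the same positions as the encoder and the transform is reversible.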

freehook 0.2 is an open source port of hook v0.8e from C++ to C by Eugene Ortmann, Apr. 7, 2007. The supplied .exe file requires SSE instructions (Pentium 3 or higher), but the source can be recompiled for other processors.

hook v0.9b (Apr. 10, 2007) replaces floating point arithmetic with integer arithmetic, so that archives are compatible across different processors. Note: I reduced the memory setting from 1800 to 1700 to prevent disk thrashing, which was a problem in earlier tests. I will do this from now on. This hurts enwik9 compression (but not enwik8) slightly, from 180,444,546 to 180,582,601. Actual memory usage is 60 MB over.

freehook 0.3 (Apr. 10, 2007) has only very minor changes from 0.2 but is slightly faster due to different g++ compiler options. Compression is the same as 0.2. Memory usage is about 160 MB over.

hook v0.9c (May 8, 2007) has some speed improvements in the arithmetic coder. It compresses the same size as v0.9b.

hook v1.0 (Sept. 20, 2007) is closed source. The only option is memory size in MB.

The zip file linked above contains all versions (C++ source and Win32 .exe).

hook 1.1 (Nov. 13, 2007) improves BMP and WAV compression.

hook 1.3 was released Dec. 14, 2007, modified Dec. 15, 2007.

            Compression                 Compressed size      Decompressor  Total size   Time (ns/byte)
Program       Options                  enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg
-------       -------                ----------  -----------  -----------  -----------  ----- -----  --- ---
hook v0.2     c 7 2 6                23,628,061  208,211,084      2,556 s  208,213,640    772   779 1052 DMC
hook v0.3     c 9 2 6                23,548,017  202,024,740      3,567 s  202,028,307    849   864 1764 DMC
hook v0.3a    c 9 2 6                23,499,700  201,934,976      3,555 s  201,938,531    862   832 1764 DMC
hook v0.4     c 9 2 6                23,349,695  199,829,234      4,112 s  199,833,346    934   959 1764 DMC
hook v0.5b    c 1600000000 2 64 1 6  22,806,402  193,227,085      5,113 s  193,232,198   1084  1029 1764 LZP+DMC
hook v0.6     c 1600 4 1 6           22,472,884  191,733,561      5,112 s  191,738,673   1146  1034 1600 LZP+DMC
hook v0.6b    c 1600 4 1 6           22,535,069  189,932,778      5,174 s  189,937,952   1040       1600 LZP+DMC
              c 1600 6 1 6           22,776,927  188,384,238      5,174 s  188,389,412   1090  1026 1600
hook v0.6c    c 1600 6 1 6           22,561,621  188,081,694      5,878 s  188,087,572   1131  1092 1600 LZP+DMC
hook v0.7     c 1000 6 1 6           22,410,669  191,516,313      6,195 s  191,522,508   1360  1353 1375 LZP+DMC
hook v0.7b    c 1700 6 1 6           22,404,817  184,765,030      6,195 s  184,771,225   1516  1655 1794 LZP+DMC
hook v0.8     c 1700 5 1 6           22,290,033  181,197,310      6,686 s  181,203,996   1110  1118 1700 LZP+DMC
hook v0.8b    c 1700 5 1 6           22,399,354  180,335,788      6,944 s  180,342,732    988  1033 1700 LZP+DMC
hook v0.8c    c 1700 5 1 6           22,399,355  180,335,789      7,071 s  180,342,860   1043  1005 1700 LZP+DMC
hook v0.8d    c 1700 5 1 6           22,399,027  180,319,203      7,037 s  180,326,240    928   915 1700 LZP+DMC
hook v0.8e    c 1700 3 1 6           22,039,935  178,140,788      7,263 s  178,148,051    952  1009 1700 LZP+DMC
hook v0.9     c 1800 2 1 6           21,969,342  178,932,435     10,069 x  178,942,504    869       1860 LZP+DMC
              c 1800 3 1 6           22,077,883  178,599,478     10,069 x  178,609,547    833   916 1860 LZP+DMC
freehook 0.2  c 1700 3 1 6           22,039,914  178,141,036      7,386 s  178,148,422    813   855 1860 LZP+DMC
hook v0.9b    c 1700 3 1 6           22,496,910  180,582,601      9,278 x  180,591,879    810   810 1721 LZP+DMC
freehook 0.3  c 1600 3 1 6           22,039,914  178,619,149      7,352 s  178,626,501    789   818 1713 LZP+DMC
hook v0.9c    c 1700 3 1 6           22,496,910  180,582,601      8,506 x  180,591,107    774   791 1721 LZP+DMC
hook v1.0     c 1700                 22,122,484  177,843,658     11,163 x  177,854,821    865   879 1739 LZP+DMC
hook v1.1     c 1700                 22,122,484  177,843,658     25,854 x  177,869,512    877   872 1739 LZP+DMC
hook v1.3     c 1700                 22,030,108  178,216,980     13,870 x  178,230,850    825   835 1736 LZP+DMC

.1789 7zip

7zip 4.42 is an open source GUI and command line archiver by Igor Pavlov, May 14, 2006. It compresses to 7z, zip, gzip, bzip2 and tar format, optionally encrypts with AES, and will uncompress several other formats.

7z is the default format. It uses LZMA compression, a variation of LZ77. The option -mx=9 selects ultra (maximum) compression in this mode. The option -sfx7zCon.sfx creates a console-based self extracting executable by prepending a 131,584 byte decompressor. This is slightly smaller than the Windows GUI version (132,096 bytes) and much smaller than the decompression program itself as a zipped self extracting download (817,795 bytes). The best compression is with ppmd. The options -m0=ppmd:mem=768m:o=10 are equivalent to ppmd var. H (with minor changes), order 10, with 768 MB memory.

The following include the best known option combinations for 7zip on enwik8 in ppmd (PPM), 7z (LZMA), bzip2 (BWT) and zip (LZ77) formats.

                Compression                         Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options                          enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Alg  Notes
-------           -------                        ----------  -----------  -----------  -----------  ----- -----  ---  -----
7zip 4.42 -m0=ppmd:mem=768m:o=10 -sfx7zCon.sfx   21,375,060  185,043,783          0 xd 185,043,783    505  ~500  PPM
7zip 4.42 -m0=ppmd:mem=293m:o=7                  21,791,628                                           647   655  PPM   6
7zip 4.42 -mx=9 -sfx7zCon.sfx                    24,996,113  213,490,979          0 xd 213,490,979   2286    63  LZMA
7zip 4.42 -tbzip2 -mpass=2                       29,003,844                                          1974   176  BWT   6
7zip 4.42 -tzip -mm=deflate64 -mfb=153 -mpass=8  33,727,442                                          2803    28  LZ77  6
7zip 4.42 -tzip -mm=deflate -mfb=171 -mpass=8    35,056,389                                          2672    27  LZ77  6
7zip 4.42 -tzip -mm=deflate -mfb=258 -mpass=8    35,057,040                                          2664    29  LZ77  6
7zip 4.42 Zip/Ultra (in GUI)                     35,057,347                                          4307        LZ77  1
7zip 4.46a -m0=ppmd:mem=1630m:o=10 -sfx7zCon.sfx 21,197,559  178,965,454          0 xd 178,965,454    503   546  PPM

7zip 4.46a was announced May 21, 2007. (The improved compression is due to testing with more memory.)

.1789 M99

M99 (mirror) is a free file compressor by Michael Maniscalco, originally written in 1999 and ported to Windows on Mar. 27, 2007. It uses BWT, based on MSufSort 3.1. M99 is a predecessor to M03. Command line is:

M99.exe e|d -switches blocksize input output 

switches are:
-r = post BWT run length encoding
-a = arithmetic coding instead of M99 style bit packing
-f = fast mode
-m = max compression mode (implies -a).

Blocksize can be specified in bytes (like 10000) or in KB, MB, etc. with a suffix (like 100k or 100m). The memory requirement for compression is at most 6 times the blocksize, although in most cases only a little over 5 times the blocksize is used. Blocksize 239m divides enwik9 into 4 approximately equal parts and requires about 1500 MB memory.
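The block count and memory figures for 239m can be checked with a little arithmetic. The Python sketch below is only a worked example: the helper names are hypothetical, and it assumes that the k and m suffixes mean 2^10 and 2^20 bytes, which the M99 documentation quoted here does not state. That reading is at least consistent with the text, since a decimal megabyte would give 5 blocks rather than 4.

```python
import math


def parse_blocksize(text):
    """Parse '10000', '100k' or '239m' into bytes (binary units assumed)."""
    units = {"k": 2**10, "m": 2**20}
    suffix = text[-1].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def plan(file_size, blocksize_text, factor=6):
    """Return (number of blocks, worst-case memory in bytes).

    factor=6 is the stated maximum of 6 times the blocksize.
    """
    block = parse_blocksize(blocksize_text)
    blocks = math.ceil(file_size / block)
    return blocks, factor * block


ENWIK9 = 10**9
blocks, memory = plan(ENWIK9, "239m")
```

Under these assumptions enwik9 splits into 4 blocks and the worst-case memory is 1,503,657,984 bytes, in line with the roughly 1500 MB quoted above.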

Version 2.1 was released Apr. 19, 2007.

M99 2.2.1, released July 18, 2008, has an optimization to compress the contents of TAR files separately. For other files, it increases the size by 1 byte.

                Compression        Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options        enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg
-------           -------      ----------  -----------  -----------  -----------  ----- -----  --- ---
M99               e -m 239m    21,431,211  180,477,144     67,697 x  180,544,841    674   496 1500 BWT
M99 v2.1          e -m 239m    21,251,170  178,910,174     68,052 x  178,978,226    713   535 1500 BWT
M99 v2.2.1        e -m 239m    21,251,171  178,910,175     72,245 x  178,982,420    704   520 1500 BWT

.1803 pimple2

pimple 1.43 beta is a free, closed source GUI archiver by Ilia Muraviev, Apr. 24, 2006. It uses context mixing.

pimple2 is a command line file compressor, June 11, 2007.

                Compression                      Compressed size      Decompressor  Total size   Time (ns/byte)
Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg Note
-------           -------                     ----------  -----------  -----------  -----------  ----- -----  --- --- ----
pimple 1.43 beta  512MB, order 8, match 32    20,992,830  181,998,817    353,472 x  182,352,289   9638 10112  512 CM    3
pimple2           (none)                      20,871,457  180,251,530     78,642 x  180,330,172  18474 17992  128 CM          

.1807 ash

ash 04a is a free, experimental command line file compressor by Eugene D. Shelwien, Dec. 5, 2003. The /m700 option selects a 700 MB memory limit. (/m800 causes disk thrashing with 1 GB.) /o10 selects model order 10. This gives good results on smaller files when memory is constrained, but I did not try to o