/* PAQ1 - File archiver and compressor.
(C) 2002, Matt Mahoney, mmahoney@cs.fit.edu
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA
or see http://www.gnu.org/licenses/gpl.txt
USAGE
To compress: PAQ1 archive file file... (1 or more file names),
or (MSDOS): dir/b | PAQ1 archive (read file names from input)
or (UNIX): ls | PAQ1 archive
To decompress: PAQ1 archive
To list contents: more < archive
Compression: The files listed are compressed and stored in the archive,
which is created. The archive must not already exist. File names may
specify a path, which is stored. If there are no file names on the command
line, then PAQ1 prompts for them, reading until the first blank line or
end of file.
Decompression: No file names are specified. The archive must exist.
If a path is stored, the file is extracted to the appropriate directory,
which must exist. PAQ1 does not create directories. If the file to be
extracted already exists, it is not replaced; rather it is compared with
the archived file, and the offset of the first difference is reported.
It is not possible to add, remove, or update files in an existing archive.
If you want to do this, extract the files, delete the archive, and
create a new archive with just the files you want.
FILE FORMAT
The archived file names are stored in a readable header as follows:
PAQ1\r\n
size name\r\n
size name\r\n
\032\f\0
where "size" is the original number of bytes, as a 10 digit decimal number,
right justified with leading spaces, "name" is the file name (as input,
including path if any), \r = carriage return, \n = linefeed, \f = formfeed,
\032 is the DOS end of file character, and \0 is a NUL.
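As a sketch, one header line in the format above can be produced with printf-style formatting; `write_header_line` is a hypothetical helper for illustration, not part of PAQ1:

```cpp
#include <cstdio>
#include <cstring>

// Sketch (not the actual PAQ1 code): format one header line as
// "size name\r\n" with the size right-justified in a 10-character
// decimal field padded with leading spaces.
int write_header_line(char* out, long size, const char* name) {
  return std::sprintf(out, "%10ld %s\r\n", size, name);
}
```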
The header is followed by the compressed data in binary format. The
input files are concatenated and treated as a single data stream. Some
improvement in compression is possible by grouping related files together,
for instance, all text files followed by all binary files.
Data is compressed using a bit stream predictive arithmetic encoder.
An input string s is encoded as a binary fraction x such that for any
randomly chosen string r of the same length:
p(r < s) <= x < p(r <= s) (1)
Such an x with length at most log2(1/p(s)) + 1 bits can always be
found. It is convenient to estimate p(s) = p(s1 s2 s3 ... sn)
= p(s1) p(s2 | s1) p(s3 | s1 s2) ... p(sn | s1 s2...sn-1)
and to narrow the range of possible x (initially 0 to 1) as each
probability term is computed. As the leading bits of x become
known (because they are the same at the lower and upper bounds), they
can be output. The archive is a binary representation of x in base
256 with the most significant digit (256^-1) first. The ordering
of s is as a binary number of n bits (n/8 bytes) with the most
significant bit as the most significant bit of the first byte.
The archive is decompressed by finding s that satisfies (1) given x.
We make an identical series of probability estimates p(si | s1...si-1)
where each si is one bit (0 or 1), then use our knowledge of x
to decide which it is. After 8 predictions, one byte of uncompressed
data is output.
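The range-narrowing step described above can be sketched as follows. This is a simplified illustration, not PAQ1's actual Encoder class: the range [x1, x2] is split in proportion to the counts, coding a bit keeps one part, and when the leading byte of x1 and x2 agree it is settled and emitted as one base-256 digit of x.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of bitwise arithmetic coding (hypothetical, simplified).
// p(1) = n1/(n0+n1); bit 0 takes the low part of the range, bit 1 the
// high part, matching the ordering of s as a binary number.
struct Encoder {
  uint32_t x1 = 0, x2 = 0xffffffff;   // current range bounds
  std::vector<uint8_t> out;           // base-256 digits of x, MSB first
  void encode(int bit, uint32_t n0, uint32_t n1) {
    // split point, proportional to n0/(n0+n1)
    uint32_t mid = x1 + (uint32_t)(((uint64_t)(x2 - x1) * n0) / (n0 + n1));
    if (bit) x1 = mid + 1; else x2 = mid;
    while ((x1 >> 24) == (x2 >> 24)) {  // leading byte settled: output it
      out.push_back((uint8_t)(x1 >> 24));
      x1 <<= 8;
      x2 = (x2 << 8) | 255;
    }
  }
};
```

Decoding makes the same sequence of splits but uses the stored x to pick the half, as described below.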
Compression performance depends on the quality of prediction.
A predictor consists of a set of models that output a weighted
probability that the next bit (si) will be a 0 or 1, given all of the
previous input (s1 s2 ... si-1). The weighted probability is
expressed as counts of 0s and 1s (n0 and n1), where p(1) = n1/(n0+n1)
with weight (confidence) n0+n1. Models are combined by adding
their respective counts, which has the effect of a weighted averaging.
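The combining rule can be sketched directly; `Counts`, `combine`, and `p1` are hypothetical names, not PAQ1's interface. Summing counts weights each model's p(1) by its confidence n0+n1:

```cpp
// Sketch of combining model outputs by summing counts (hypothetical
// helper names). Each model reports (n0, n1); the mixed p(1) is the
// confidence-weighted average of the individual models' p(1).
struct Counts { int n0, n1; };

Counts combine(const Counts* m, int k) {
  Counts t = {0, 0};
  for (int i = 0; i < k; ++i) { t.n0 += m[i].n0; t.n1 += m[i].n1; }
  return t;
}

double p1(Counts c) { return (double)c.n1 / (c.n0 + c.n1); }
```

For example, mixing the bland model (1, 1) with a model reporting (0, 7) gives (1, 8), i.e. p(1) = 8/9.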
PAQ1 combines the outputs of 5 models:
m0 - A bland model in which p(1) = 1/2 with confidence 2, i.e.
n0 = 1, n1 = 1. By itself, this model would result in no compression.
m1 - NonstationaryPPM. This is a combination of N = 8 models of
order n = 1 through N. In an order n model, the next bit is
predicted in the context of the last n - 1 bytes, plus the last
0 to 7 bits of the current byte. Each time a 0 or 1 occurs in some
context, the corresponding count (n0 or n1) associated with that context
is incremented. If the opposite count is more than 2, then its excess
over 2 is halved. This
is a compromise between a stationary model, in which all outcomes in
the same context are equally significant, and an inverse temporal
model, in which the probability of an event is inversely proportional
to the time since it last occurred, regardless of anything that happened
previous to that.
For example, suppose that a context occurs 10 times, and the sequence
of bits that follow the context is 1111111000. If the outcomes are
independent, then n0 = 3, n1 = 7, and p(1) = 7/10 with confidence 10.
In an inverse temporal model, the last 1 was 4 events ago and the last
0 was 1 event ago, so 0 is 4 times more likely than 1, for p(1) = 1/5.
PAQ1 uses a compromise between these two positions:
1111111     n0 = 0  n1 = 7  p(1) = 1    with confidence 7
11111110    n0 = 1  n1 = 4  p(1) = 4/5  with confidence 5
111111100   n0 = 2  n1 = 3  p(1) = 3/5  with confidence 5
1111111000  n0 = 3  n1 = 2  p(1) = 2/5  with confidence 5
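The update rule described above (increment the observed count, halve the opposite count's excess over 2) reproduces this sequence exactly; `update` is a hypothetical helper for illustration:

```cpp
// Sketch of the nonstationary count update (hypothetical helper):
// increment the count of the observed bit; if the opposite count
// exceeds 2, halve its excess over 2 (integer division).
void update(int bit, int& n0, int& n1) {
  if (bit) ++n1; else ++n0;
  int& opp = bit ? n0 : n1;         // the count of the bit NOT seen
  if (opp > 2) opp = 2 + (opp - 2) / 2;
}
```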
The counts are weighted by n^2 (order squared). The rationale is
that if the last n bytes of context match, then the next byte will
match with probability (n - 1)/n (in an inverse temporal model),
so we weight the counts by n. Also, the expected number of matches
of length n is proportional to 1/n (given the above probability), so
we weight the counts by n again.
The counts are stored in a hash table with 16M entries. Each entry
takes 2 bytes for a total of 32 MB of memory: an 8 bit hash checksum
to detect collisions, and an 8 bit state representing the two counts
n0 and n1. The context is hashed to 32 bits, with 24 bits used to
index the hash table and 8 bits used to detect collisions. If a
collision occurs, then the next two table elements are tried.
If all 3 are occupied, then the one with the lowest n0 + n1 is removed.
The 8-bit state represents counts of 0 to 10 exactly, and counts
up to 512 as an approximation with probabilistic increments.
Since n0 and n1 can't both be large at the same time, no states are
needed for such pairs.
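The probing and eviction scheme can be sketched as follows. This is a simplified illustration (the real table packs n0 and n1 into one 8-bit state byte and uses a fixed 16M-entry size); `Slot` and `Table` are hypothetical names:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the hash table lookup described above (simplified).
// A 32-bit context hash supplies the table index and an 8-bit
// checksum. Three consecutive slots are probed; on a miss, the
// slot with the smallest n0 + n1 is recycled for the new context.
struct Slot { uint8_t chk = 0; uint8_t n0 = 0, n1 = 0; };

struct Table {
  std::vector<Slot> t;
  explicit Table(size_t n) : t(n) {}
  Slot& find(uint32_t hash) {
    size_t i = hash % t.size();         // index bits
    uint8_t chk = (uint8_t)(hash >> 24); // checksum bits
    size_t best = i;
    for (int j = 0; j < 3; ++j) {
      size_t k = (i + j) % t.size();
      if (t[k].chk == chk) return t[k];  // hit
      if (t[k].n0 + t[k].n1 < t[best].n0 + t[best].n1) best = k;
    }
    t[best] = Slot();                   // evict least-used slot
    t[best].chk = chk;
    return t[best];
  }
};
```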
m2 - MatchModel. This is an approximation of the NonstationaryPPM
model for orders longer than n = 8. It searches the previous
4M bytes of input for a matching context of 8 or more bytes (plus
the current 0-7 bits), and if found, predicts whatever bit followed next
with weight 3n^2 (instead of n^2 since there might be more matches,
but only the last one is used). It uses a hash table of 1M 3-byte
pointers (always overwriting the previous entry) into the 4MB buffer
to find matching contexts, for a total 7 MB of memory.
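The match-finding idea can be sketched as follows. This is a much-simplified illustration: the real model uses a 1M-entry table of 3-byte pointers into a 4 MB buffer and predicts at the bit level with weight 3n^2, whereas this sketch uses an exact map and predicts a whole byte.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the MatchModel idea (hypothetical, simplified): remember
// where each 8-byte context last occurred, and predict the byte that
// followed it, always overwriting with the newest occurrence.
struct MatchModel {
  std::vector<uint8_t> buf;                       // input history
  std::unordered_map<std::string, size_t> last;   // context -> next position
  // Observe byte c; return the byte that followed the previous
  // occurrence of the current 8-byte context, or -1 if none.
  int add(uint8_t c) {
    buf.push_back(c);
    if (buf.size() < 8) return -1;
    std::string ctx(buf.end() - 8, buf.end());
    auto it = last.find(ctx);
    bool hit = (it != last.end() && it->second < buf.size());
    size_t pred = hit ? it->second : 0;
    last[ctx] = buf.size();                       // overwrite with newest
    return hit ? (int)buf[pred] : -1;
  }
};
```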
m3 - WordModel. This is an optimization for English text. There
are two models, an order n = 1 word model and an order n = 2 word model.
In the first, the context is the most recent word or partial word,
plus the previous and current bytes (8 to 15 bits). A word is defined
as a sequence of letters (a-z), case insensitive. The second model
includes the previous word in the context. The weight is
(n + w)^2 where w is the length of the word or words (i.e. the
square of the length of the context in bytes as in m1). The counts
are stored in a 4M hash table (8 MB memory) as in model m1.
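Maintaining the running word context can be sketched as follows; `WordContext` and the hash constant are hypothetical (the real model also keeps the previous word and the current byte's bits in the context):

```cpp
#include <cctype>
#include <cstdint>

// Sketch of the order n = 1 word context (hypothetical): letters
// a-z extend a running case-insensitive word hash; any non-letter
// byte ends the current word.
struct WordContext {
  uint32_t word = 0;   // hash of the current (partial) word
  int len = 0;         // letters in the current word
  void add(int c) {
    if (std::isalpha(c)) {
      word = word * 131 + (uint32_t)std::tolower(c);
      ++len;
    } else {
      word = 0;        // a non-letter ends the word
      len = 0;
    }
  }
};
```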
m4 - CyclicModel. This is an optimization for data with fixed length
records or rows, such as databases, tables, and binary numeric data.
If the record length can be determined, then two models are used.
The first is an order 1 bit model (no context) with weight 4 in which
the history is all bits in the same column. The second is an order
9-bit model with weight 16 in which the context is the previous 8 bits
(spanning a byte boundary), i.e. the history is all those bits which
have the same 8 bit context as the current bit being predicted. The
bit counts are stored in a 32K (64 KB) hash table.
The record length is determined by finding a pattern of 4 consecutive
8 bit sequences at equal intervals of at least 9 bits. (It does not
have to be an integral number of bytes). If a pattern of n equally
spaced sequences is found at interval r, then n more records are
assumed before the model is expired. A new cycle length of r
may replace an existing one if the new product rn exceeds the old
one (with n rows remaining until the model expires).
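The equal-spacing criterion for detecting a record length can be sketched as follows; `cycle_length` is a hypothetical helper operating on bit positions:

```cpp
#include <cstddef>

// Sketch of the cycle-detection test described above (hypothetical):
// given the bit positions of the last four occurrences of the same
// 8-bit sequence, return the common interval if they are equally
// spaced at least 9 bits apart, else 0.
size_t cycle_length(const size_t pos[4]) {
  size_t r = pos[1] - pos[0];
  if (r >= 9 && pos[2] - pos[1] == r && pos[3] - pos[2] == r) return r;
  return 0;
}
```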
PERFORMANCE
The following results were obtained by PAQ1 and various popular
and top ranked compression programs on the Calgary corpus. For
each program, two tests were performed. In the first, the 14 files
were compressed and the total size is reported (either as an archive
or 14 compressed files, depending on the program). In the second
test, the 14 files were concatenated (alphabetically by file name)
into a single file of size 3,141,622 bytes and compressed. The
resulting sizes are shown. Run time is shown for the first test
but is about the same for both. The test was performed on
a 750 MHz Duron with 256 MB memory, 128K L2 cache, running Windows Me.
PAQ1 was compiled using DJGPP (g++) 2.95.2 with optimization (gxx -O).
For all other programs, options are selected for maximum compression
within 64 MB of memory.
Program           Options         14 files  Seconds  Concatenated
-------           -------         --------  -------  ------------
compress                         1,272,772      1.5     1,318,269
pkzip 2.04e                      1,032,290      1.5     1,033,217
gzip 1.2.4        -9             1,017,624      2       1,021,863
P5                                 992,902     31.8     same
P6                                 841,717     38.4     same
P12                                831,341     38.8     same
bzip2 1.0.0       -9               828,347      5         859,448
acb               u                766,322    110         769,363
boa 0.58b         -m15             751,413     44         769,196
ppmd H            e -m64 -o16      744,057      5         759,674
sbc 0.910         c -m3            740,161      4.1       819,016
ppmonstr H        e -m64 -o1       719,922     13         736,899
rk 1.04           -mx3 -M64 -ts    712,188     36         755,872
rk 1.02 b5        -mx3             707,144     64         750,740
PAQ1
  m1, 1 MB                         843,819     46.3     same for all
  m1, 8 MB                         768,463     45.9
  m1, 16 MB                        758,456     45.6
  m1, 32 MB                        751,734     45.3
  m1-m2, 32+7 MB                   731,637     48.5
  m1-m3, 32+7+8 MB                 723,092     62.1
  m1-m4, 16+7+8 MB                 720,310     71.2
  m1-m4, 32+7+4 MB                 717,766     68.0
  m1-m4, 32+7+8 MB (as coded)      716,704     68.1
compress, pkzip, and gzip are LZ encoders, encoding strings as pointers
to previous occurrences or to dictionary entries. bzip2 and sbc are
Burrows-Wheeler (context sorting) encoders. P5, P6, and P12 use
neural network prediction and arithmetic encoding. acb, boa, ppmd,
ppmonstr, and rk are stationary PPM (longest context matching) encoders.
rk is the best of 157 compressors rated by ACT as of 6/24/2001 (Gilchrist,
see http://compression.ca/act-calgary.html ), followed by sbc, boa, and acb.
On the Canterbury corpus, ppmonstr, ppmd, and rk are the top three
(see http://www.geocities.com/SiliconValley/Bay/1995/texts22.html ).
PAQ1 beats all other compressors when the Calgary corpus is concatenated
into a single file. Compression is the same with or without
concatenation because PAQ1 (and P5, P6, P12) always concatenate the
files to take advantage of cross file statistics. Concatenation tends to
hurt stationary models such as PPM and Burrows-Wheeler because the
Calgary corpus contains a mix of text and binary files
with different statistics and the models do not adapt.
Results for RK -mx3 and PAQ1. RK is a PPMZ-based program which
rearranges the file order to optimize compression across files.
For consistency, the PAQ1 test was run with the files in the same
order as RK rather than alphabetically as previously.
File       Size  RK -mx3  Ratio     PAQ1  Ratio
------  -------  -------  -----  -------  -----
PAPER2    82199    22111   26.8    22323  27.16
PAPER1    53161    13085   24.6    12783  24.05
BOOK1    768771   201950   26.2   206539  26.87
BOOK2    610856   139591   22.8   132151  21.63
OBJ1      21504     9397   43.6    10027  46.63
GEO      102400    47960   46.8    52244  51.02
OBJ2     246814    63754   25.8    66416  26.91
PIC      513216    30595    5.9    46975   9.15
BIB      111261    24146   21.7    23011  20.68
PROGL     71646    13692   19.1    12661  17.67
PROGC     39611    11614   29.3    10571  26.69
PROGP     49379    10207   20.6     9048  18.32
TRANS     93695    14273   15.2    13465  14.37
NEWS     377109   104435   27.6    97917  25.97
------  -------  -------  -----  -------  -----
14      3141622   706810   22.4   716386  22.80
rk outperforms PAQ1 on binary files such as GEO (seismic data, 32 bit
numbers), PIC (bitmapped 200 dpi FAX image, page from a textbook), OBJ1
(VAX executable), and OBJ2 (Macintosh executable). PAQ1 outperforms
rk on most text files, especially toward the end as more statistics are
collected.
rk also outperforms PAQ1 on random data, with practically no expansion
vs. 6.6% (8.5 bpc) for PAQ1. Slowing the aging of statistics (making the
model more stationary) decreases this expansion but hurts compression for
almost all other files. It is surprisingly difficult to detect
randomness embedded in compressible data without sacrificing the
ability to adapt to changing statistics.
ORIGINAL CODE: 1/6/01. Compiled using DJGPP 2.95.2.
REVISED 1/21/2001 to compile in Borland C++ 5.5.1.
- Removed unused typedef U64 (Borland does not support long long).
- In the CyclicModel constructor, changed cpos(255)
to cpos(256) to fix a vector out of bounds bug.
- In main(), added a test for filesize.size()==0 before calling
max_element(). The last two bugs caused PAQ1 to crash when compiled
with Borland but not g++. Archive compatibility is not affected.
*/
#include <cstdio>
#include <cstdlib>
#include <cctype>
#include <cmath>
#include <ctime>
#include <cstring>
#include <new>
#include <string>
#include <vector>
#include <algorithm>