Matthew V. Mahoney
Computer Science Dept., Florida Institute of Technology
150 W. University Dr., Melbourne FL 32901
© 2000, AAAI, All rights reserved
Keywords: Neural networks, Text compression, Data compression, On-line training/learning, Maximum entropy, Prediction, Efficiency.
We follow the approach of Schmidhuber and Heil (1996) of using neural network prediction followed by arithmetic encoding, a model that they derived from PPM (prediction by partial match), developed by Bell, Witten, and Cleary. In a predictive encoder, a predictor, observing a stream of input characters, assigns a probability distribution for the next character. An arithmetic encoder, starting with the real interval [0, 1], repeatedly subdivides this range for each character in proportion to the predicted distribution, with the largest subintervals for the most likely characters. Then after the character is observed, the corresponding subinterval becomes the new (smaller) range. The encoder output is the shortest number that can be expressed as a binary fraction within the resulting final range. Since arithmetic encoding is optimal within one bit of the Shannon limit of log2 1/P(x) bits (where P(x) is the probability of the entire input string, x), compression ratio depends almost entirely on the predictor.
Schmidhuber and Heil replaced the PPM predictor (which matches the context of the last few characters to previous occurrences in the input) with a 3-layer neural network trained by back propagation to assign character probabilities when given the context as input. Unfortunately the algorithm was too slow to make it practical to test on standard benchmarks, such as the Calgary corpus (Bell, Witten, and Cleary, 1989). Training on a 10K to 20K text file required days of computation on an HP 700 workstation, and the prediction phase (which compressed English and German newspaper articles to 2.94 bpc) ran 1000 times slower than standard methods. In contrast, the 2-layer network we describe, which learns and predicts in a single pass, will compress a similar quantity of text in about 2 seconds on a 100 Mhz 80486 (typically achieving better compression).
In section 2, we describe a neural network that predicts the input one bit at a time, and show that it is equivalent to the maximum-entropy (Rosenfeld, 1996; Manning and Schütze, 1999, pp. 589 ff) approach to combining probabilistic constraints. In (3), we derive an optimal one-pass learning rate for independent, stationary inputs, and extend the model to the more general case. In (4), we compare this model to a number of variations and find experimentally that it is the best. In (5), we find experimentally that neural network compression performs almost as well as PPM and Burrows-Wheeler (current the best known character models) in both speed and compression ratio, and better (but slower) than popular Limpel-Ziv models such as UNIX compress, zip, and gzip. In (6) we describe the implementation.
. Rosenfeld's perplexity of 68 is equivalent to
an entropy of log2 68 = 6.09 bits per word, or 1.11 bpc at
5.5 characters per word (including spaces). In addition, we assume
that the 2.2% of the words outside a 20,000 word vocabulary
that Rosenfeld modeled as a single token could be encoded at 3 bits
per 10 character word using a reasonable n-gram character model, adding
3 x 0.022 x 10/5.5 = 0.12 bpc. The 20K dictionary would add
The set of known probabilities P(y|xi) does not
completely constrain the
joint distribution P(y|x) that we are interested in finding.
According to the maximum entropy principle, the most likely distribution
for P(y|x) is the one with the highest entropy, and furthermore it
must have the form (using the notation of Manning and Schütze)
2. A Maximum Entropy Neural Network
In a predictive encoder, we are interested in predicting the next
input symbol y, given a feature vector x = x1,
x2,..., xM, where each xi is a
1 if a particular context is present in the input history, and
0 otherwise. In our implementation, we predict one bit at a
time, so y is either 0 or 1.
We can estimate P(y = 1|xi) »
N(1)/N, for each context xi,
by counting the number of times, N(1), that y = 1 occurs
in the N occurrences of that context. In an n-gram character model
(n about 4 to 6),
there would be one active input (xi = 1) for each of
the n contexts.
The set of known probabilities P(y|xi) does not completely constrain the joint distribution P(y|x) that we are interested in finding. According to the maximum entropy principle, the most likely distribution for P(y|x) is the one with the highest entropy, and furthermore it must have the form (using the notation of Manning and Schütze)
Taking the log of (1), and using the fact that P(y|x) = P(x, y)/P(x), we can rewrite (1) in the following form:
where wi = log ai. Setting Z' so that P(y = 0|x) + P(y = 1|x) = 1, we obtain
which is a 2-layer neural network with M input units xi and a single output P(y = 1|x).
Analogous to GIS, we train the network to satisfy the known probabilities P(y|xi) by iteratively adjusting the weights wi to minimize the error, E = y - P(y), the difference between the expected output y (the next bit to be observed), and the prediction P(y).
where h is the learning rate, usually set ad hoc to around 0.1 to 0.5. This is fine for batch mode training, but for on-line compression, where each training sample is presented only once, we need to choose h more carefully.
where N(1) and N(0) are the number of times y has been observed to be a 1 or 0 respectively in context xi. (We will return to the problem of zero counts later).
If we observe y = 1, then wi is updated as follows:
We can now find h such that the learning equation Dwi = hxiE (which converges whether or not the inputs are independent) is optimal for independent inputs. For the case xi = 1,
where N = N(0) + N(1), and s2 = N(0)N(1)/N = Np(1 - p) is the variance of the training data.
If the inputs are correlated, then using the output error to adjust the weights still allows the network to converge, but not at the optimal rate. In the extreme case, if there are m perfectly correlated copies of xi, then the total effect will be equal to a single input with weight mwi. The optimal learning rate will be h = 1/ms2 in this case. Since the correlation cannot be determined locally, we introduce a parameter hL, called the long term learning rate, where 1/m £ hL £ 1. The weight update equation then becomes
In addition, we have assumed that the data is stationary, that p = P(y|xi) does not change over time. The following bit stream fragment of y for one particular context xi suggests that a better model is one where p periodically varies between 0 and 1 over time. The input is alice30.txt, Alice in Wonderland by Lewis Carroll, from Project Gutenberg. The context xi = 1 for 256 particular values (selected by a hash function) of the last 24 bits of input.
00000000000000010000000000000000000011111111000000000000000 00000000010000000000000111111111111111111111111000000000000 00000000000001000000000000000001111110100010100010100010100 01010001010001010001010001010001010001010001010001010001010 00101000101000110000000000000000000000000000000000000000000 00000000000000000000000000000000000000000000000000000001111 11111100000000000000000000000010000000000000000000001111111 11111111111100000100000001000001000000010000000000000000010 00000000000000100000100000001000001000000010000000000000000 01000000000000011111111111111111111111111000000000000000000
If the probability were fixed, we would not expect to find long strings of both 0's and 1's. It is clear from this example that the last few bits of input history is a better predictor of the next bit than simply weighting all of the bits equally, as hL does. The solution is to impose a lower bound, hS, on the learning rate, a parameter that we call the short term learning rate. This has the effect of weighting the last 1/hS bits more heavily.
Finally, recall that s2 = N(0)N(1)/N = Np(1 - p), which means that the learning rate is undefined when either count is 0, or equivalently, when p = N(1)/N is 0 or 1. A common solution is to add a small offset, d, to each of the counts N(0) and N(1). For the case d = 1, we have Laplace's law of succession, which is optimal when all values of p are a priori equally likely (Witten and Bell, 1991). Cleary and Teahan (1995) found, however, that the optimal value of d depends on the data source. For text, their experiments suggest a value of d » 0.1, though we found that 0.5 works well. Thus, in (18), we use
For all of the tests, the input was Alice in Wonderland, 152,141 bytes. A neural network was constructed with 258K input units, divided into four sets of contexts, listed below. Exactly one input from each set was active at any time. The context sets were:
These contexts are similar to the ones used by PPM compressors, which typically use the last 1, 2, 3, and 4 characters to predict the next character.
The parameters of (18) and (19) where hand tuned to hL = 0.5, hS = 0.08, and d = 0.5, to optimize compression, obtaining a baseline of 2.301 bpc. This compares to 2.466 bpc for conventional training (hS tuned to 0.375 with hL fixed at 0), 2.368 bpc for the assumption of stationary, independent inputs, (hL = 1, hS = 0), and 2.330 bpc allowing for some correlation between inputs, (hL tuned to 0.6-0.7 with hS fixed at 0).
where E and Ei are the current and previous errors, and A < 1 is a parameter that determines how rapidly the learning rate is adjusted. This works fairly well. After hand tuning, compression of 2.344 bpc (0.043 over baseline) was obtained with A = 0.5, and hi initially 0.5.
The result was 2.380 bpc (0.079 over baseline), only a little better than the best fixed rate (2.466 bpc). This was obtained by tuning A to 1/2, B to 1/8, and hi to an initial value of 1.5. With these parameters, hi converges to B/(1 - A) = 1/4.
Binary PPM compression was much worse than fixed rate learning (2.445 bpc), and worse than the 3.123 bpc obtained by simply averaging the four probabilities (with d = 0.5). It was thought that weighted averaging, with weights proportional to set size (i.e. 211, 216, 216, 217) to favor the longer and presumably more reliable contexts, should improve compression over simple averaging. Instead, compression worsened to 3.182 bpc.
|Program||Version||Year||Type||Alice bpc||Compress||Decompress||Book1 bpc|
|compress||4.3d||1990||LZ||3.270||1 sec||0.5 sec||3.486|
|pkzip||2.04e||1993||LZ||2.884||6 sec||<0.5 sec||3.288|
|gzip -9||1.2.4||1993||LZ||2.848||7 sec||0.5 sec||3.250|
|szip -b41 -o0||1.05Xf||1998||BW||2.239||9 sec||6 sec||2.345|
|ha a2||0.98||1993||PPM||2.171||16 sec||16 sec||2.453|
|boa -m15||0.58b||1998||PPM||2.061||33 sec||34 sec||2.204|
|rkive -mt3||1.91b1||1998||PPM||2.055||134 sec||117 sec||2.120 |
|NN small||P5||1999||NN||2.301||24 sec||24 sec||2.508|
|NN large||P6||1999||NN||2.129||26 sec||26 sec||2.283|
Compression ratios and times for Alice in Wonderland and ratios for book1
 rkive ran out of memory using the -mt3 option on book1, so -mt2 was used. The options -mt0 through -mt3 produce successively better compression at the cost of memory.
The large model uses 5 sets of inputs. In each set, the index i of the active input xi is a 22-bit hash of the last n (1 to 5) characters, the last 0 to 7 bits of the current character, and the position (0 to 7) within the current character. For n = 1, there are only 64K possible values. The other four sets (n = 2 through 5) overlap each other in a space of size 222 - 64K = 4,128,768. The paramters were hand tuned to hL = 0.35 and hS = (0.08, 0.15, 0.2, 0.2, 0.2) for the context sets for n = (1, 2, 3, 4, 5), with d = 0.5.
The seven compression program options were set for maximum text compression. UNIX compress, pkzip, and gzip are popular Limpel-Ziv compressors because of their speed, and in spite of their poor compression. They work by replacing reoccurring n-grams with a pointer to the previous occurrence.
Gilchrist (1999) rated boa (Sutton, 1998) as giving the best compression from among hundreds of archivers on the Calgary corpus, and rkive (Taylor, 1998) as the best on the text portion as of Sept. 1998. Both are PPM programs, which predict characters based on n-gram counts. Bell (1998) rated szip (Schindler, 1998), a Burrows-Wheeler encoder, the best on the large (2-4 MB) text portion of the Canterbury corpus. It sorts the input stream by context, then uses an adaptive 1-gram compressor on the permuted stream. ha (Hirvola, 1993) is an early PPM encoder, rated the best compressor in 1994 (Data Compression FAQ, 1998).
The large model compressed to within 0.074 bpc of rkive, the best compressor on Alice in Wonderland, and 0.719 bpc better than gzip, the best LZ compressor. With no further tuning, it compressed within 0.163 bpc of rkive on book1, and 0.967 bpc better than gzip.
In another test, the large model was compared with the top-rated boa on the Calgary corpus of 14 text and binary files, again with no further tuning. Both programs use about the same amount of memory (15 MB for boa, 16 MB for the neural network), and both compress in "solid" mode (boa option -s), treating the input as a single file. The result, using the same file order, was 1.904 bpc for boa in 720 seconds, and 2.142 bpc for the neural network in 537 seconds.
The arithmetic encoder stores the current range as a pair of unsigned 32 bit numbers, writing out the leading bytes whenever they match. This, and using a 16 bit unsigned probability, results in a slight loss of efficiency, experimentally less than 0.0001 bpc.
Additional optimization techniques include the exclusive use of integer arithmetic, using bit shifts in place of division when possible, using table lookup to compute g(.), and representing the weights as an array of structures (instead of a structure of arrays) to optimize cache usage. The programs were written in C++.
Although we optimized the neural network to predict text, it was found to give good performance on a wide variety of data types. It would probably be better to have the program automatically adjust hS and hL, rather than setting them ad hoc, but this would just introduce another parameter to determine the adjustment rate. The problem of ad hoc parameters seems to be unavoidable.
Research is ongoing, and it is expected that improved compressors will be found. We have yet to investigate models that exploit syntactic or semantic constraints, our reason for using neural networks in the first place. Improved models (C++ source code and Windows 95 executables) will be posted to http://mahoney4.home.netcom.com/compression as they become available.
Bell, Timothy (1998), Canterbury Corpus, http://corpus.canterbury.ac.nz/
M. Burrows and D. J. Wheeler (1994), A block-sorting lossless data compression algorithm , Digital Systems Research Center,
Cleary, John G., W. J. Teahan (1995), "Experiments on the zero frequency problem", Proc. Data Compression Conference, 480.
Data Compression FAQ (1998), http://www.cs.ruu.nl/wais/html/na-dir/compression-faq/.html
Feldman, J. A. and D. H. Ballard (1982), "Connectionist models and their properties", Cognitive Science, 6: 205-254
Gilchrist, Jeff (1999), Archive Comparison Test, http://act.by.net/act.html
Hirvola, H. (1993), ha 0.98, http://www.webwaves.com/arcers/msdos/ha098.zip
Manning, Christopher D., and Hinrich Schütze, (1999), Foundations of Statistical Natural Language Processing, Cambridge MA: MIT Press.
Rosenfeld, Ronald (1996), "A maximum entropy approach to adaptive statistical language modeling", Computer, Speech and Language, 10, see also http://www.cs.cmu.edu/afs/cs/user/roni/WWW/me-csl-revised.ps.
Schindler, Michael (1998), szip homepage, http://www.compressconsult.com/szip/
Schmidhuber, Jürgen, and Stefan Heil (1996), "Sequential neural text compression", IEEE Trans. on Neural Networks 7(1): 142-146.
Sutton, Ian (1998), boa 0.58 beta, ftp://ftp.cdrom.com/pub/sac/pack/boa058.zip
Taylor, Malcolm, RKIVE v1.91 beta 1 (1998), http://www.geocities.com/SiliconValley/Peaks/9463/rkive.html
Witten, Ian H., Timothy C. Bell (1991), "The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression", IEEE Trans. on Information Theory, 37(4): 1085-1094