Matthew V. Mahoney

Computer Science Dept., Florida Institute of Technology

150 W. University Dr., Melbourne FL 32901

407-724-1582

mmahoney@cs.fit.edu

© 2000, AAAI, All rights reserved

Keywords: Neural networks, Text compression, Data compression, On-line training/learning, Maximum entropy, Prediction, Efficiency.

We follow the approach of Schmidhuber and Heil (1996) of using neural
network prediction followed by arithmetic encoding, a model that they
derived from PPM (prediction by partial match), developed by Bell, Witten,
and Cleary. In a predictive encoder, a predictor, observing a stream
of input characters, assigns a probability distribution for the next
character. An arithmetic encoder, starting with the real interval [0, 1],
repeatedly subdivides this range for each character
in proportion to the predicted distribution, with the largest
subintervals for the most likely characters. Then after the character is
observed, the corresponding subinterval becomes the new (smaller) range.
The encoder output is the shortest number that can be expressed as a
binary fraction within the resulting final range. Since arithmetic
encoding is optimal within one bit of the Shannon limit of
log_{2} 1/P(x) bits (where P(x) is the
probability of the entire input string, x), compression ratio depends
almost entirely on the predictor.

Schmidhuber and Heil replaced the PPM predictor (which matches the context of the last few characters to previous occurrences in the input) with a 3-layer neural network trained by back propagation to assign character probabilities when given the context as input. Unfortunately the algorithm was too slow to make it practical to test on standard benchmarks, such as the Calgary corpus (Bell, Witten, and Cleary, 1989). Training on a 10K to 20K text file required days of computation on an HP 700 workstation, and the prediction phase (which compressed English and German newspaper articles to 2.94 bpc) ran 1000 times slower than standard methods. In contrast, the 2-layer network we describe, which learns and predicts in a single pass, will compress a similar quantity of text in about 2 seconds on a 100 Mhz 80486 (typically achieving better compression).

In section 2, we describe a neural network that predicts the input one bit at a time, and show that it is equivalent to the maximum-entropy (Rosenfeld, 1996; Manning and Schütze, 1999, pp. 589 ff) approach to combining probabilistic constraints. In (3), we derive an optimal one-pass learning rate for independent, stationary inputs, and extend the model to the more general case. In (4), we compare this model to a number of variations and find experimentally that it is the best. In (5), we find experimentally that neural network compression performs almost as well as PPM and Burrows-Wheeler (current the best known character models) in both speed and compression ratio, and better (but slower) than popular Limpel-Ziv models such as UNIX compress, zip, and gzip. In (6) we describe the implementation.

The set of known probabilities P(y|x_{i}) does not
completely constrain the
joint distribution P(y|**x**) that we are interested in finding.
According to the maximum entropy principle, the most likely distribution
for P(y|**x**) is the one with the highest entropy, and furthermore it
must have the form (using the notation of Manning and Schütze)

- P(
**x**, y) = 1/Z P_{i}a_{i}^{fi(x, y)}

Taking the log of (1), and using the fact that P(y|x) = P(x, y)/P(x), we can rewrite (1) in the following form:

- P(y|
**x**) = 1/Z' exp(S_{i}w_{i}x_{i})

where w_{i} = log a_{i}.
Setting Z' so that
P(y = 0|**x**) + P(y = 1|**x**) = 1, we obtain

- P(y|
**x**) = g(S_{i}w_{i}x_{i}), where - g(x) = 1/(1 + e
^{-x})

which is a 2-layer neural network with M input units x_{i} and
a single output P(y = 1|**x**).

Analogous to GIS, we train the network to satisfy the known probabilities
P(y|x_{i}) by iteratively adjusting the weights w_{i}
to minimize the error, E = y - P(y), the difference between the expected
output y (the next bit to be observed), and the prediction P(y).

- w
_{i}= w_{i}+ Dw_{i} - Dw
_{i}= hx_{i}E

where h is the learning rate, usually set
*ad hoc* to around 0.1 to 0.5. This is fine for batch mode
training, but for on-line compression, where each training sample is
presented only once, we need to choose h
more carefully.

- p º P(y|x
_{i}= 1) = g(w_{i}) - w
_{i}= g^{-1}(p) = ln p/(1 - p) » ln N(1)/N(0)

where N(1) and N(0) are the number of times y has been observed to
be a 1 or 0 respectively in context x_{i}.
(We will return to the problem of zero counts later).

If we observe y = 1, then w_{i} is updated as follows:

- Dw
_{i}= ln (N(1)+1)/N(0) - ln N(1)/N(0) » 1/N(1)

- Dw
_{i}= ln N(1)/(N(0)+1) - ln N(1)/N(0) » -1/N(0)

We can now find h such that the learning
equation Dw_{i} =
hx_{i}E
(which converges whether or not the inputs are independent) is
optimal for independent inputs. For the case x_{i} = 1,

- Dw
_{i}= Dw_{i}E/E = hE - h = Dw
_{i}/E

- E = 1 - p = 1 - N(1)/N = N(0)/N
- h = Dw
_{i}/E = 1/EN(1) = N/N(0)N(1) = 1/s^{2}

- E = -p = -N(1)/N
- h = Dw
_{i}/E = -1/EN(0) = N/N(0)N(1) = 1/s^{2}

where N = N(0) + N(1), and s^{2} =
N(0)N(1)/N = Np(1 - p) is the variance of the training data.

If the inputs are correlated, then using the output error to adjust
the weights still allows the network to converge, but not at the
optimal rate. In the extreme case, if there are m perfectly correlated
copies of x_{i}, then the total effect will be equal to a
single input with weight mw_{i}. The optimal learning rate
will be h =
1/ms^{2} in this case. Since the
correlation cannot be determined locally, we introduce a parameter
h_{L}, called the long term learning
rate, where 1/m £ h_{L}
£ 1.
The weight update equation then becomes

- Dw
_{i}= h_{L}E/s^{2}

In addition, we have assumed that the data is stationary, that
p = P(y|x_{i}) does not change over time. The following bit
stream fragment of y for one particular context x_{i}
suggests that a better model is one where p periodically varies
between 0 and 1 over time. The input is
alice30.txt,
*Alice in Wonderland* by Lewis Carroll, from Project Gutenberg.
The context x_{i} = 1 for 256 particular values (selected
by a hash function) of the last 24 bits of input.

00000000000000010000000000000000000011111111000000000000000 00000000010000000000000111111111111111111111111000000000000 00000000000001000000000000000001111110100010100010100010100 01010001010001010001010001010001010001010001010001010001010 00101000101000110000000000000000000000000000000000000000000 00000000000000000000000000000000000000000000000000000001111 11111100000000000000000000000010000000000000000000001111111 11111111111100000100000001000001000000010000000000000000010 00000000000000100000100000001000001000000010000000000000000 01000000000000011111111111111111111111111000000000000000000

If the probability were fixed, we would not expect to find long strings
of both 0's and 1's. It is clear from this example that the last
few bits of input history is a better predictor of the next bit
than simply weighting all of the bits equally, as
h_{L} does. The solution is
to impose a lower bound, h_{S}, on
the learning rate, a parameter that we call the *short term*
learning rate. This has the effect of weighting the last
1/h_{S} bits more heavily.

- Dw
_{i}= (h_{S}+ h_{L}/s^{2})E

Finally, recall that s^{2}
= N(0)N(1)/N = Np(1 - p),
which means that the learning rate is undefined when either count is
0, or equivalently, when p = N(1)/N is 0 or 1. A common solution is
to add a small offset, d, to each of the counts N(0) and N(1).
For the case d = 1, we have Laplace's law of succession, which is optimal
when all values of p are *a priori* equally likely (Witten and Bell,
1991). Cleary and Teahan (1995) found, however, that the optimal value of d
depends on the data source. For text, their experiments suggest a value
of d » 0.1, though we found that 0.5 works
well. Thus, in (18), we use

- s
^{2}= (N+2d) / (N(0)+d)(N(1)+d)

For all of the tests, the input was *Alice in Wonderland*,
152,141 bytes. A neural network was constructed with 258K input units,
divided into four sets of contexts, listed below.
Exactly one input from each set was active at any time. The context
sets were:

- The last 8 bits of input, plus the position (0-7) within the
current byte, for a total of 2
^{11}contexts. Thus, unit x_{i}is active if the last 8 bits of input, concatenated with the 3-bit number representing the bit position 0-7, is exactly equal to the 11-bit number*i*. - The last 16 bits of input (2
^{16}contexts). - The last 24 bits of input, hashed to a 16 bit value
(2
^{16}contexts). - The last 32 bits of input, hashed to a 17 bit value
(2
^{17}contexts).

These contexts are similar to the ones used by PPM compressors, which typically use the last 1, 2, 3, and 4 characters to predict the next character.

The parameters of (18) and (19) where hand tuned to
h_{L} = 0.5,
h_{S} = 0.08, and d = 0.5, to optimize
compression, obtaining a baseline of 2.301 bpc.
This compares to 2.466 bpc for conventional training
(h_{S} tuned to 0.375 with
h_{L} fixed at 0),
2.368 bpc for the assumption of stationary, independent
inputs, (h_{L} = 1,
h_{S} = 0),
and 2.330 bpc allowing for some correlation between inputs,
(h_{L} tuned to 0.6-0.7 with
h_{S} fixed at 0).

- Dw
_{i}= Eh_{i} - h
_{i}:= (AEE_{i}+ 1)h_{i} - E
_{i}= E

where E and E_{i} are the current and previous errors,
and A < 1 is a parameter that determines how rapidly the
learning rate is adjusted. This works fairly well.
After hand tuning, compression of 2.344 bpc
(0.043 over baseline) was obtained with A = 0.5, and
h_{i} initially 0.5.

- Dw
_{i}= h_{i}E, where - h
_{i}is updated by h_{i}:= Ah_{i}+ B.

The result was 2.380 bpc (0.079 over baseline), only a little better
than the best fixed rate (2.466 bpc). This was obtained by tuning
A to 1/2, B to 1/8, and h_{i}
to an initial value of 1.5. With these parameters,
h_{i} converges to B/(1 - A) = 1/4.

d | bpc |
---|---|

0.1 | 3.930 |

0.25 | 3.798 |

0.5 | 3.800 |

1 (Laplacian) | 3.915 |

Binary PPM compression was much worse than fixed rate learning
(2.445 bpc), and worse than the 3.123 bpc obtained
by simply averaging the four probabilities (with d = 0.5). It
was thought that weighted averaging, with weights proportional to
set size (i.e. 2^{11}, 2^{16}, 2^{16},
2^{17}) to favor the longer and presumably more reliable
contexts, should improve compression over simple averaging.
Instead, compression worsened to 3.182 bpc.

Program | Version | Year | Type | Alice bpc | Compress | Decompress | Book1 bpc |
---|---|---|---|---|---|---|---|

compress | 4.3d | 1990 | LZ | 3.270 | 1 sec | 0.5 sec | 3.486 |

pkzip | 2.04e | 1993 | LZ | 2.884 | 6 sec | <0.5 sec | 3.288 |

gzip -9 | 1.2.4 | 1993 | LZ | 2.848 | 7 sec | 0.5 sec | 3.250 |

szip -b41 -o0 | 1.05Xf | 1998 | BW | 2.239 | 9 sec | 6 sec | 2.345 |

ha a2 | 0.98 | 1993 | PPM | 2.171 | 16 sec | 16 sec | 2.453 |

boa -m15 | 0.58b | 1998 | PPM | 2.061 | 33 sec | 34 sec | 2.204 |

rkive -mt3 | 1.91b1 | 1998 | PPM | 2.055 | 134 sec | 117 sec | 2.120 [2] |

NN small | P5 | 1999 | NN | 2.301 | 24 sec | 24 sec | 2.508 |

NN large | P6 | 1999 | NN | 2.129 | 26 sec | 26 sec | 2.283 |

Compression ratios and times for *Alice in Wonderland* and ratios
for *book1*

[2] *rkive* ran out of memory using the -mt3 option
on *book1*, so -mt2 was used. The options -mt0 through -mt3
produce successively better compression at the cost of memory.

The large model uses 5 sets of inputs. In each set, the index *i*
of the active input x_{i} is a 22-bit hash of the last n (1 to 5)
characters, the last 0 to 7 bits of the current character, and the position
(0 to 7) within the current character. For n = 1, there are only 64K
possible values. The other four sets (n = 2 through 5)
overlap each other in a space of size 2^{22} - 64K = 4,128,768.
The paramters were hand tuned to
h_{L} = 0.35 and
h_{S} = (0.08, 0.15, 0.2, 0.2, 0.2) for
the context sets for n = (1, 2, 3, 4, 5), with d = 0.5.

The seven compression program options were set for maximum text
compression. UNIX *compress*, *pkzip*, and *gzip*
are popular Limpel-Ziv
compressors because of their speed, and in spite of their poor
compression. They work by replacing reoccurring n-grams
with a pointer to the previous occurrence.

Gilchrist (1999) rated *boa* (Sutton, 1998)
as giving the best compression from among hundreds of archivers
on the Calgary corpus, and *rkive* (Taylor, 1998)
as the best on the text portion as of Sept. 1998. Both are PPM programs,
which predict characters based
on n-gram counts. Bell (1998) rated *szip* (Schindler, 1998),
a Burrows-Wheeler encoder,
the best on the large (2-4 MB) text portion of the
Canterbury corpus.
It sorts the input stream by context, then uses an
adaptive 1-gram compressor on the permuted stream. *ha*
(Hirvola, 1993) is an early PPM encoder, rated the best compressor in 1994
(Data Compression FAQ, 1998).

The large model compressed to within 0.074 bpc of *rkive*, the
best compressor on *Alice in Wonderland*, and 0.719 bpc better
than *gzip*, the best LZ compressor. With no further tuning,
it compressed within 0.163 bpc of *rkive* on *book1*, and
0.967 bpc better than *gzip*.

In another test, the large model was compared with the top-rated
*boa* on the Calgary corpus of 14 text and binary files, again with
no further tuning.
Both programs use about the same amount of memory (15 MB for *boa*,
16 MB for the neural network), and both compress in "solid" mode
(*boa* option *-s*),
treating the input as a single file. The result, using the same
file order, was 1.904 bpc for
*boa* in 720 seconds, and 2.142 bpc for the neural network
in 537 seconds.

The arithmetic encoder stores the current range as a pair of unsigned 32 bit numbers, writing out the leading bytes whenever they match. This, and using a 16 bit unsigned probability, results in a slight loss of efficiency, experimentally less than 0.0001 bpc.

Additional optimization techniques include the exclusive use of integer arithmetic, using bit shifts in place of division when possible, using table lookup to compute g(.), and representing the weights as an array of structures (instead of a structure of arrays) to optimize cache usage. The programs were written in C++.

Although we optimized the neural network to predict text, it was
found to give good performance on a wide variety of data
types. It would probably be better to have the program automatically
adjust h_{S} and
h_{L}, rather than setting them
*ad hoc*, but this would just introduce another parameter to
determine the adjustment rate. The problem of *ad hoc*
parameters seems to be unavoidable.

Research is ongoing, and it is expected that improved compressors will be found. We have yet to investigate models that exploit syntactic or semantic constraints, our reason for using neural networks in the first place. Improved models (C++ source code and Windows 95 executables) will be posted to http://mahoney4.home.netcom.com/compression as they become available.

Bell, Timothy (1998), Canterbury Corpus, http://corpus.canterbury.ac.nz/

M. Burrows and D. J. Wheeler (1994), A block-sorting lossless data compression algorithm , Digital Systems Research Center,

Cleary, John G., W. J. Teahan (1995), "Experiments on the zero frequency problem", Proc. Data Compression Conference, 480.

Data Compression FAQ (1998), http://www.cs.ruu.nl/wais/html/na-dir/compression-faq/.html

Feldman, J. A. and D. H. Ballard (1982), "Connectionist models and their properties", Cognitive Science, 6: 205-254

Gilchrist, Jeff (1999), Archive Comparison Test, http://act.by.net/act.html

Hirvola, H. (1993), ha 0.98, http://www.webwaves.com/arcers/msdos/ha098.zip

Manning, Christopher D., and Hinrich Schütze, (1999), Foundations of Statistical Natural Language Processing, Cambridge MA: MIT Press.

Rosenfeld, Ronald (1996), "A maximum entropy approach to adaptive statistical language modeling", Computer, Speech and Language, 10, see also http://www.cs.cmu.edu/afs/cs/user/roni/WWW/me-csl-revised.ps.

Schindler, Michael (1998), szip homepage, http://www.compressconsult.com/szip/

Schmidhuber, Jürgen, and Stefan Heil (1996), "Sequential neural text compression", IEEE Trans. on Neural Networks 7(1): 142-146.

Sutton, Ian (1998), boa 0.58 beta, ftp://ftp.cdrom.com/pub/sac/pack/boa058.zip

Taylor, Malcolm, RKIVE v1.91 beta 1 (1998), http://www.geocities.com/SiliconValley/Peaks/9463/rkive.html

Witten, Ian H., Timothy C. Bell (1991), "The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression", IEEE Trans. on Information Theory, 37(4): 1085-1094