Matt Mahoney

mmahoney@cs.fit.edu

Last update: Mar. 25, 2000

This FAQ is for Fast Text Compression with Neural Networks.

A predictive encoder has two parts (see below). The first part is the predictor: it assigns to each character in the alphabet a probability that it will occur next. The second part is the encoder, which peeks at the next character and assigns it a code based on the probability given by the predictor. The idea is to assign short codes to the most likely characters, so that the output is shorter on average.

During decompression, the predictor makes an identical series of probability estimates for each character. But now the decoder does the opposite of the encoder: it matches the code against what the encoder would have assigned to each character, in order to recover the original character.

Compression

                                   P(a) = .04
                    +-----------+  P(b) = .003    +---------+
  the cat in th_ -->| Predictor |--> ...       -->| Encoder |--> X
                    +-----------+  P(e) = .3      +---------+
                                   ...                 ^
                                                       |
                                             e --------+

Decompression

                                   P(a) = .04
                    +-----------+  P(b) = .003    +---------+
  the cat in th_ -->| Predictor |--> ...       -->| Decoder |--> e
         ^          +-----------+  P(e) = .3      +---------+
         |                         ...                 |
         +---------------------------------------------+

  0                           .7    .8                          1
  +-------------------------------------------------------------+
  | a |b| c|d|  e  |    ...    |  t  |          .....            |
  +-------------------------------------------------------------+
                              /       \
        .7                 .74         .76                .8
  +-------------------------------------------------------------+
  | a |||| e ||      h      |  i  |||| o |         ..            |
  +-------------------------------------------------------------+
                 /               \
        .74    .746               .752                   .76
  +-------------------------------------------------------------+
  | a ||    e    ||  i  |   ..   |     "the" = .75 (.11 in binary)
  +-------------------------------------------------------------+

The final output is a number, expressed as a binary fraction, from anywhere within the resulting range. In the example above, we could pick 0.75 (.11 in binary) because it is between .746 and .752. Thus, the steps to encode *the* are:

- Predictor assigns, say, P(a) = .07, P(b) = .01, ..., P(t) = .10, ..., P(z) = .001
- Encoder sorts the input, assigns [0, .07] to *a*, [.07, .08] to *b*, ..., [.70, .80] to *t*, ...
- Next input character is *t*, new range becomes [.70, .80]
- Predictor (knowing the last letter was *t*) assigns P(a|t), P(b|t), ..., P(h|t) = .2, ..., P(z|t)
- Encoder assigns a subrange to each letter, including 20% to *h*: [.74, .76]
- Next letter is *h*, new range becomes [.74, .76]
- Predictor assigns P(a|th), P(b|th), ..., P(e|th) = .3, ...
- Encoder assigns [.746, .752] to *the*
- Input is *e*, new range is [.746, .752]
- End of input, output a number in the resulting range, say .75 (.11 in binary)

To decode:

- Predictor assigns exactly the same probabilities as before, P(a) = .07, P(b) = .01, ..., P(t) = .10, ..., P(z) = .001
- Encoder sorts the input, assigns [0, .07] to *a*, [.07, .08] to *b*, ..., [.70, .80] to *t*, ...
- Code is .75, so select *t*
- Predictor (knowing the last letter was *t*) assigns P(a|t), P(b|t), ..., P(h|t) = .2, ..., P(z|t)
- Encoder assigns a subrange to each letter, including 20% to *h*: [.74, .76]
- Code is .75, so next letter must be *h*
- Predictor assigns P(a|th), P(b|th), ..., P(e|th) = .3, ...
- Encoder assigns [.746, .752] to *the*
- Code is .75, so new range is [.746, .752] and output is *e*
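One decoding step can be sketched as follows: find which character's cumulative interval contains the code, emit that character, and rescale the code for the next step (the per-step tables come from the predictor, as above).

```cpp
#include <vector>

struct Sym { char ch; double lo, hi; };  // cumulative range per character

// Find the character whose interval contains the code, then
// renormalize the code into [0, 1) for the next prediction.
char decodeStep(double& code, const std::vector<Sym>& table) {
    for (const Sym& s : table) {
        if (code >= s.lo && code < s.hi) {
            code = (code - s.lo) / (s.hi - s.lo);
            return s.ch;
        }
    }
    return '?';  // not reached if the table covers [0, 1)
}
```

With code .75 and t = [.70, .80), this selects *t* and rescales the code to .5; the rescaled code then falls in h = [.40, .60) and afterwards in e = [.30, .60), recovering *the*.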

In our example, we have P(the) = P(t)P(h|t)P(e|th) = 0.1 x 0.2 x 0.3 = 0.006,
and log_{2} 1/0.006 = 7.38 bits. We can always find an 8 bit
code within any range of width 0.006 because these numbers
occur every 1/2^{8}, or about every 0.004.

The proof is easy for the simple case where all strings are equally likely.
If there are n possible strings each with probability 1/n, then we would
have to assign codes (in binary) of 0 through n-1. In binary, the
number n-1 has ⌈log_{2} n⌉ bits.

Samuel Morse developed one of the first
compression codes when he assigned short codes like "." for *e*
and "-" for *t*, and long codes like "--.." for *z*.

For instance, suppose we had to compress strings like *acbcaccbbaac*
where *a*, *b*, and *c* each occur with probability 1/3. We
might use the following code:

- a = 0
- b = 10
- c = 11

We could do better by coding two characters at a time:

- aa = 000
- ab = 001
- ac = 010
- ba = 011
- bb = 100
- bc = 101
- ca = 110
- cb = 1110
- cc = 1111

Arithmetic encoding assigns a single code to the entire string, so no
more than one bit is wasted. If we have *n* characters in string
*x*, then P(x) = 1/3^{n}, and the best we can do is
1/n log_{2} 1/P(x) = log_{2} 3 = 1.585 bpc.
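As a check on these numbers, the average bits per character of each scheme (weighting each code by its probability, with P(a) = P(b) = P(c) = 1/3) can be computed directly:

```cpp
#include <cmath>

// Average bits per character for each scheme above.
double bpcSingle() { return (1 + 2 + 2) / 3.0; }           // a=0, b=10, c=11
double bpcPairs()  { return (7 * 3 + 2 * 4) / (9.0 * 2); } // 7 3-bit and 2 4-bit pair codes
double bpcLimit()  { return std::log2(3.0); }              // arithmetic coding approaches this
```

This gives about 1.667, 1.611, and 1.585 bpc respectively, so each refinement moves closer to the entropy limit.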

- *t* = [.70, .80]: store x1 = 70, x2 = 80 in memory.
- *th* = [.74, .76]: output *7*, store x1 = 40, x2 = 60 in memory.
- *the* = [.746, .752]: store x1 = 46, x2 = 52 in memory.

Some loss of efficiency occurs because we have to round off probabilities in order to divide the range [x1, x2] on integer boundaries. In practice, we lose less than 0.0001 bpc.
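The digit shifting in the example above can be sketched as follows, using decimal digits for clarity (a real coder shifts out bits or bytes the same way):

```cpp
#include <string>

// Range is [x1, x2) scaled to [0, 100). Whenever x1 and x2 agree in
// their leading digit, that digit is fully determined: output it and
// shift it out of both ends of the range.
std::string renormalize(int& x1, int& x2) {
    std::string out;
    while (x1 / 10 == x2 / 10) {
        out += char('0' + x1 / 10);
        x1 = (x1 % 10) * 10;
        x2 = (x2 % 10) * 10;
    }
    return out;
}
```

For *t* = [70, 80) the leading digits 7 and 8 differ, so nothing is output; for *th* = [74, 76) both start with 7, so the coder outputs "7" and stores [40, 60).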

For compressing text, solving the prediction problem (language modeling)
implies passing the Turing test for artificial intelligence,
a problem that has remained unsolved since Turing first proposed it
in 1950. Shannon estimated (also
in 1950) that the entropy of written English text (i.e. the best compression
possible in theory) is between 0.6 and 1.3 bpc.
(My research suggests
1 bpc). The best language model, one by Rosenfeld in 1996, achieved
about 1.2 bpc on hundreds of megabytes of newspaper articles. On smaller
files, 2 bpc is typical of good compressors, and 3 bpc for popular
formats such as *zip*, *gzip*, and *compress*.

The problem is even harder than this for general purpose compression. For example, suppose we take a 1 gigabyte file of zero bits and encrypt it with triple DES (or some other strong encryption) using the key "foobar". The output will appear to be a random string of 0's and 1's, and no known compression program (including mine) will obtain any compression on it whatsoever. Yet, I just described the file exactly in the second sentence of this paragraph, in a lot less than one gigabyte. A general purpose compression program should therefore do at least as well as my description, but that implies that it must be capable of breaking DES and every other known encryption system.

It gets worse. It can be proven that there is no compression algorithm that will compress every input string. If there were, then the compressed code could be compressed again, obtaining more compression. This could be repeated until the code had length 0. Obviously, this code could not be decompressed, since any string could have produced it.

In short, there is *no* best compression algorithm for all types
of data.

The Turing test for AI was proposed in 1950. Turing stated that if a machine is indistinguishable from a human communicating through a terminal, then the machine has AI.

Humans are very good at ranking probabilities of text strings, for instance
P(*the big dog*) > P(*dog big the*) > P(*dgo gbi eht*).
This is also true of dialogues or transcripts, for instance,
P(*Q: What color are roses? A: red*) >
P(*Q: What color are roses? A: blue*).
If a machine knew this distribution, it could generate responses to
an arbitrary question *Q* such that the response *A* would have
the distribution P(*QA*)/P(*Q*). By definition of P, this is
exactly how a human would respond.

(Actually P varies among humans, but this works to the advantage of the machine. For details, see my paper Text Compression as a Test for Artificial Intelligence, 1999 AAAI Proceedings).

This method is known as PPM, or Prediction by Partial Match.
Some of the best compression programs, *rkive* and *boa*,
use a variant called PPMZ, developed by Bloom in 1997.

- Word order models, such as P(*dog*|*the big*), where the units are words rather than characters. These work best if we know where to divide the words. This is not as simple as it sounds. For instance, *New York* is one word (because its parts cannot stand alone), and *apples* is two words (*apple* and *-s*).
- Semantic models. Observing a word like *rose* increases the probability of observing related words like *red, fragrance, flower, garden*, etc.
- Syntactic models. These understand ordering constraints among nouns, verbs, adjectives, etc.

A neural network also resembles the one known working language model, the human brain. Developing a system like this might lead to a better understanding of how humans process language.

To make a prediction, such as P(e|th) (that *th* will be followed by
*e*), all of the contexts that are present (1, h, and th)
are set to 1, and all others to 0. (The input labeled "1", representing
the 0-order context, is always on). Then the outputs are calculated
as follows.

y_{j} = g(Σ_{i} w_{i,j}x_{i})

where g(x) = 1/(1 + e^{-x})

Then y_{i} is the probability that character *i* is next.
For example, P(e|th) = g(w_{e} + w_{he} + w_{the}).

The g(x) function has a range (0, 1), so it always represents a valid probability. It is an increasing function. For instance:

- g(-4) = 0.018
- g(-2) = 0.119
- g(-1) = 0.269
- g(0) = 0.5
- g(1) = 0.731
- g(2) = 0.881
- g(4) = 0.982
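The function is a direct transcription of the formula, and can be checked against the table above:

```cpp
#include <cmath>

// The squashing function g(x) = 1/(1 + e^-x), with range (0, 1).
double g(double x) { return 1.0 / (1.0 + std::exp(-x)); }
```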

w_{i,j} ← w_{i,j} + η x_{i}E_{j}

where E_{j} = y_{j} - P(j) is the error, the difference
between the actual and expected next character, and the constant
η is the *learning rate*.

Thus, in our example, if the next character is *e*, then we increase
w_{e}, w_{he}, and w_{the}, and decrease all
of the other weights connected to the inputs *1*, *h*,
and *th*. Note how this increases the predicted probability
of *e* the next time any of these contexts occur again.
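This update can be sketched for a single output unit. One assumption here: the error is taken as the actual outcome (1 if this unit's character occurred, 0 if not) minus the predicted probability, which gives the direction of change described above.

```cpp
#include <cstddef>
#include <vector>

// One output unit: w[i] is the weight from input i, x[i] in {0, 1}
// marks the active contexts. 'actual' is 1 if this output's character
// occurred, else 0; 'predicted' is the unit's output probability.
void train(std::vector<double>& w, const std::vector<int>& x,
           double predicted, int actual, double rate) {
    double e = actual - predicted;      // the error E
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] += rate * x[i] * e;        // only active inputs change
}
```

Weights on inactive inputs are untouched; weights on active inputs move toward the outcome, scaled by the learning rate.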

We can do better than this. First, note that if only one input is active,
then we can solve for the weights directly. If the desired output is
*p*, then for the weight between them,

w = g^{-1}(p) = log p/(1 - p)

If we let p = N(1)/N (the fraction of times that the output was 1 in the past), then we can get the same effect by setting the learning rate to (N(0) + N(1))/N(0)N(1). (See my paper for proof). This requires that we store along with each weight the counts N(0) and N(1), the number of times the output was a 0 or 1 when the input was active. (These are called c0 and c1 in p5.cpp and p6.cpp).

In practice we initialize the counts to 0.5 instead of 0 to avoid probabilities of 0 or 1 (and an initially infinite learning rate).
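A sketch of such a counter pair (c0 and c1 as in the text, initialized to 0.5), with the probability, weight, and learning rate derived from the counts:

```cpp
#include <cmath>

// Per-weight counts of how often the output was 0 or 1 while the
// input was active, initialized to 0.5 to avoid p = 0 or 1.
struct Counter {
    double c0 = 0.5, c1 = 0.5;
    void update(int bit) { (bit ? c1 : c0) += 1.0; }
    double p() const { return c1 / (c0 + c1); }            // fraction of 1s
    double weight() const { return std::log(c1 / c0); }    // = g^{-1}(p)
    double rate() const { return (c0 + c1) / (c0 * c1); }  // learning rate
};
```

Note that log(c1/c0) = log(p/(1 - p)) = g^{-1}(p), so the stored counts determine the weight directly.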

Since the inputs are neither fully independent nor fully dependent, NL should be somewhere between 1 and 1/m. I found that values around 1/sqrt(m) (about 0.4 for m = 5) work well in practice for text compression.

For example, if the input so far is

that thatch thaw the theatre theft th_

then it will assign the same probability to *a* and *e*,
even though the most recent statistics suggest that *e* is more
likely.

learning rate = NS + NL × (N(0) + N(1))/N(0)N(1)

The effect is to bias the statistics in favor of the last 1/NS occurrences of the context, and "forgetting" earlier examples.

Context Length    NS
      1          0.08
      2          0.15
      3          0.2
      4          0.2
      5          0.2

In text, character frequencies tend to be constant.

- The probability of a person with no symptoms having disease *y* (a cold) is P(y) = 0.001
- If symptom x_{1} only (a headache) is present, the probability is P(y|x_{1}) = 0.1
- If symptom x_{2} only (a sore throat) is present, the probability is P(y|x_{2}) = 0.5

This is a partially constrained problem over the joint distribution
P(x_{1}, x_{2}, y). There is not enough information
to solve the problem. We suspect that the probability will be high
(as ME predicts),
but in fact it could be zero without violating any of the
given constraints.

For the more general case where we wish to find P(y|X) =
P(y|x_{1}, x_{2}, ...,
x_{n}), the Maximum Entropy (ME) solution can be found by a
process called Generalized Iterative Scaling (GIS), and has the form

P(y|X) = 1/Z exp(w_{0} + Σ_{i} w_{i}x_{i})

where the w_{i} are constants (weights) to be solved, and Z is a
constant chosen to make the probabilities add up to 1.
(See Rosenfeld 1996, or Manning and Schütze 1999).

GIS is essentially an algorithm for adjusting the weights, w_{i},
until the constraints are met, similar to neural network training.
In fact, when we solve for Z, we have our neural network equation.

P(y|X) = g(w_{0} + Σ_{i} w_{i}x_{i})

where g(x) = 1/(1 + e^{-x})

We don't always have to use GIS. For our original example, we can solve for the weights directly to confirm our intuition.

- by setting x_{1} and x_{2} to 0: w_{0} = log 0.001/0.999 = -6.906
- by setting x_{1} to 1 and x_{2} to 0: w_{1} = log 0.1/0.9 - w_{0} = 4.709
- by setting x_{1} to 0 and x_{2} to 1: w_{2} = log 0.5/0.5 - w_{0} = 6.906
- P(y|x_{1}, x_{2}) = g(w_{0} + w_{1} + w_{2}) = g(4.709) = 1/(1 + e^{-4.709}) = 0.991
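The arithmetic in this example can be verified directly (a small sketch; logit is g^{-1}):

```cpp
#include <cmath>

double g(double x)     { return 1.0 / (1.0 + std::exp(-x)); }
double logit(double p) { return std::log(p / (1.0 - p)); }  // g^{-1}

// Solve the weights from the three constraints, then evaluate
// the probability with both symptoms present.
double bothSymptoms() {
    double w0 = logit(0.001);         // no symptoms:      P(y) = 0.001
    double w1 = logit(0.1) - w0;      // headache only:    P = 0.1
    double w2 = logit(0.5) - w0;      // sore throat only: P = 0.5
    return g(w0 + w1 + w2);
}
```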

For the longer contexts, I used a slightly different method to avoid having to do arithmetic on numbers longer than 32 bits, but the effect is the same.

For the two byte context (65536 possibilities), I gave each context its own input unit, so no hash function was needed. For the longer contexts, I mapped them into the same space, so it is possible that two contexts of different length might share the same input unit.
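A sketch of what such a shared mapping might look like. The table size and the FNV-1a hash here are illustrative assumptions, not the values used in p5.cpp or p6.cpp:

```cpp
#include <cstdint>
#include <string>

const std::uint32_t kUnits = 1 << 22;  // number of shared input units (illustrative)

// Map a context string (e.g. the last 3 characters) to an input unit.
// Contexts of different lengths can collide and share a unit.
std::uint32_t contextUnit(const std::string& ctx) {
    std::uint32_t h = 2166136261u;     // FNV-1a offset basis
    for (unsigned char c : ctx) {
        h ^= c;
        h *= 16777619u;                // FNV prime
    }
    return h % kUnits;
}
```

The mapping is deterministic, so the compressor and decompressor see the same unit for the same context; occasional collisions only blur the statistics slightly.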

Another possibility is to use a data structure that avoids collisions entirely, such as a tree or a bucket hash table. This would improve compression on small files, but it is not a general solution, since any such data structure would eventually fill up or exhaust memory.

Bit prediction is otherwise equivalent to character prediction. For instance,
the ASCII code for *e* is 01100101, so we can estimate its probability
as the product of the probabilities for each bit:

P(e|th) = P(0|th)P(1|th,0)P(1|th,01)P(0|th,011)P(0|th,0110)P(1|th,01100)P(0|th,011001)P(1|th,0110010).

There is a separate input unit for each of these 8 contexts.
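The product above can be sketched as follows. The bit predictor is passed in as a function standing for the model's estimate P(next bit = 1 | context, bits so far); the flat 0.5 stub in the usage note is only a hypothetical stand-in, not a real model.

```cpp
#include <string>

// Multiply together the per-bit probabilities of a character's code.
// bitP(seen) returns the model's estimate P(next bit = 1) given the
// bits of the character seen so far (the "th" context is implicit).
double charProb(const std::string& code,
                double (*bitP)(const std::string&)) {
    double p = 1.0;
    std::string seen;
    for (char b : code) {
        double p1 = bitP(seen);
        p *= (b == '1') ? p1 : 1.0 - p1;  // probability of the actual bit
        seen += b;
    }
    return p;
}
```

With a stub that always returns 0.5, the ASCII code for *e* (01100101) gets probability 0.5^8 = 1/256, as expected for a model with no knowledge.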

Bloom, Charles, ppmz v9.1 (1997), http://www.cco.caltech.edu/~bloom/src/ppmz.html

Manning, Christopher D., and Hinrich Schütze, (1999), Foundations of Statistical Natural Language Processing, Cambridge MA: MIT Press.

Rosenfeld, Ronald (1996), "A maximum entropy approach to adaptive statistical language modeling", Computer, Speech and Language, 10, see also http://www.cs.cmu.edu/afs/cs/user/roni/WWW/me-csl-revised.ps.

Shannon, Claude, and Warren Weaver (1949), The Mathematical Theory of Communication, Urbana: University of Illinois Press.

Shannon, Claude E. (1950), "Prediction and Entropy of Printed English", Bell Sys. Tech. J. (3) pp. 50-64.

Turing, A. M. (1950), "Computing Machinery and Intelligence", Mind, 59:433-460.