/* PAQ4 v2 - File archiver and compressor.
(C) 2003, Matt Mahoney, mmahoney@cs.fit.edu
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2 as
published by the Free Software Foundation at
http://www.gnu.org/licenses/gpl.txt or (at your option) any later version.
This program is distributed without any warranty.
USAGE
To compress: PAQ4 archive file file... (1 or more file names), or
or (MSDOS): dir/b | PAQ4 archive (read file names from input)
or (UNIX): ls | PAQ4 archive
To decompress: PAQ4 archive
To list contents: more < archive
Compression: The files listed are compressed and stored in the archive,
which is created. The archive must not already exist. File names may
specify a path, which is stored. If there are no file names on the command
line, then PAQ4 prompts for them, reading until the first blank line or
end of file.
Decompression: No file names are specified. The archive must exist.
If a path is stored, the file is extracted to the appropriate directory,
which must exist. PAQ4 does not create directories. If the file to be
extracted already exists, it is not replaced; rather it is compared with
the archived file, and the offset of the first difference is reported.
It is not possible to add, remove, or update files in an existing archive.
If you want to do this, extract the files, delete the archive, and
create a new archive with just the files you want.
TO COMPILE
gxx -O paq4.cpp DJGPP 2.95.2 (gives fastest executable)
bcc32 -O2 paq4.cpp Borland 5.5.1
sc -o paq4.cpp Digital Mars 8.35n
FILE FORMAT
A PAQ4 archive has the following format. The first 6 bytes are
"PAQ4\r\n" followed by a list of file sizes and names in
"%ld\t%s\r\n" format, e.g. 768771book1. The last filename
is followed by "\x1A\f\0" () and the
compressed data.
The compressor uses predictive arithmetic coding. The compressed data
is a base 256 represenation of a number x (e.g. the first byte has
weight 1/256, second is 1/65536, etc.) such that x has as few digits as
possible and satisfies P(s > y) <= x < P(s >= y), where y is the concatenated
files, s is a random string with probability distribution P, and (s > y)
means that s follows y in a lexicographical ordering over strings.
As y is read in, the bounds of possible x narrow, allowing the leading
digits to be written as they become known. Computation of x
is approximated using integer arithmetic, which expands the
data less than 0.001% over the Shannon limit of log2 1/P(y) bits.
Prediction (estimation of P(y)) is independent of coding. The model
by which P is estimated is built from the previous uncompressed data
regardless of whether it is being input or output, so the model
is identical for compression and decompression. The model consists
of an adaptive weighting of submodels, followed by context sensitive
probability adjustment using SSE (secondary symbol estimation). It
differs from PAQ3 and PAQ3N that the weighting of models is adaptive
rather than fixed. The cyclic model for fixed length records has also
been improved.
There are N = 19 submodels, which consist of counts of 0 and 1 bits that
appear in different contexts. These are combined to a probability P(1)
that the next bit of input will be a 1 as follows:
P(1) = SUM(i=1..N) wi (n1i / ni)
where wi is the i'th weight (determined adaptively), n1i is the number
of 1 bits observed in the i'th context, and ni = n0i + n1i is the
total number of times the i'th context was observed.
The weights wi are adjusted after each bit of uncompressed data to
reduce the cost (code length) of that bit. The adjustment is along
the gradient of the cost in weight space, which is
wi := wi + e[nyi/(SUM(j) (wj+wo) nyj) - ni/(SUM(j) (wj+wo) nj)]
where e and wo are small constants, and nyi is the count for bit y (0 or 1)
in the i'th context. The weight offset wo prevents the gradient from
going to infinity as the weights go to 0.
The weight vector is context sensitive. There are 8 vectors which
are trained and used depending on the context of the 3 high order
bits of the previous whole byte.
In order to favor recent data over older data, counts are discarded
in a way that gives higher weights to probabilities near 0 or 1.
After bit y is observed, the count ny is incremented, but half of the
opposite count over 2 is discarded. For example, if n0=3, n1=10, then
after a 0 is observed, then half of the 8 bits over 2 in n1 are discarded,
so n0=4, n1=6.
The 19 submodels are as follows. The current 0-7 bits of the partially
compressed byte is always included in the context except in the fixed
model.
- Fixed (n0 = n1 = 1)
- Order 0. The context is the 0-7 bits of the current byte.
- Order 1. The context is the last byte plus the current 0-7 bits.
- Order 2. The context is the last 2 bytes plus the current 0-7 bits.
- Order 3.
- Order 4.
- Order 5.
- Order 6.
- Order 7.
- Word. The context is the last whole word, defined as the last 0 to 8
contiguous bytes with values of 33 or more, plus the current bits.
- Sparse order 2, skipping the next to last byte (x.x)
- Sparse order 2, skipping 2 bytes (x..x), the 4th to last and last bytes.
- Sparse order 2, the 2nd and 4th last bytes (x.x.)
- Sparse order 2, the 4th and 8th last bytes (x...x...)
- Sparse order 2, the 2nd and 3rd last (xx.)
- Sparse order 2, the 3rd and 4th last (xx..)
- A long string model consisting of the last match of 8 or more bytes.
- A fixed record model consisting of the two bytes above (as a table)
- A fixed record model consisting of the byte above and to the left.
Counts for the order 2-7 and word models share a 32M element hash table,
where both counts are represented by an 8 bit state. Hash collisions
are detected (usually) by using an 8 bit checksum of the context,
with linear search over 3 buckets. If all 3 are full, then the element
with the lowest priority is replaced, where priority = n0 + n1.
The order 0 and 1 models use unhashed tables. The sparse order 2 models
share a 4M hash table. The two fixed record models share a 2M hash table
The long string model uses a 1M hash table to index a 4MB rotating buffer
with 8 and 32 bute hashes to find the most recent context match of
8 bytes or more. The count is set to nu = the length of the
match, where y is the bit that occured in that context, and the other
count to 0. For example, given "abcdefghij1...abcdefghij" then
n0=0, n1=10.
The fixed record model determines the record size (table width) by
searching for 4 consecutive appearances of the same character equally
spaced with an interval of at least 3. For instance, if the
string "a..a..a..a" is observed, then the record length is set to 3.
It remains set until another sequence is observed.
The probability P(1) calculated by the weighted sum of models is
further adjusted by SSE in the context of the current
0-7 bits of the partially compressed byte and the 2 high order bits of
the previous byte (1024 contexts). In each SSE context, P(1) is divided
into 32 bins, and the actual counts of 0s and 1s observed in each bin
are used to compute the new probability. When P(1) falls between bins,
the counts in the bins on both sides are incremented, and the output
probability is interpolated between the bins. The bin spacing is
closer together for output probabilities near 0 or 1. The counts are
not discounted to favor recent data as in the 19 submodels. Instead,
when the counters (8 bits) overflow, both n0 and n1 are halved. The
SSE mapping is initialized to SSE(P) = P. The final probability Pf
is then averaged to Pf = (3 SSE(P) + P) / 4.
Thanks to Serge Osnach for introducing me to SSE (in PAQ1SSE/PAQ2) and
the sparse models (PAQ3N). Also, credit to Eugene Shelwein,
Dmitry Shkarin for suggestions on using multiple character SSE contexts.
Credit to Eugene, Serge, and Jason Schmidt for developing faster and
smaller executables of previous versions. Credit to Werner Bergmans
and Berto Destasio for testing and evaluating them, including modifications
that improve compression at the cost of more memory.
I expect there will be better versions in the future. If you make any
changes, please change the name of the program (e.g. PAQ5), including
the string in the archive header by redefining PROGNAME below.
This will prevent any confusion about versions or archive compatibility.
Also, give yourself credit in the help message.
*/
#define PROGNAME "PAQ4" // Please change this if you change the program
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
const int PSCALE=2048; // Integer scale for representing probabilities
const int MEM=8; // 6 = 88 MB, increase by 1 to double it
template inline int size(const T& t) {return t.size();}
// 8-32 bit unsigned types, adjust as appropriate
typedef unsigned char U8;
typedef unsigned short U16;
typedef unsigned long U32;
class U24 { // 24-bit unsigned int
U8 b0, b1, b2; // Low, mid, high byte
public:
explicit U24(int x=0): b0(x), b1(x>>8), b2(x>>16) {}
operator int() const {return (((b2<<8)|b1)<<8)|b0;}
};
// 32-bit random number generator based on r(i) = r(i-24) ^ r(i-55)
class Random {
U32 table[55]; // Last 55 random values
int i; // Index of current random value in table
public:
Random();
U32 operator()() { // Return 32-bit random number
if (++i==55) i=0;
if (i>=64) return table[i]^=table[i-16]; // emilcont
else return table[i]^=table[i+63]; // emilcont
}
} rnd;
Random::Random(): i(0) { // Seed the table
table[0]=123456789;
table[1]=987654321;
for (int j=2; j<55; ++j)
table[j]=table[j-1]*11+table[j-2]*0.1; // emilcont
}
/* 3 byte counter, shown for reference only. It implements a
nonstationary pair of counters of 0s and 1s such that preference is
given to recent history by discarding old bits. */
class Counter3 {
U8 n[2]; // n[y] is the counts of ys (0s or 1s)
public:
Counter3() {n[0]=n[1]=0;}
int get0() const {return n[0];} // Return count of 0s
int get1() const {return n[1];} // Return count of 1s
int priority() const {return get0()+get1();} // For hash replacement
void add(int y) { // Add y (0 or 1) to n[y] and age the opposite count
if (n[y]<255) ++n[y];
if (n[1-y]>2) n[1-y]/=2, n[1-y]+=1;
}
};
/* Approximately equivalent 2 byte counter implementing the above.
The representable counts (n0, n1) are 0-10, 12, 14, 16, 20, 24, 28,
32, 48, 64, 128, 256, 512. Both counts are represented by a single
8-bit state. Counts larger than 10 are incremented probabilistically.
Although it uses 1/3 less memory, it is 8% slower and gives 0.05% worse
compression than the 3 byte counter. */
class Counter2 {
U8 state;
struct E { // State table entry
U16 n0, n1; // Counts represented by state
U8 s00, s01; // Next state on input 0 without/with probabilistic incr.
U8 s10, s11; // Next state on input 1
U32 p0, p1; // Probability of increment x 2^32-1 on inputs 0, 1
};
static E table[150]; // State table
public:
Counter2(): state(0) {}
int get0() const {return table[state].n0;}
int get1() const {return table[state].n1;}
int priority() const {return state;}
void add(int y) {
if (y) {
if (state<75 || rnd() is a hash table of 2^N elements of type T
with linear search of M=3 elements in case of collision. If all elements
collide, then the one with the lowest T.priority() is replaced.
Hashtable[i] returns a T& indexed by the lower N bits of i whose
checksum matches bits 31-24 of i, creating or replacing if needed.
The user must supply T::priority() and a 32-bit hash. */
template
class Hashtable {
private:
struct HashElement {
U8 checksum;
T value;
HashElement(int ch=0): checksum(ch), value() {}
};
vector table; // Array of 2^N+M elements
public:
Hashtable(): table((1<
T& Hashtable::operator[](U32 i) {
const U32 lb=i&((1<>24;
int bj=lb;
int bget=1000000000;
for (U32 j=lb; j= 8 bytes between
the current context and previous input, and predicts the next bit
in the previous context with weight n. If the next bit is 1, then
(n0[0], n1[0]) is assigned (0, n), else (n, 0). Matchies are found in
a 4MB rotating buffer using a 1M hash table of pointers. */
class MatchModel {
enum {N=1}; // Number of submodels
vector buf; // Input buffer, wraps at end
vector ptr; // Hash table of pointers
U32 hash[2]; // Hashes of current context up to pos-1
int pos; // Element of buf where next bit will be stored
int bpos; // Number of bits (0-7) stored at buf[pos]
int begin; // Points to first matching byte (does not wrap)
int end; // Points to last matching byte + 1, 0 if no match
public:
MatchModel(): buf(0x10000<>(8-bpos))!=buf[pos]) // Does it match?
begin=end=0; // no
if (bpos==8) { // New byte
bpos=0;
hash[0]=hash[0]*(16*56797157)+buf[pos]+1; // Hash last 8 bytes
hash[1]=hash[1]*(2*45684217)+buf[pos]+1; // Hash last 32 bytes
U32 h=hash[0] >> (10); // emilcont
if ((hash[0]>>28)==0)
h=hash[1] >> (10); // 1/16 of 8-contexts are hashed to 32 bytes emilcont
if (++pos==int(buf.size()))
pos=0;
if (end)
++end;
else { // Search for a matching context
end=ptr[h];
if (end>0) {
begin=end;
int p=pos;
while (begin>0 && p>0 && begin!=p+1 && buf[begin-1]==buf[p-1]) {
--begin;
--p;
}
}
if (end==begin) // No match found
begin=end=0;
}
ptr[h]=U24(pos);
}
*n0=*n1=0;
if (end) {
int wt=end-begin;
if (wt>255) wt=255;
int y=(buf[end]>>(7-bpos))&1;
if (y) *n1=wt;
else *n0=wt;
}
}
/* A RecordModel finds fixed length records and models bits in the context
of the two bytes above (the same position in the two previous records)
and in the context of the byte above and to the left (the previous byte).
The record length is assumed to be the interval in the most recent
occurrence of a byte occuring 4 times in a row equally spaced, e.g.
"x..x..x..x" would imply a record size of 3 (the mimimum). */
class RecordModel {
enum {N=2}; // Number of models
static int lpos[256][4]; // Position of last 4 bytes
vector buf; // Rotating history [4K]
Hashtable t; // Model emilcont
int pos; // Byte counter
int c0; // Leading 1 and 0-7 bits
int repeat; // Cycle length
int hash[N]; // of c0, context and repeat
Counter* cp[N]; // Pointer to current counter
public:
RecordModel();
void model(int y, int* n0, int* n1); // Update and predict
} recordModel;
int RecordModel::lpos[256][4]={{0}};
RecordModel::RecordModel(): buf(4096), pos(0), c0(1), repeat(1) {
hash[0]=hash[1]=0;
cp[0]=cp[1]=&t[0];
}
// Update the model with bit y, then put predictions of the next update
// as 0 counts in n0[0..N-1] and 1 counts in n1[0..N-1]
void RecordModel::model(int y, int* n0, int* n1) {
// Update the counters with bit y
cp[0]->add(y);
cp[1]->add(y);
c0+=c0+y;
if (c0>=256) {
c0-=256;
// Save positions of last 4 occurrences of current byte c0
int (&lp)[4]=lpos[c0];
for (int j=3; j>0; --j)
lp[j]=lp[j-1];
lp[0]=pos;
// Check for a repeating pattern of interval 3 or more
if (lp[2]-lp[3]==lp[1]-lp[2] && lp[1]-lp[2]==lp[0]-lp[1] && lp[0]-lp[1]>2)
repeat=lp[0]-lp[1];
// Save c0
buf[pos&(buf.size()-1)]=c0;
++pos;
// Compute context hashes
hash[0]=buf[(pos-repeat)&(buf.size()-1)]*578997481
+buf[(pos-repeat*2)&(buf.size()-1)]*878996291
+repeat*378997811; // Context is 2 bytes above
hash[1]=buf[(pos-repeat)&(buf.size()-1)]*236399113
+buf[(pos-1)&(buf.size()-1)]*736390141
+repeat*984388129; // Context is the byte above and to the left
c0=1;
}
// Set up counter pointers
cp[0]=&t[hash[0]+(c0<<24)+((c0*3)&255)];
cp[1]=&t[hash[1]+(c0<<24)+((c0*3)&255)];
// Predict the next value of y
n0[0]=cp[0]->get0();
n1[0]=cp[0]->get1();
n0[1]=cp[1]->get0();
n1[1]=cp[1]->get1();
}
/* A CharModel contains a fixed model, n-gram models from 0 to 7,
and several order-2 sparse models which skip over parts of the context. */
class CharModel {
enum {N=16}; // Number of models
vector t0; // Order 0 counts [256]
vector t1; // Order 1 counts [64K]
Hashtable t2; // Sparse models // emilcont
Hashtable t; // Order 2-7 models // emilcont
int c0; // Current partial byte with leading 1 bit
int c1, c2, c3, c4, c5, c6, c7; // Previous bytes
vector cxt; // Context hashes [N]
vector cp; // Pointers to current counts [N]
public:
CharModel();
void model(const int y, int* n0, int* n1); // Update and predict
int getc0() const {return c0;}
int getc1() const {return c1;}
int getc2() const {return c2;}
} charModel;
CharModel::CharModel(): t0(256), t1(65536), t(),
c0(1), c1(0), c2(0), c3(0), c4(0), c5(0), c6(0), c7(0),
cxt(N), cp(N) {
for (int i=0; iadd(y);
// Update context
c0+=c0+y;
if (c0>=256) { // Start new byte
for (int i=8; i>1; --i)
cxt[i]=(cxt[i-1]+c0)*987660757;
c0-=256;
if (c0>32)
cxt[9]=(cxt[9]+c0)*34609027*4; // Last whole word (8 chars max)
else
cxt[9]=0;
cxt[10]=c0*837503159+c2*537496093; // Sparse models
cxt[11]=c0*840101893+c3*850090301;
cxt[12]=c1*769377353+c3*653589317;
cxt[13]=c3*368910977+c7*568909763;
cxt[14]=c1*950087393+c2*970092001;
cxt[15]=c2*629495183+c3*649492307;
c7=c6;
c6=c5;
c5=c4;
c4=c3;
c3=c2;
c2=c1;
c1=c0;
c0=1;
}
// Set up pointers to next counters
cp[1]=&t0[c0];
cp[2]=&t1[c0+c1*256];
for (int i=3; i<10; ++i)
cp[i]=&t[cxt[i]+(c0<<24)+((c0*3)&255)];
for (int i=10; iget0();
n1[i]=cp[i]->get1();
}
}
/* A MixModel mixes an array of counts to guess the next bit. The counts
are adaptively weighted to minimize cost. model(y, pr) updates all the
models with y, gets their bit counts, weights them, and returns the
probability P(1) * PSCALE in pr. The weights are adjusted to minimize
the cost of encoding y. There are C sets of weights, selected by
a context (the 3 upper bits of the last whole byte). */
class MixModel {
enum {N=19, C=8}; // Number of models, number of weight contexts
vector wt; // Context weights [C, N]
int bc0[N], bc1[N]; // Bit counts concatenated from various models
int cxt2; // Secondary context to select a weight vector
public:
MixModel();
~MixModel();
void model(int y, int& pr); // Update with bit y, then put next guess in pr
} mixModel;
MixModel::MixModel(): wt(C*N), cxt2(0) {
for (int i=0; i0 && s1>0) {
const int s=s0+s1;
const int sy=y?s1:s0;
const int sy1=0xffffffff/sy+(rnd()&1023) >> 10;
const int s1 =0xffffffff/s +(rnd()&1023) >> 10;
for (int i=0; i> 8;
wt[cn+i]=min(65535, max(1, wt[cn+i]+dw));
}
}
}
// Get counts
charModel.model(y, bc0, bc1);
recordModel.model(y, bc0+16, bc1+16);
matchModel.model(y, bc0+18, bc1+18);
// Update secondary context
cxt2=charModel.getc1()/(256/C);
// Predict next bit
int n0=1, n1=n0;
for (int j=0; j0);
while (n>2000000000/PSCALE) n/=4, n1/=4;
pr=int((PSCALE-1)*n1/n);
}
/* A Predictor adjusts the model probability using SSE and passes it
to the encoder. An SSE model is a table of counters, sse[SSE1][SSE2]
which maps a context and a probability into a new, more accurate
probability. The context, SSE1, consists of the 0-7 bits of the current
byte and the 2 leading bits of the previous byte. The probability
to be mapped, SSE2 is first stretched near 0 and 1 using SSEMap, then
quantized into SSE2=32 intervals. Each SSE element is a pair of 0
and 1 counters of the bits seen so far in the current context and
probability range. Both the bin below and above the current probability
is updated by adding 1 to the appropriate count (n0 or n1). The
output probability for an SSE element is n1/(n0+n1) interpolated between
the bins below and above the input probability. This is averaged
with the original probability with 25% weight to give the final
probability passed to the encoder. */
class Predictor {
enum {SSE1=256*4, SSE2=32, // SSE dimensions (contexts, probability bins)
SSESCALE=1024/SSE2}; // Number of mapped probabilities between bins
// Scale probability p into a context in the range 0 to 1K-1 by
// stretching the ends of the range.
class SSEMap {
U16 table[PSCALE];
public:
int operator()(int p) const {return table[p];}
SSEMap();
} ssemap; // functoid
// Secondary source encoder element
struct SSEContext {
U8 c1, n; // Count of 1's, count of bits
int p() const {return PSCALE*(c1*64+1)/(n*64+2);}
void update(int y) {
if (y)
++c1;
if (++n>254) { // Roll over count overflows
c1/=2;
n/=2;
}
}
SSEContext(): c1(0), n(0) {}
};
vector > sse; // [SSE1][SSE2+1] context, mapped prob
int nextp; // p()
int ssep; // Output of sse
int context; // SSE context
public:
Predictor();
int p() const {return nextp;} // Returns pr(y = 1) * PSCALE
void update(int y); // Update model with bit y = 0 or 1
};
Predictor::SSEMap::SSEMap() {
for (int i=0; i1023) p=1023;
if (p<0) p=0;
table[i]=p;
}
}
Predictor::Predictor(): sse(SSE1), nextp(PSCALE/2), ssep(512), context(0) {
// Initialize to sse[context][ssemap(p)] = p
for (int i=0; i=0; --i) {
int p=(ssemap(i*PSCALE/N)+SSESCALE/2)/SSESCALE;
int n=1+N*N/((i+1)*(N-i));
if (n>254) n=254;
int c1=(i*n+N/2)/N;
for (int j=oldp-1; j>=p; --j) {
for (int k=0; k=0x4000000) xmid+=(xdiff>>12)*p;
else if (xdiff>=0x100000) xmid+=((xdiff>>6)*p)>>6;
else xmid+=(xdiff*p)>>12;
// Update the range
if (y)
x2=xmid;
else
x1=xmid+1;
predictor.update(y);
// Shift equal MSB's out
while (((x1^x2)&0xff000000)==0) {
putc(x2>>24, archive);
x1<<=8;
x2=(x2<<8)+255;
}
}
/* Decode one bit from the archive, splitting [x1, x2] as in the encoder
and returning 1 or 0 depending on which subrange the archive point x is in.
*/
inline int Encoder::decode() {
// Split the range
const U32 p=predictor.p()*(4096/PSCALE)+2048/PSCALE; // P(1) * 4K
assert(p<4096);
const U32 xdiff=x2-x1;
U32 xmid=x1; // = x1+p*(x2-x1) multiply without overflow, round down
if (xdiff>=0x4000000) xmid+=(xdiff>>12)*p;
else if (xdiff>=0x100000) xmid+=((xdiff>>6)*p)>>6;
else xmid+=(xdiff*p)>>12;
// Update the range
int y=0;
if (x<=xmid) {
y=1;
x2=xmid;
}
else
x1=xmid+1;
predictor.update(y);
// Shift equal MSB's out
while (((x1^x2)&0xff000000)==0) {
x1<<=8;
x2=(x2<<8)+255;
int c=getc(archive);
if (c==EOF) c=0; // Fix for version 2
x=(x<<8)+c;
}
return y;
}
// Should be called when there is no more to compress
void Encoder::flush() {
// In COMPRESS mode, write out the remaining bytes of x, x1 < x < x2
if (mode==COMPRESS) {
while (((x1^x2)&0xff000000)==0) {
putc(x2>>24, archive);
x1<<=8;
x2=(x2<<8)+255;
}
putc(x2>>24, archive); // First unequal byte
}
}
// Read one byte from encoder e
int decompress(Encoder& e) { // Decompress 8 bits, MSB first
int c=0;
for (int i=0; i<8; ++i)
c=c+c+e.decode();
return c;
}
// Write one byte c to encoder e
void compress(Encoder& e, int c) {
for (int i=7; i>=0; --i)
e.encode((c>>i)&1);
}
// Fail if out of memory
void handler() {
printf("Out of memory\n");
exit(1);
}
// Read and return a line of input from FILE f (default stdin) up to
// first control character except tab. Skips CR in CR LF.
string getline(FILE* f=stdin) {
int c;
string result="";
while ((c=getc(f))!=EOF && (c>=32 || c=='\t'))
result+=char(c);
if (c=='\r')
(void) getc(f);
return result;
}
// User interface
int main(int argc, char** argv) {
set_new_handler(handler);
// Test the compiler
assert(sizeof(U8)==1);
assert(sizeof(U16)==2);
assert(sizeof(U24)==3);
assert(sizeof(U32)==4);
assert(sizeof(int)==4);
// Check arguments
if (argc<2) {
printf(
PROGNAME " v2 file compressor/archiver, (C) 2003, Matt Mahoney, mmahoney@cs.fit.edu\n"
"This program is free software distributed without warranty under the terms\n"
"of the GNU General Public License, see http://www.gnu.org/licenses/gpl.txt\n"
"\n"
"To compress: " PROGNAME " archive filenames... (archive will be created)\n"
" or (MSDOS): dir/b | " PROGNAME " archive (reads file names from input)\n"
"To extract/compare: " PROGNAME " archive (does not clobber existing files)\n"
"To view contents: more < archive\n");
return 1;
}
// File names and sizes from input or archive
vector filename; // List of names
vector filesize; // Size or -1 if error
int start_time=clock();
int uncompressed_bytes=0, compressed_bytes=0; // Input, output sizes
// Extract files
FILE* archive=fopen(argv[1], "rb");
if (archive) {
if (argc>2) {
printf("File %s already exists\n", argv[1]);
return 1;
}
printf("Extracting archive %s ...\n", argv[1]);
// Read PROGNAME "\r\n" at start of archive
if (getline(archive) != PROGNAME) {
printf("Archive file %s not in " PROGNAME " format\n", argv[1]);
return 1;
}
// Read "size filename" in "%d\t%s\r\n" format
while (true) {
string s=getline(archive);
if (s.size()>1) {
filesize.push_back(atol(s.c_str()));
string::iterator tab=find(s.begin(), s.end(), '\t');
if (tab!=s.end())
filename.push_back(string(tab+1, s.end()));
else
filename.push_back("");
}
else
break;
}
// Test end of header for "\f\0"
{
int c1=0, c2=0;
if ((c1=getc(archive))!='\f' || (c2=getc(archive))!=0) {
printf("%s: Bad " PROGNAME " header format %d %d\n", argv[1],
c1, c2);
return 1;
}
}
// Extract files from archive data
Encoder e(DECOMPRESS, archive);
for (int i=0; i2)
for (int i=2; i=0)
fprintf(archive, "%ld\t%s\r\n", filesize[i], filename[i].c_str());
}
putc(032, archive); // MSDOS EOF
putc('\f', archive);
putc(0, archive);
// Write data
Encoder e(COMPRESS, archive);
long file_start=ftell(archive);
for (int i=0; i=0) {
uncompressed_bytes+=size;
printf("%-23s %10ld -> ", filename[i].c_str(), size);
FILE* f=fopen(filename[i].c_str(), "rb");
int c;
for (long j=0; j0 && elapsed_time>0) {
printf(" (%1.4f bpc, %1.2f%% at %1.0f KB/s)",
compressed_bytes*8.0/uncompressed_bytes,
compressed_bytes*100.0/uncompressed_bytes,
uncompressed_bytes/(elapsed_time*1000.0));
}
printf("\n");
return 0;
}