NAME

pzpaq - Parallel ZPAQ file compressor and archiver


SYNOPSIS

pzpaq [-123cdehklvx] [-bblocksize] [-mmemory] [-tthreads] [files] ...


GENERAL

pzpaq compresses and decompresses files in the open standard ZPAQ format to achieve a high level of compression. The format is compatible with all ZPAQ level 1 compliant software including zpaq(1), zpipe(1), and libzpaq(3). pzpaq is primarily a file compressor but it can also create and extract archives containing more than one file.

pzpaq supports 3 built in compression levels, but understands others. The three levels are equivalent to the fast, mid, and max levels used in zpaq(1), zp(1), and zpipe(1). Unlike these programs, pzpaq can speed up compression and decompression by dividing input files into independent blocks (losing some compression and requiring more memory) and process them in parallel on multi-core machines.

The following table compares the 3 levels (-1, -2, -3) each with one or two threads (-t1, -t2) with some popular formats on the 14 file Calgary benchmark as a tar(1) file. Compressed size is in bytes. Compression and decompression times are wall (real) times measured on a lightly loaded dual core 2 GHz T3200 under 64 bit Linux. pbzip2 and p7zip are parallel versions of bzip2 and 7zip that run on 2 cores. Memory shows the requirement for compression and decompression respectively.

    Program        Size     Compress Decompress   Memory
    -------      ---------  -------- ----------  --------
    Uncompressed 3,152,896
    compress     1,325,806    0.2 sec  0.05 sec  <1    <1 MB
    gzip -9      1,022,810    0.5      0.05      <1    <1
    bzip2          860,097    0.6      0.3        7     4
    pbzip2         857,292    0.4      0.3       25    16
    p7zip          824,573    1.5      0.1      192    18
    pzpaq -1 -t1   806,972    2.4      2.5       38    38
    pzpaq -1 -t2   818,007    1.5      1.6       76    76
    pzpaq -2 -t1   699,191    7.4      7.5      112   112
    pzpaq -2 -t2   708,526    4.6      4.8      224   224
    pzpaq -3 -t1   644,203   19.6     19.8      247   247
    pzpaq -3 -t2   653,100   13.2     13.5      494   494

The command line interface is similar to that of compress(1), gzip(1), and bzip2(1). The default behavior of pzpaq is to compress each of the files on the command line and replace each file with a single file archive file.zpaq, saving the original filename with a path as specified on the command line, and the original size as a comment. Archives can be concatenated together to produce multi-file archives, or can be compressed all at once to a solid archive to standard output with -c for better compression.

Decompression with -d replaces each file.zpaq with the original file, ignoring any saved filenames. You can also extract using the saved filenames to the current directory with -e or the originally specified directory with -x. File attributes other than the filename such as owner, permissions, or modification date are not preserved.

If any output files or temporary files already exists then they are overwritten.

If no file names are given, then pzpaq compresses or decompresses from standard input to standard output. Data from standard input cannot be processed in parallel. Original filenames and sizes are not saved during compression and are ignored during decompression.


OPTIONS

Options and files may be in any order. Options with arguments should be written without spaces, like -t4. Options without arguments may have other options appended, like -3ct4 for -3 -c -t4. If conflicting options from the set -123dexl are used, then only the last one has any effect.

--

Stop option processing. Any arguments afterward are treated as file names even if they start with -.

-1 -2 -3

Select fast, medium, or best compression respectively. The default is -2 (medium). Higher numbers compress better but are slower and require more memory. Decompression will require about the same time and memory as compression.

-bN

For compression, divide the input into blocks of N bytes which can be compressed in parallel by multiple threads for better speed. -b0 selects a block of effectively infinite size, which means that only one thread can compress each file. Otherwise, the allowed range of block sizes is 4096 to 2147483647 (2^31 - 1). Arguments outside this range will be adjusted to the nearest allowed value.

The default block size is the total size of all input files divided by the number of threads selected by -t, rounding up, but within the permitted range. If any input file has an unknown size, such as when compressing from standard input, then the default is -b0.

Note that a small block size generally makes compression worse, but a large block size can be slower because there cannot be more threads running at once than there are blocks. The default balances the load among threads.

-c

Compress or decompress to standard output and do not delete the input files when done (implies -k). This often compresses better than compressing files individually. The output is concatenated in the order specified on the command line. For decompression, saved filenames are ignored.

-d

Decompress and replace each file.zpaq with file. If file does not have a .zpaq extension, then the output is named file.out. Saved filenames are ignored.

-e

Extract the contents of archive file to the current directory using the saved filenames, ignoring any saved path. Existing files are overwritten. If no filename is saved, then the output is named like -d. Keep file (implies -k).

-h

Print a help message.

-k

Keep input files after successful compression or decompression. The default is to delete them.

-l

List the archive contents of files to standard output. If no files are specified, then list from standard input.

A listing displays for each block in each file, the memory in MB (1,000,000 bytes, rounding up) required to decompress it. For each segment in the block, the first 8 hexadecimal digits of the checksum (if any), saved filename (if any), and comment (if any) are shown.

-mN

Limit the number of threads that can run at one time so that the total memory used is not more than N MB (N million bytes). The default is -b500, which is sufficient for -3 -t2. If any single block still requires more than N MB, then it will still attempt to run.

-tN

Use up to N threads. The default is -t2. The effect is to speed up compression or decompression when more than one core or processor is available by processing up to N blocks in parallel. Memory usage is equal to the sum of the memory requirements of all blocks being processed at the same time.

Reading from standard input implies -t1. Specifying N greater than the number of available cores or processors or greater than the number of input blocks does not increase speed.

The Windows version is limited to 64 threads. Larger values will be interpreted as -t64.

-v

Verbose. Report progress to standard error. Normally there is no output unless there is an error. Errors and warnings are always printed to standard error in either case.

-x

Extract the contents of archive file using the saved filenames like -e but including the saved path. Interpret both backslashes (\) and forward slashes (/) in the saved path as equivalent directory separator characters. Existing files are overwritten. It is an error if the path specifies a directory that does not exist. If no filename is saved then the output is named as with -d. Keep file (implies -k).


EXIT STATUS

Returns 1 if there is an error or 0 otherwise.


ENVIRONMENT

TMPDIR
TEMP

The first of these which is set specifies the directory in which temporary files will be placed. If neither is set, then the default is /tmp. TMPDIR is the POSIX recommendation for the temporary directory, but not all systems set it automatically. TEMP is normally set in Windows to the user's temporary directory.


FILES

See Decompression Optimization below.


CONFORMING TO

pzpaq reads and writes archives in ZPAQ level 1 format (see availability).


NOTES

Archive Format

The ZPAQ level 1 standard defines a ZPAQ archive format. The format is designed to be read or written in a single pass. It consists of one or more blocks that can be read and decompressed independently. Each block contains one or more segments that must be decompressed in sequence from the start of the block. Each block header contains a description of the decompression algorithm. Each segment consists of an optional filename, an optional comment, compressed data, an end of data marker (4 to 7 zero bytes) and an optional 20 byte SHA-1 checksum of the uncompressed data it represents. Segments without a filename are considered to be appended to the previous segment, which might be in the previous block. Blocks may be embedded in arbitrary non-ZPAQ data if they are preceded with an optional 13 byte marker tag. A tag is optional for blocks that start at the beginning of a file or immediately after another block. The filename and comment are NUL terminated strings with no length limit.

The decompression algorithm in the block header is described by two strings, a decoding model and an optional postprocessor, each with a maximum length of 65535 bytes. The first string is stored. The second string, if present, is compressed using the described model at the beginning of the first segment. The first byte string is a program in an interpreted, sandboxed, virtual machine language called ZPAQL that describes a context model and a program to compute contexts. The second string is also a ZPAQL program that inputs the decoded data and outputs the final data. If absent, the decoded data is output directly.

pzpaq always produces archives such that the first block of each file (but not subsequent blocks) is preceded by a marker tag. Compression levels -1, -2, and -3 describe different models, none of which have a postprocessor. Filenames are stored exactly as specified on the command line after wildcard expansion. The comment is stored as a decimal string representing the size of the file. The checksum is always computed and saved and checked upon decompression. When compressing from standard input, the filename and comment fields are left empty.

When compressing with a block size other than infinite (-b0), files may be split into separate blocks. The filename will be stored only for the first block. The comment will be the size of the block after decompression, not the total file size. For subsequent blocks, the comment will have the form size+start where start is the offset from the start of the file as a decimal number. Comments have no effect on decompression but are shown by -l.

When compressing with -c, a block may contain more than one file. For example, if each input file has size 10000 then:

    pzpaq -t2 file1 file2 file3 -c >archive.zpaq
    pzpaq -l archive.zpaq

will show two blocks of 15000 bytes each with file2 split between the two blocks with the second segment unnamed.

    Block 1
      file1 10000
      file2 5000
    Block 2
            5000+5000
      file3 10000

Without -c, then the result would be file1.zpaq, file2.zpaq, and file3.zpaq each as a single block, because each file is small enough to fit in the default block size of 15000. However with -t4 and a default block size of 7500, each file would contain two blocks with one segment each of size 7500 and 2500 respectively.

Decompression Optimization

pzpaq has a feature like the zpaq ox command that speeds up decompression of archives compressed with other ZPAQ compatible programs (like zpaq or future versions of pzpaq). For the built in compression levels, -1, -2, and -3, pzpaq will recognize the compression level from the block header and execute optimized code rather than interpret the instructions. This is typically twice as fast.

Depending on a compile time option, pzpaq will either interpret the ZPAQL code from the archive block header of nonstandard compression levels or will generate and execute new optimized code to improve speed. The latter option generates C++ code and requires an external C++ compiler to compile it before it can be run.

If optimized decompression is enabled, then the following files will typically be installed under Linux:

    /usr/bin/pzpaq
    /usr/lib/zpaq/pzpaq.o
    /usr/lib/zpaq/libzpaq.o
    /usr/include/libzpaq.h

A Windows installation might be as follows, assuming c:\bin were in one's PATH:

    c:\bin\pzpaq.exe
    c:\bin\zpaq\pzpaq.o
    c:\bin\zpaq\libzpaq.o
    c:\bin\zpaq\libzpaq.h

The exact locations of these files will depend on a compile time option during installation. libzpaq.h and pzpaqopt.o are from libzpaq(3). pzpaq.o is from pzpaq with optimization disabled. pzpaq optimization is enabled by compiling with -DOPT set to a command to compile $1.cpp to $1.exe such that it includes the given header file and links to the two object files.

    g++ -O3 -DNDEBUG pzpaq.cpp libzpaq.cpp libzpaqo.cpp -lpthread -o pzpaq \
      -DOPT="\"g++ -O3 \$1.cpp /usr/lib/zpaq/pzpaq.o /usr/lib/zpaq/libzpaq.o -lpthread -o \$1.exe\""
    g++ -O3 -DNDEBUG -c libzpaq.cpp
    g++ -O3 -DNDEBUG -c pzpaq.cpp

Note that pzpaq is not built from pzpaq.o.

The Windows version might be built:

    -DOPT="\"g++ -O3 $1.cpp c:\\bin\\zpaq\\pzpaq.o c:\\bin\\zpaq\\libzpaq.o -Ic:\\bin\\zpaq -o $1.exe\""

In either case, all instances of -O3 should be replaced with appropriate local optimizations. The following are usually appropriate:

    -O3 -march=native -fomit-frame-pointer -s

-DNDEBUG turns off run time checks for better speed, but this has no effect on $1.cpp so it may be omitted in -DOPT.

The -h option will show whether decompression optimization is enabled, and if so, the value of OPT. When enabled and the decompressor detects an archive compressed with a nonstandard level, it will generate a source code file $1.cpp where each instance of $1 in OPT is replaced with a temporary filename, delete $1.exe, execute OPT as a shell command, then test whether $1.exe exists, indicating that compilation succeeded. If so, then $1.exe is a program that decompresses equivalently to pzpaq except faster. pzpaq then passes its arguments to $1.exe, executes it, and deletes the two files $1.cpp and $1.exe. If compilation fails, then pzpaq falls back to decompressing the archive by the slower method of interpreting the archive header. The source file is not deleted in this case.

The temporary file base $1 has the form pzpaqtmppid_0 where pid is the process ID. This prevents file collisions if more than one instance of pzpaq is run at the same time. The files are placed in the temporary directory depending on the environment variables TMPDIR and then TEMP, or /tmp if neither is set.

Parallel Compression

pzpaq compresses and decompresses in parallel by dividing the input into independent blocks. For each output file, the first block is output directly and all other blocks are written to temporary files. When all threads have finished, the temporary files are concatenated to the appropriately named output files and deleted.

Temporary filenames have the form pzpaqtmppid_n where pid is the process ID and n is the file number, 1 or higher. They are placed in the temporary directory determined by TMPDIR, then TEMP, or /tmp if neither is set.

Specifying more threads with -t than there are processor cores does not run any faster. Speedup may be less than linear because threads compete for cache and memory even if they run on separate processors. Also, some processors may reduce clock speed depending on the number of cores in use to regulate temperature or power consumption.

Specifying more threads with -t increases memory consumption. The -m option limits memory usage by limiting the number of threads. When compressing with level -3, each thread requires 247 MB. The default -m500 allows 2 threads. However, archives produced with other programs could require arbitrarily large memory. If a single thread would exceed -m, it would run anyway.

The default block size depends on -t. A larger -t means a smaller default block size. Generally, smaller blocks make compression worse. You can override the default with -b. However, using a smaller block will make compression worse with no increase in speed, and using a larger block will compress slower.

-b may still be useful because decompression speed depends on the number of input blocks chosen during compression. For example, it may be useful to choose a smaller block size when compressing a lot of files at once so that individual files can be split into blocks so they can be decompressed faster.

The Windows version is limited to -t64 due to a limitation of WaitOnMultipleObjects() waiting for worker threads to finish. The Linux version does not have this limitation. It uses the pthread library and uses condition variables to signal the main thread when a worker thread finishes. The limitation in Windows can be removed by compiling with -DPTHREAD and linking with the Windows version of the pthread library (see availability).

-DPTHREAD is implied by -Dunix, which is normally defined automatically in Linux but not Windows. If not defined, Windows is assumed.


EXAMPLE

To compress the files in directory calgary, e.g. replace calgary/book1 to calgary/book1.zpaq:

    pzpaq calgary/*

To decompress and restore the original files:

    pzpaq -d calgary/*.zpaq

To put files into an archive with one block per file:

    pzpaq calgary/*
    cat calgary/*.zpaq > calgary.zpaq

Or for better compression, packing files into 2 blocks:

    pzpaq -c calgary/* > calgary.zpaq

For best possible compression, packing files into 1 block:

    pzpaq -c -3 -b0 calgary/* > calgary.zpaq

To list contents (calgary/book1, etc.):

    pzpaq -l calgary.zpaq

To extract and restore the original files:

    pzpaq -x calgary.zpaq


BUGS

Output files and temporary files are clobbered without warning.

It is possible to make small archives that fill up the disk when decompressed.

Small block sizes, lots of threads, and a high memory limit can strain system resources.

There is no check for archives containing two or more files with the same name (with same or different paths). Multithreaded extraction will probably corrupt both of them and still report valid checksums.

Probably others.


AVAILABILITY

pzpaq, libzpaq, and the ZPAQ level 1 standard specification and reference decoder are available at http://mattmahoney.net/dc/zpaq.html.

A Windows version of the pthreads library is available from http://sourceware.org/pthreads-win32/.


AUTHORS

pzpaq is written by Matt Mahoney.

pzpaq is copyright 2011, Dell Inc. It is licensed under the GNU General Public License (GPL) version 3, or at your option, any higher version. See http://www.gnu.org/licenses/.


SEE ALSO

bzip2(1) cat(1) compress(1) gzip(1) p7zip(1) tar(1) sha1sum(1) zpaq(1) zpipe(1) libzpaq(3)