Octane v1.01.20 - The Open Compression Toolkit for C++ | http://octane.sourceforge.net/ |
While we have tried to make the code as efficient as possible, the primary focus of Octane has been on developing a well-organized and flexible system for experimenting with a variety of compression algorithms.
However, there are compression tasks with different requirements than file compression. For example, consider the problem of compressing text messages in an on-line chat program. In this case, individual messages don't contain enough redundancy to yield effective compression. Instead, one may want to gather statistics "off-line", on some large corpus (for example on a collection of common English text), and then use the "pre-computed" statistics to compress short text messages. Difficult issues remain, such as dealing with messages that may deviate from the pre-computed statistical model (for example what happens if you build a model on English text and the users begin to chat in Chinese?).
While the Octane classes are well suited for file compression, a major focus of the development of Octane has been on supporting alternate, less-common compression tasks such as the on-line chat example above, which can be difficult to implement with existing off-the-shelf compression algorithms.
In dictionary-based compression algorithms, such as the popular Lempel-Ziv derivatives [bell], a dictionary of phrases or strings is built during compression, and references to these dictionary entries can be used in place of the original text. Redundancy in the files leads to higher compression.
In statistical compression, the goal is to explicitly model the probabilities of symbols(characters) during a parsing of the file, and use these predicted probabilities to minimise the bits needed to code observed events.
Statistical compression is based on the principles of information theory [shannon], which describe the minimal bits needed to transmit a message based on the "surprise" that the message represents; a perfectly predictable sequence of symbols requires no bits to code, while describing which outcome occurred out of a set of equally likely possibilities requires the most bits. In statistical compression, the better the prediction of symbols, the less bits are needed to encode the message (note that in statistical compression, the predictions do not have to be perfectly accurate in order to yield significant compression).
The modern statistical approach to compression [witten] describes the compression job as being performed by two semi-independent components.
The first component is the Model, whose job it is to model (or predict) the probability distribution of symbols over time. There are a wide range of algorithms that can be used to model symbol probabilities, from simple static frequency counting, to complex Markovian models, to complex heuristic prediction architectures.
The second component is the Coder. It is the job of the coder to use the probability distributions from the Model to efficiently generate output codes for the compressed file (and to decode these bits during decompression in order to decompress files). In a given context, symbols with high likelyhood of appearing are coded with fewer bits than symbols that are unlikely to occur.
The most well known algorithm for implementing a coder is the Huffman coder [apiki], which is sometimes described as "optimal" in the case where each symbol must be assigned an integral number of bits.
More recently [moffat], efficient implementations of an algorithm known as Arithmetic Coding have become more common. While more complicated than Huffman coding, Arithmetic coding is a theoretically optimal coder, and so is becoming the algorithm of choice in statistical compression.
Octane contains a complete set of classes to support the modern statistical approach to compression, providing semi-independent components for Modeling and Coding. We have extended this idea of separating components by introducing a third component, called a Parser. The job of the parser (a better but less obvious term to hose outside of compiler design might be lexer) is to 'tokenize' the input file(stream) into a set of symbols for processing by the Modeler and Coder. In the simplest case, a parser would simply read characters from a file and assign them a symbol number based on their ascii character code. In more complex cases, a parser might build a symbol set of words or common sub-strings.
Part of the appeal of modern statistical compression comes from the separation of these components, so that replacement Parsers, Modelers, Coders can be interchanged in a plug-and-play fashion. That is, once we have a good modeler and coder, we should be able to use the components with a character based parser, and also with a word based parser, etc.
Rather than the the traditional C language file reading/writing operations seen in many compression algorithms, all Octane functions which perform input and output do so by operating on bitio objects. The bitio classes provide an extremely flexible and transparent interface to working with files, memory, or strings. They provide the standard character-level reading and writing, as well as the bit-level input and output which is needed for compression routines, and additional convenience routines for writing other primitive data types. In addition to the simplification of code that comes with using the bitio classes, they insure that a compression function will work unchanged on files, direct memory, or C++ strings.
Octane uses Object-Oriented Programming (OOP) principles throughout its design. To write a new compression algorithm, you create a new object class derived from the OctaneCompressor class. All Compressors are therefore guaranteed to have a standard interface that they expose to the rest of the system. The base Compressor class also automatically perform some bookkeeping operations behind the scenes, such as timing compression/decompression algorithms and calculating compression ratios, and providing an interface for exposing certain parameters for user adjustment.
At a higher level of organization, Octane uses the Factory Pattern [gamma], to manage the collection of implemented compressors, and dynamically create compressor's when it needs to. In this way, Octane is not simply a collection of independent compressors that are instantiated by a target application. Compressor classes include code that automatically registers them with a CompressorManager object, which manages a list of all known compressors, and is able to dynamically instantiate compressors on demand. The CompressorManager provides a higher-level interface to a collection of compressors.
While the Octane Toolkit can support any kind of compression algorithm, it includes a a substantial set of classes specifically designed for modern statistical compression. These classes include the base OctaneModeler, OctaneCoder, and OctaneParser classes, as well as sample derivations of each of these classes which may serve as the basis for writing new compressors.
Octane also includes substantial code for helping you to test and evaluate new compressors. An OctaneTester class utility shows how to work with the OctaneManager to dynamically create and use compressors, and parameterize them. It can be used to interact with new compressors, compress/decompress files and strings, get timing information, etc. It can be run in a non-interactive command-line mode, or in an interactive console mode.