Octane v1.01.20 - The Open Compression Toolkit for C++

The Octane Compressor API - How to Write Your Own Octane Compressor Class

Writing Compressors for Octane

The main design goal of the Octane Open Compression Toolkit was to make it easy for programmers to develop and test well-behaved compression algorithms, without having to worry about many of the low level issues such as bitwise io and file operations. Octane provides an object-oriented framework which guarantees a standard interface to every compressor, which means that any compression algorithm you write becomes transparently available to all applications that uses the Octane Toolkit.

When you write a new compression-decompression algorithm, you do so by implemented a class derived from the base OctaneCompressor class. A detailed list of the member functions which make up the Octane Compressor API (program interface) is listed in the documentation for the OctaneCompressor class. This section is an attempt to explain the details of this API and describe the process by which you would go about writing your own custom Compressor class.

In addition to this document, we have provided a complete set of files for implementing a bare-minimum statistical compressor, which you can use as the skeleton for creating new compressors (see SampleCompressor, SampleModeler, SampleCoder, SampleParser).

The Octane API - Basic Runtime Type Information

Almost all classes in Octane derive from the base OctaneClass class. This class defines some basic virtual functions which are designed to provide some RTTI (run-time type information) to other classes. Your compressor class should at the minimum implement the GetDescription() member function, which returns a short (best if less than 60 character) description of the compressor class. Optionally you might provide a more elaborate description of your compressor class by implementing the GetHelpInformation() function.

The Octane API - State Information

The OctaneClass interface also defines functions for reporting memory/disk usage and debugging information. Its important that your compressor accurately report its memory usage, via GetMemoryUsed(), and the disk space required to store the state of your compression, via GetDiskspaceUsed(), so that accurate information is displayed when reporting performance of the compressor. You can optionally implement the ShowDebuggingInfo() function in order to display some internal details of the current compressor on request (for example showing the currently built dictionary).

In order to provide a uniform interface for working with compressor parameters, the OctaneClass defines a set of functions for setting and querying parameters which you define. By exposing compressor parameters with this interface you provide an easy way to test various settings either through command-line arguments or through the interactive tester console. Three functions make up the parameter interface: GetParametersInformation() which should list all parameters, their descriptions, and their current values, SetDefaultParameters() which should reset the parameters to their default values, and SetParameter() which sets a parameter to a given value. If your compressor has no adjustable parameters, you do not need to implement these functions.

If your compressor has any user modifiable parameters and/or contains any state information (i.e. if it builds state information by processing a training file), then you need to implement the LoadState() and SaveState() functions, which are responsible for saving and loading state information to a bitstream. The bitio classes provide convenient functions for saving and loading various data types, and do not require you to worry about the actual source or destination (which is typically a file).

The OctaneCompressor API - Auto Registration

If you examine the sample statistical compressor class (samplestatcompressor.cpp), you will see an unusual snippet of code in the sample compressor Constructor:

	// ...
	// REGISTER WITH COMPRESSOR MANAGER
	//  this use a trick to make sure the global manager is ready for registration
 	//  and at the same time automatically register this instance with the global manager
 	// note that the insurer is just temporary, and deletes when we exit the procedure
 	if (registerme)
		CompressorManager_SingletonInsurer managerinsurance(this);
	// ...

This code is designed to automatically register the compressor with a global CompressorManager object. This mechanism insures that the global compressor manager is informed about all the different compressor types no matter where they are defined, without having to add explicit code to register each new compressor in some central source code file.

The other piece of code that is required to make this scheme work is to add a global instance of the compressor:

	// Create a global instance in order to register it automatically with the global manager
	SampleStatCompressor GloballyInstantiated_SampleStatCompressor(true);

This global instance of the derived compressor also serves another important purpose, it creates a compressor object which acts as a "Factory" for dynamically creating new instances of the compressor at runtime [gamma].

To support the dynamic creation of new derived compressors (A Simple Walkthrough with the Octane Tester Console "create command"), your derived compressor class must implement the virtual MakeCompressorInstantiation() function, which should create a new instance of your compressor. In this way, the CompressorManager and tester utility can create new instances of your compressor when needed.

The OctaneCompressor API - Runtime Type Information

The OctaneCompressor class provides some additional runtime type information classes which your derived compressor needs to support. The GetClassName() needs to be implemented by you and should return the name of your compressor class. You do not need to implement the GetName() and GetInstantiatedName(), as they are automatically handled by the base OctaneCompressor class.

The OctaneCompressor API - Miscelaneous Functions

The OctaneCompressor class defines a variety of virtual functions which you can override in your derived compressor class if appropriate. We've tried to document these functions well enough in the header file (compressor.hpp) to make the functions self-explanatory. Many of these functions have suitable default implementations so that you do not need to reimplement them in your derived class.

	// ...
	// Reset the state of the compressor to its initial state
	virtual void ResetState() {;};
	// Synchronize state for a new stream - this will be called each time before beginning a new parsing stream
	virtual void SynchronizeStateForNewStream() {;};
	// Prepare for parsing mode, this will be called parsing begins.
	// return true on success
	virtual bool PrepareForCompression() {return true;};
	// Some compressors contains parsers which begin ready to parse, without any 'training'.
	// Here is where such a parser should be initialized.
	virtual void SetupAnyDefaultParser() {;};
	// is the compressor ready to compress? Normally this is false until a dictionary is built or loaded.
	// return true on success
	virtual bool IsReadyToCompress() {return true;};
	// ...

The OctaneCompressor API - Main Compression/Decompression Functions

There are two main functions that you must implement in order to support compression and decompression in your derived compressor class:

	// This is the member function which does the actual compression, and this is the function that should be subclasses by derived classes.
	//  It is wrapped by the public API call Compress() which performs some measurements on compression speed and streamsize.
	// note this function should never be called by an outside class - the API interface should be called instead.
	// return true on success
	virtual bool DoProtectedCompress(bitreader &from, bitwriter &to)=0;
	// This is the member function which does the actual decompression, and this is the function that should be subclasses by derived classes.
	//  It is wrapped by the public API call Decompress() which performs some measurements on decompression speed and streamsize.
	// note this function should never be called by an outside class - the API interface should be called instead.
	// return true on success
	virtual bool DoProtectedDecompress(bitreader &from, bitwriter &to)=0;

All compression and decompression is to and from bitio classes, which provide a very flexible interface for reading and writing, and allow your compress and decompress routines to operate transparently on files, strings, and raw memory. The reason that these functions are named the way they are (beginning with "DoProtected...") is that they are not meant to be called directly in order to compress or decompress data. Instead, external objects will invoke the base Compress() and Decompress() functions exposed by the base OctaneCompressor class, which wrap these protected functions and perform some additional operations.

The OctaneCompressor API - State Information

For all compressors that maintain state information (i.e. are trainable), you will need to provide implementations of some virtual functions for saving, loading, and creating state:

	// ...
	// This is the member function which does the actual training on an input in order to generate parsers/modelers/coders,
	//  and this is the function that should be subclasses by derived classes.
	//  It is wrapped by the public API call CreateSymbolsAndModelsUsingStream() which performs some measurements on speed and streamsize.
	// note this function should never be called by an outside class - the API interface should be called instead.
	// return true on success
	virtual bool DoProtectedCreateSymbolsAndModelsUsingStream(bitreader &from) {return true;};
	// This is the member function which does the actual saving of state,
	//  and this is the function that should be subclasses by derived classes.
	//  It is wrapped by the public API call SaveState() which performs some measurements on speed and streamsize.
	// note this function should never be called by an outside class - the API interface should be called instead.
	// return true on success
	virtual bool DoProtectedSaveState(bitwriter &to,bool fortempdecompressiononly) {return true;};
	// This is the member function which does the actual loading of state,
	//  and this is the function that should be subclasses by derived classes.
	//  It is wrapped by the public API call LoadState() which performs some measurements on speed and streamsize.
	// note this function should never be called by an outside class - the API interface should be called instead.
	// return true on success
	virtual bool DoProtectedLoadState(bitreader &from) {return true;};
	// ...

Again you'll notice that these functions begin with "DoProtected..." to signify that they are wrapped by publicly exposed functions. The DoProtectedCreateSymbolsAndModelsUsingStream() function is the one you will override in order to train your compressor based on an input file. The fortempdecompressiononly argument in the DoProtectedSaveState() function will be set to true when the compressor state is being saved in the header of a self-contained compressed file; the purpose of this flag is to allow your save state function to save only the minimal information needed to decompress a file - it can skip saving parameters and other information that is not needed for decompression.

Statistical Compressors - Overview

In the previous sections we have reviewed the basic Compressor API that is the core of all compressors. Octane also contains some higher level compressor classes to make it even easier for you to write certain kinds of compressors. The remainder of this section will describe how to build a new statistical compressor by writing your own parsers, modelers, and coders.

The modern (statistical) approach to compression algorithms treats the compressor as made up of two components, a Model(er) and a Coder. The statistical compressor framework provided in Octane extends this idea by introducing a third component, a Parser, which is responsible for tokenizing the input/output stream into discrete symbols. See the Introduction to Octane for a more detailed description of these components and statistical compression.

Statistical Compressors - The OctaneCompressor_Statistical Class

To write a new statistical compressor, you will begin by deriving your compressor from the OctaneCompressor_Statistical class. The API of this class is identical to the API of the OctaneCompressor class described earlier, except that the default implementations of the state management and compression decompression functions do not have to be changed by you at all.

This is because these functions delegate most work to the three components: an OctaneParser, an OctaneModeler, and an OctaneCoder.

For your derived compressor you will generally only need to implement the constructor, the GetClassName() function, and the GetDescription() function.

The real work will be done in implementing your derived Parser, Modeler, and Coder.

Before we describe these components, we note that to interface them with the compressor, you simply create and assign them in the constructor of your derived statistical compressor. For example, from the sample statistical compressor:

SampleStatCompressor::SampleStatCompressor(bool registerme) :
OctaneCompressor_Statistical(registerme)
{
	// constructor: create the children objects
	parserp=new SampleParser();
	modelerp=new SampleModeler();
	coderp=new SampleCoder();
	// REGISTER WITH COMPRESSOR MANAGER
	//  this use a trick to make sure the global manager is ready for registration
 	//  and at the same time automatically register this instance with the global manager
 	// note that the insurer is just temporary, and deletes when we exit the procedure
 	if (registerme)
		CompressorManager_SingletonInsurer managerinsurance(this);
	// now default setup of components
	SetupAnyDefaultParser();
}

OctaneCompressor_Statistical::~OctaneCompressor_Statistical()
{
	// destructor: delete children objects
	if (coderp!=NULL)
		delete coderp;
	if (modelerp!=NULL)
		delete modelerp;
	if (parserp!=NULL)
		delete parserp;
}

It's important to note that statistical compressors are designed so that the various components can be interchanged in a plug and play fashion. This means that often when you write a new compressor you may only be writing one new component and combining that with other existing parsers, modelers, or coders.

Statistical Compressors - The OctaneParser Class

The OctaneParser class defines a standard API with some default functions that you can override in your derived class. The main functions you may want to override are described below:

Some parsers may begin with a fixed symbol set (such as the ascii character), while others will build symbol sets by parsing a training file. You also need to be able to reset the parser to a default state when parsing new streams:

	// Process (train on) an input stream to update/create a symbol set from it.
	// return true on success
	virtual bool CreateSymbolSetUsingStream(bitreader &from) {return true;};
	// Synchronize state for a new stream - this will always be called before beginning a new parsing stream, and
	// should be used to reset the parse into any initial predictable state.
	virtual void SynchronizeStateForNewStream() {;};

In order to use the parser, other components of the statistical compressor will need to be able to access the parser symbols:

	// Get a count of the number of symbols stored in the parser.
	virtual int GetSymbolCount()=0;
	// Helper function to return the text string of a symbol number.
	// return a string with the text of the symbol
	// note this should be the empty string to signify end of stream symbol.
	virtual string LookupSymbolText(int symbolnum)=0;

The main functions that do the work of the parser are those that read symbols from the input stream and write symbols to the output stream:

	/// Parse the next symbol from the input and return its #.
	/// note The end of stream situation must to be handled specially:
	/// note When a parser encounters the end of a stream, it *MUST* return a symbol signifying an end-of-stream symbol.
	// note This end of stream symbol must be a unique symbol from the symbol set.
	// return false *after* the end of stream symbol is returned on prior call.
	virtual bool ParseNextSymbolFromInput(bitreader &from,int &symbolnum)=0;
	// Parse the next symbol from the input, and set symbolnum to the symbol id#,
	// return false *after* end of stream (i.e. first response at end of stream should be the end-of-stream symbol).
	virtual bool WriteSymbolText(bitwriter &to, int symbolnum,bool &isendofstreamsymbol)=0;

Notice that the parser interacts with other components in the statistical compressors by referring to symbol numbers. This is an important part of keeping the components modular. Only the Parser component knows what a symbol represents, whether it is a single character or a long phrase. The rest of the components just work on the basis of symbol #s.

Statistical Compressors - The OctaneModeler Class

The job of the OctaneModeler is to operate between the parser and the coder, producing continuously updated probability distributions over the symbolset created by the parser.

All of the components that make up a statistical compressor are responsible for keeping their own independent data structures. This design decision can lead to some duplication in data structures, but it provides maximum independence for these different components.

Like the OctaneParser, the OctaneModeler has functions for saving state, loading state, and creating state (training) from an input stream. Since training a modeler is so coupled to the symbolset built by the parser, the state building functions of the modeler both take a pointer to a parser. There are two alternative functions which may be called, depending on whether the parser begins initialized or not:

	// Build a model using a stream and a parser.
	// return true on success.
	virtual bool CreateModelUsingStream(OctaneParser *parserp, bitreader &from) {return true;};
	// Build a model just using parser.
	// This is called on startup, before any possible training, when a default parser exists.
	// note that no input (training) stream is provider; some modelers may ignore this.
	// return true on success
	virtual bool CreateModelUsingParser(OctaneParser *parserp) {return false;};
	// repare for modeling mode given a trained parser.
	// return true on success
	virtual bool PrepareForModeling(OctaneParser *parserp) {return true;};

The main work function that you must implement in your modeler is the function that is called as each symbol is parsed for compression, or decoded during decompression. The job of this function is to recompute the probability distribution after each symbol. This function is called by the statistical compressor during compression and decompression, and it passes a pointer to a Coder:

	// Update our model after each symbol is received during compression and decompression.
	// Modelers track symbols during parsing and update the probability distribution predictions
	//  for upcoming symbols.  This function should notify the coder of changes to probabilities.
	// return true on success
	virtual bool UpdateModelAfterReceivingSymbol(int symbolnumber, OctaneCoder *coderp) {return true;};

The reason that the compressor passes the coder pointer to the modeler when symbol probabilities are updated, is because it is the job of the modeler to inform the coder when symbol probabilities change. This somewhat unusual arrangement is used so that if the modeler decides not to change symbol probabilities (either because the model is static, or because the current context does not change), then the coder should not have to rearrange its state, which can often be quite expensive.

A sample zero-order modeler (ZeroOrderModeler) is provided that counts static symbol frequencies given an arbitrary parser.

To inform the Coder that symbol probabilities have changed, the Modeler makes a call to one of two notification member functions exposed by the coder, and described in detail below: ReceiveNotification_ModelChange_AllSymbolWeights(), and ReceiveNotification_ModelChange_SingleSymbolWeight().

The coder in turn will want to make calls to the modeler to get values for the symbol weights (probabilities). Two functions are provided, one very general which your modeler must support, which makes no assumptions about the underlying data structure used by the modeler to store symbol weights, and another optional accessor function for providing a more efficient interface to a symbol weight vector, for modelers which use a standard weightvector class which is provided with Octane (SymbolWeightVector):

	// Return the number of symbols in the modeler (should be same as number of symbols in the parser); called by coder.
	virtual int GetSymbolCount() {return 0;};
	// Return a pointer to the current weight vector for symbols (used by coder) [optional]
	// This is provided as a shortcut function to avoid having to make numerous calls to GetWeightVectorItem() below.
	// Not all modelers will maintain probabilities in the form of a SymbolWeightVector, and a modeler can return NULL in order to
	//  force the coder to get probabilities for each symbol individually.
	// return a pointer to a local SymbolWeightVector or NULL if it is not available.
	virtual SymbolWeightVector* GetWeightVectorp() {return NULL;};
	// Return a specific weight index value; (used by coder).
	virtual TSymbolWeightVectorWeight GetWeightVectorItem(int symbolnum) {return 0;};

A derived modeler which explicitly uses this weightvector is provided to make creating custom modelers even easier (OctaneModeler_WeightVectored).

Statistical Compressors - The OctaneCoder Class

The job of the coder component during compression is to assign bitcodes to each symbol number. During decompression the job of the coder is to read bits from the input stream and translate them into symbol numbers for translation by the parser.

The main work of the coder is performed in two functions which you need to override for your coder:

	// Write bits for symbol number
	// return true on success
	virtual bool WriteSymbolBits(int symbolnum, bitwriter &bw)=0;
	// Decode the next symbol from the input
	// return true on success, false when there are no more symbols to read.
	virtual bool DecodeSymbolFromInput(int &symbolnum, bitreader &br)=0;

Note that as with the Modeler, the coder cares only about symbol numbers, and knows nothing about what each symbol number represents, whether it is a single character, a word, or an arbitrarily large chunk of bytes.

Your coder must implement functions to receive notifications from the modeler for when symbol weights (probabilities) change:

	// Notify coder that all probabilities are being updated.
	// We can be informed by a model when the underlying probabilities are changing
	//  this part of the API exists so that we can flexibly and efficiently
	//  handle different kinds of situations where the coder needs to be synchronized
	//  with the model probabilities.  in some cases, a coder may ignore these
	//  messages and simply rebuild its internal datastructure on each coding event
	//  in other cases, it will want to incrementally update as model changes.
	// Two kinds of notifications are supported, depending on how the model updates
	//  itself during processing, whether it updates one symbol per iteration, or
	//  modifies all symbols at once.
	virtual void ReceiveNotification_ModelChange_AllSymbolWeights(OctaneModeler *modelerp) {;};
	// notify that a single symbol probability has changed (by default just calls AllSymbolWeights change above)
	// note the SingleSymbolWeight() call will NOT be made when all weights change
	virtual void ReceiveNotification_ModelChange_SingleSymbolWeight(OctaneModeler *modelerp,int symbolnum);

The pointer to the calling modeler is provided in these calls, so that the coder can in turn call the modeler to get the actual values for symbol weights, using the following modeler member functions described earlier: GetSymbolCount(), GetWeightVectorp(), and GetWeightVectorItem().

A working Huffman Coder (HuffmanCoder) is provided with the Octane classes to show a real working coder.

Generated on 20 May 2004 by doxygen 1.3.3