Octane v1.01.20 - The Open Compression Toolkit for C++ | http://octane.sourceforge.net/ |
This simple walkthrough shows how to use the interactive tester utility with a compressor (SubStrHuffCompressor) that is part of the core Octane Toolkit.
Begin by running octane.exe, which will launch a console program:
    OCTANE v1.01.15 - august 8, 2003 - http://octane.sourceforge.net
    Commands are case-sensitive (to use as commandline arguments separate with ';')

    Compression and Decompression
      'compress $inputfile [$outputfile]' to compress a file+compressor.
      'decompress $inputfile [$outputfile]' to decompress a file+compressor.
      'compressr $inputfile [$outputfile]' to compress a file; save only raw data.
      'decompressr $inputfile [$outputfile]' to decompress a file of raw data.
      'test $text' to test compress and decompress an arbitrary line of $text.
      'testfile $inputfile' to test compress and decompress a file.

    Compressor Creation, Configuration, and File Management
      'compressors' to get a list of compressors and compressor classes.
      'create $compressorclass' to instantiate a compressor class and select it.
      'setinfo $name [$guid]' to set name and GUID for instantiated compressor.
      'select [$name|$num|G$guid]' to select an instantiated compressor.
      'makestate $inputfile' to create state data using $inputfile.
      'save $outputfile' to save current compressor (set GUID with setinfo).
      'load $inputfile' to load/instantiate a previously saved compressor.
      'delete [$name|$num|G$guid]' to delete an instantiated compressor.

    Miscellaneous Commands
      'debug' to show debugging info for current compressor/dictionary.
      'parameters' to see a list of available runtime parameters.
      'set $parametername $value' to set a runtime parameter to some value.
      'run $commandfile' to run octane commands from a file.
      'help' to show these instructions.
      'quit' to exit this program.

    Registered Compressor Types:
      StatSample
      SubStrHuff
    Instantiated Compressors:
      (none)
    >

When the Octane tester utility launches, it displays a brief list of commands and then lists the 'Registered Compressor Types' (in our case there are 2 compressors registered), as well as any 'Instantiated Compressors' (we start with no compressors instantiated). You can redisplay this help at any time by typing the 'help' command (then hit enter).
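The help text notes that commands can be given as commandline arguments separated with ';'. How octane.exe actually parses its arguments is not shown in this walkthrough, so the following is only a minimal sketch of the idea: a hypothetical helper, splitCommands, that breaks one line into individual commands at each ';' and trims surrounding spaces.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Octane): split one line into individual
// commands at each ';', trimming spaces and skipping empty pieces.
std::vector<std::string> splitCommands(const std::string& line) {
    std::vector<std::string> commands;
    std::string::size_type start = 0;
    while (start <= line.size()) {
        std::string::size_type end = line.find(';', start);
        if (end == std::string::npos) end = line.size();
        const std::string cmd = line.substr(start, end - start);
        // Trim leading and trailing spaces.
        const std::string::size_type first = cmd.find_first_not_of(' ');
        const std::string::size_type last = cmd.find_last_not_of(' ');
        if (first != std::string::npos) {
            commands.push_back(cmd.substr(first, last - first + 1));
        }
        start = end + 1;
    }
    return commands;
}
```

Under these assumptions, the line "load mysubstr.oct; compressr book2.txt ;quit" yields three commands that could then be run in order.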
It's important to understand how the tester utility, as well as the underlying CompressorManager class, differentiates between Registered Compressor Types and Instantiated Compressors. The Registered Types reflect different Compressor classes written by developers in C++ using the Octane Toolkit. Instantiated Compressors are compressors that are built from (or instantiated from) these Registered Types. You could instantiate many compressors based on the same Registered Compressor Type, and each instantiated compressor can be named, parameterized, and trained differently. Instantiated compressors can be saved to and restored from a file. All compression and decompression is done with instantiated compressors, which are themselves built from Registered Compressor Types.
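The distinction between registered types and instantiated compressors can be pictured with a small factory sketch. ToyCompressor and ToyManager below are hypothetical stand-ins invented for illustration; they are not the real Octane Compressor or CompressorManager classes, whose interfaces are not shown here.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; these are NOT the real Octane classes.
struct ToyCompressor {
    std::string typeName;            // registered type this instance came from
    std::string name = "unnamed";    // per-instance name
    int maxSubstring = 12;           // example per-instance parameter
};

class ToyManager {
public:
    typedef std::function<std::unique_ptr<ToyCompressor>()> Factory;

    // Register a compressor type under a case-sensitive name.
    void registerType(const std::string& typeName) {
        factories_[typeName] = [typeName]() -> std::unique_ptr<ToyCompressor> {
            std::unique_ptr<ToyCompressor> c(new ToyCompressor());
            c->typeName = typeName;
            return c;
        };
    }

    // Instantiate a registered type; returns nullptr for an unknown name.
    ToyCompressor* create(const std::string& typeName) {
        std::map<std::string, Factory>::iterator it = factories_.find(typeName);
        if (it == factories_.end()) return nullptr;
        instances_.push_back(it->second());
        return instances_.back().get();
    }

    std::size_t typeCount() const { return factories_.size(); }
    std::size_t instanceCount() const { return instances_.size(); }

private:
    std::map<std::string, Factory> factories_;               // registered types
    std::vector<std::unique_ptr<ToyCompressor>> instances_;  // instantiated compressors
};
```

In this sketch, creating "SubStrHuff" twice gives two independent instances, so changing maxSubstring on one leaves the other at its default.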
Ok, so before we can compress anything, we need to instantiate a compressor to use. If we had previously created and saved one, we could load it now, but let's create and train a new one.
There are two Registered Compressor Types for us to choose from. The StatSample type is not really useful for compression (it's a sample compressor made to show the simplest skeleton needed to make your own compressor), so we will create an instance of the SubStrHuff compressor. SubStrHuff is a statistical compressor that calculates static frequencies of the most popular substrings found in a file and uses Huffman coding to assign bitcodes to each substring symbol. To create the compressor, type create SubStrHuff (the tester utility is case-sensitive) and hit enter:
    >create SubStrHuff
    compressor SubStrHuff.unnamed created and selected.

We can see confirmation that we've created a new compressor. For a little more information on the Registered Compressor Types, use compressors:
    >compressors
    Registered Compressor Types
    ---------------------------
    StatSample    Sample Statistical Compressor
    SubStrHuff    SubString Huffman Compressor

    Instantiated Compressors
    ------------------------
    *0 SubStrHuff.unnamed    SubString Huffman Compressor (NOGUID)

We now have a new instantiated compressor, SubStrHuff.unnamed, with index 0 and no GUID; the asterisk indicates that this is the currently selected compressor.
We will not use compressor names or GUIDs in this walkthrough. The purpose of a GUID is to provide a globally unique number associated with a compressor, so that compressor identities can be specified in very short messages.
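As background for the SubStrHuff compressor we just created: Huffman coding gives frequent symbols short bitcodes and rare symbols long ones. The sketch below is the generic textbook construction, computing only the code length of each symbol from its weight. It is not the Octane implementation, and the function name huffmanCodeLengths is made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Compute Huffman code lengths (in bits) for symbols with the given weights.
// Generic textbook construction; not the Octane implementation.
std::vector<int> huffmanCodeLengths(const std::vector<std::uint64_t>& weights) {
    const std::size_t n = weights.size();
    std::vector<int> lengths(n, 0);
    if (n == 0) return lengths;
    if (n == 1) { lengths[0] = 1; return lengths; }  // a lone symbol still needs one bit

    // parent[i] is the index of node i's parent; leaves are nodes 0..n-1.
    std::vector<std::size_t> parent(2 * n - 1, 0);
    typedef std::pair<std::uint64_t, std::size_t> Item;  // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    for (std::size_t i = 0; i < n; ++i) heap.push(Item(weights[i], i));

    std::size_t next = n;  // index of the next internal node to create
    while (heap.size() > 1) {
        // Merge the two lightest nodes under a new internal node.
        const Item a = heap.top(); heap.pop();
        const Item b = heap.top(); heap.pop();
        parent[a.second] = next;
        parent[b.second] = next;
        heap.push(Item(a.first + b.first, next));
        ++next;
    }
    const std::size_t root = next - 1;

    // A leaf's code length is its depth below the root.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t node = i; node != root; node = parent[node]) ++lengths[i];
    }
    return lengths;
}
```

For weights 5, 1, 1, 2 this gives lengths 1, 3, 3, 2: the heaviest symbol gets a one-bit code and the two rarest get three bits each.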
Let's get a list of the parameters supported by the SubStrHuff compressor, as well as their current values, by typing parameters:
    >parameters
    maxbuildsymbols  200000  parsing dictionary pruned to not exceed this size
    maxsymbols       2000    final dictionary pruned to reach this size
    maxSubstring     12      max size of Substrings in dictionary
    onlywords        0       only add whole word symbols? (bool)
    spanwords        0       encode symbols that span words? (bool)
    crends           1       count CRs as end-of-stream symbols? (bool)
    minweight        10      minimum threshold weight for final pruning
    smartlookup      1       when true, use lower_bound stl dictionary search
    parsemode        0       parse heuristic (0=greedy 1=lff 2=deep&slow)
    recalculations   0       recompression test&prune iterations

Each compressor type can have its own custom parameters. Let's change one of the parameters, maxSubstring, to the value 10, by typing set maxSubstring 10:
    >set maxSubstring 10
    ok.

Now let's "train" this compressor. Some compressors start out ready to perform compression, while others expect to have their state built (this is sometimes thought of as 'priming' or 'training'). For example the traditional file compression algorithms that people are most familiar with, such as the zip algorithm, don't need any pre-training, but the SubStrHuff algorithm is designed to first be trained on some large corpus of text, and then subsequently be used to compress files using the statistical information generated during training.
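To make the idea of training concrete, the sketch below gathers substring frequencies from a text, limited to a maximum length in the spirit of the maxSubstring parameter. It is a naive illustration only: countSubstrings is a hypothetical function, and Octane's actual dictionary builder (with its parse modes and size limits) is not shown in this walkthrough.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Count every substring of length 1..maxLen in the training text.
// Illustrative only: a naive pass, not Octane's dictionary builder.
std::map<std::string, std::size_t> countSubstrings(const std::string& text,
                                                   std::size_t maxLen) {
    std::map<std::string, std::size_t> counts;
    for (std::size_t start = 0; start < text.size(); ++start) {
        for (std::size_t len = 1; len <= maxLen && start + len <= text.size(); ++len) {
            ++counts[text.substr(start, len)];
        }
    }
    return counts;
}
```

Calling countSubstrings("abab", 2) counts "a", "b", and "ab" twice each and "ba" once. A real trainer would then turn such statistics into the compressor's saved state.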
We will train the SubStrHuff compressor using the book1.txt file, an 800k text file from the Calgary Corpus (which by default is already in the bin directory of Octane along with the octane.exe utility), by typing makestate book1.txt:
    >makestate book1.txt
    Computing state data using SubStrHuff.unnamed:
    Total symbols in dictionary before pruning: 94273
    After Stage 1 Pruning (removed scores below 10), symbols: 14332
    After Stage 2 Pruning (removed redundant prefixes), symbols: 10803
    After Stage 3 Pruning (killed worst scores to reach target), symbols: 1991
    execution time: 4.637 seconds.

The SubStrHuffCompressor displays some information about the training process, and the CompressorManager provides us with some timing information.
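The pruning output above mentions a score threshold and a final size target, which correspond in name to the minweight and maxsymbols parameters. A minimal sketch of those two stages follows. It assumes a symbol's score is simply a number attached to it, since how SubStrHuff really scores symbols is not described here, and it leaves out the redundant-prefix stage because its rule is not given. pruneDictionary and Entry are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::size_t> Entry;  // (symbol, score)

// Drop entries scoring below minWeight, then keep at most maxSymbols of the
// highest-scoring survivors. The score is an assumed stand-in for whatever
// weighting the real compressor uses.
std::vector<Entry> pruneDictionary(const std::map<std::string, std::size_t>& scores,
                                   std::size_t minWeight, std::size_t maxSymbols) {
    std::vector<Entry> kept;
    for (std::map<std::string, std::size_t>::const_iterator it = scores.begin();
         it != scores.end(); ++it) {
        if (it->second >= minWeight) kept.push_back(Entry(it->first, it->second));
    }
    // Highest score first; ties broken alphabetically so the result is deterministic.
    std::sort(kept.begin(), kept.end(), [](const Entry& a, const Entry& b) -> bool {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });
    if (kept.size() > maxSymbols) {
        kept.erase(kept.begin() + static_cast<std::ptrdiff_t>(maxSymbols), kept.end());
    }
    return kept;
}
```

With a threshold of 10 and a target of 3, a dictionary scoring "the" 50, "and" 30, "ing" 30, "of" 12, and "qz" 3 is reduced to "the", "and", "ing".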
Now that the compressor is trained we can begin to use it to compress files and text.
Let's begin by testing a short text string:
    >test Hello my friend, how are you feeling today?
    Testing SubStrHuff.unnamed..
    input length: 43
    output length: 22
    compressed: ...
    original: Hello my friend, how are you feeling today?
    decompressed: Hello my friend, how are you feeling today?

The test command uses the currently selected instantiated compressor to compress and then decompress the specified string, and checks that the decompression yields the original string. The original and compressed sizes are listed. Note that the compressed string is a binary string that may not display properly; it is shown only as a visual aid.
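The round-trip check that the test command performs can be sketched as follows. The codec here is a toy run-length scheme chosen only because it is short; it is a stand-in and has nothing to do with SubStrHuff. All names (rleCompress, rleDecompress, RoundTrip, testRoundTrip) are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Toy run-length codec used as a stand-in compressor (not SubStrHuff).
// Each run is stored as a count byte (1..255) followed by the repeated byte.
std::string rleCompress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        out.push_back(static_cast<char>(static_cast<unsigned char>(run)));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

std::string rleDecompress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        const std::size_t run = static_cast<unsigned char>(in[i]);
        out.append(run, in[i + 1]);
    }
    return out;
}

// Result of a round-trip check, mirroring the fields the test command prints.
struct RoundTrip {
    std::size_t inputLength;
    std::size_t outputLength;
    bool matches;  // true when decompression reproduced the input exactly
};

RoundTrip testRoundTrip(const std::string& text) {
    const std::string packed = rleCompress(text);
    const RoundTrip result = {text.size(), packed.size(), rleDecompress(packed) == text};
    return result;
}
```

The toy codec only shrinks inputs with long runs ("aaaabbb" packs into 4 bytes) and expands ordinary prose, which is one reason a trained statistical compressor is used for text instead.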
Now let's try to compress a file. There are actually two ways to compress a file from the Octane tester utility. The first way is to produce a standalone file which contains both the raw compressed data and the complete state information for the compressor used to generate it. This produces a bigger file, but one that can be decompressed without any other information (like a standard zip file). The other way is to produce an output file with only the raw compressed data. This yields a smaller data file, but to decompress the file, the compressor must first be recreated. Which method you use will depend on your application, and whether there is a benefit to pre-exchanging compressor state information. For compressors which do not require a separate training stage, there is no difference between these two methods.
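The difference between the two kinds of output file can be sketched with a toy container. The layout below (a 4-byte little-endian state length, the state bytes, then the raw data) is purely an assumption made for illustration; Octane's real file format is not described in this walkthrough, and packStandalone and unpackStandalone are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical container layout (NOT Octane's file format):
//   4-byte little-endian state length, state bytes, then raw compressed data.
std::string packStandalone(const std::string& state, const std::string& raw) {
    std::string out;
    const std::uint32_t n = static_cast<std::uint32_t>(state.size());
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<char>((n >> shift) & 0xFF));
    }
    out += state;  // compressor state travels with the data
    out += raw;    // a "raw" file would contain only this part
    return out;
}

// Split a standalone file back into its state and raw data parts.
void unpackStandalone(const std::string& file, std::string& state, std::string& raw) {
    if (file.size() < 4) throw std::runtime_error("truncated header");
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i) {
        n |= static_cast<std::uint32_t>(static_cast<unsigned char>(file[i])) << (8 * i);
    }
    if (file.size() - 4 < n) throw std::runtime_error("truncated state");
    state = file.substr(4, n);
    raw = file.substr(4 + static_cast<std::size_t>(n));
}
```

In this picture a standalone file always costs the state size plus a small header on top of the raw data, while a raw file relies on the receiver already holding an identical compressor.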
Let's begin by compressing book1.txt itself, as a stand-alone file, by typing compress book1.txt:
    >compress book1.txt
    Compressing with SubStrHuff.unnamed:
    original size: 785393 bytes (book1.txt)
    compressed size: 355178 bytes (book1.txt.compressed)
    est. cost of header: 19922 bytes.
    execution time: 0.691 seconds.
    compression ratio: 45.223% (3.61784 bits per byte).

After compression we are shown a summary of the compression details, as well as an estimate of the amount of space used by the compressor state information (header cost).
We can decompress the created file using the decompress command:
    >decompress book1.txt.compressed
    Recreated temporary compressor from input stream.
    Decompressing with SubStrHuff.unnamed:
    original size: 331216 (book1.txt.compressed)
    decompressed size: 785393 (book1.txt.compressed.decompressed)
    execution time: 0.22 seconds.
    compression ratio: 42.172% (3.37376 bits per byte).

Note that the tester says "Recreated temporary compressor from input stream." That is because when we compressed the file we saved the entire state of the compressor with the compressed data, so a temporary compressor was actually recreated to perform the decompression and then destroyed on completion.
Let's save this compressor so we can use it another day, by typing save mysubstr.oct:
    >save mysubstr.oct
    Saved compressor file 'mysubstr.oct'.
    state file size: 23961 bytes.

Quit the tester utility by typing 'quit'.
And now let's restart the tester utility and make sure the saving and loading of compressors actually works.
Type load mysubstr.oct to load our previously trained and saved compressor:
    >load mysubstr.oct
    Loaded compressor from file 'mysubstr.oct'.
    state file size: 23927 bytes.
    execution time: 0 seconds.

We can list the parameters of the compressor to verify that it has remembered all the parameters we set last time:
    >parameters
    maxbuildsymbols  200000  parsing dictionary pruned to not exceed this size
    maxsymbols       2000    final dictionary pruned to reach this size
    maxSubstring     10      max size of Substrings in dictionary
    onlywords        0       only add whole word symbols? (bool)
    spanwords        0       encode symbols that span words? (bool)
    crends           1       count CRs as end-of-stream symbols? (bool)
    minweight        10      minimum threshold weight for final pruning
    smartlookup      1       when true, use lower_bound stl dictionary search
    parsemode        0       parse heuristic (0=greedy 1=lff 2=deep&slow)
    recalculations   0       recompression test&prune iterations

Ok, now let's retest the string we tried before to see if it has remembered its training:
    >test Hello my friend, how are you feeling today?
    Testing SubStrHuff.unnamed..
    input length: 43
    output length: 22
    compressed: ...
    original: Hello my friend, how are you feeling today?
    decompressed: Hello my friend, how are you feeling today?

Good, same results. Lastly, let's make a raw compressed file from book2.txt, another file from the Calgary Corpus:
    >compressr book2.txt
    Compressing with SubStrHuff.unnamed:
    original size: 54083 bytes (book2.txt)
    compressed size: 22875 bytes (book2.txt.compressed)
    execution time: 0.051 seconds.
    compression ratio: 42.2961% (3.38369 bits per byte).

Notice that this time we used the 'compressr' command instead of the 'compress' command, so that the compressor state information is *not* saved inside the output file, resulting in less wasted space.
To decompress the raw compressed file we use the decompressr command:
    >decompressr book2.txt.compressed
    Decompressing with SubStrHuff.unnamed:
    original size: 22875 (book2.txt.compressed)
    decompressed size: 54083 (book2.txt.compressed.decompressed)
    execution time: 0.01 seconds.
    compression ratio: 42.2961% (3.38369 bits per byte).

That's the end of the walkthrough for the Octane tester utility.
Generated on 20 May 2004 by doxygen 1.3.3