Octane v1.01.20 - The Open Compression Toolkit for C++ | http://octane.sourceforge.net/ |
#include <substringparser.hpp>
Inheritance diagram for SubstringParser:
It can be configured to build a symbol set using only whole words, or with substrings that are not affected by word boundary (max symbol length can be set). Various input parsing strategies can be used, from greedy (fastest) to longest first, to near optimal (slowest). Various heuristics for pruning the symbol set can be applied, which can be used to get the symbol set down to a desired size.
Definition at line 154 of file substringparser.hpp.
Public Member Functions | |
SubstringParser () | |
Constructor. | |
~SubstringParser () | |
Destructor. | |
virtual bool | CreateSymbolSetUsingStream (bitreader &from) |
Process (train on) an input stream to update/create a symbol set from it. | |
virtual bool | IsReadyToParse () |
are we ready to parse? i.e. has symbol set been built. | |
virtual bool | RewindAnyBufferedInput (bitreader &from) |
Let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed. | |
virtual void | SynchronizeStateForNewStream () |
Synchronize state for a new stream - this is called every time a new stream is parsed, to allow parser to get into 'start' state,. | |
virtual int | GetSymbolCount () |
Get a count of the number of symbols stored in the parser. | |
virtual bool | ParseNextSymbolFromInput (bitreader &from, int &symbolnum) |
Parse the next symbol from the input and return its #. | |
virtual bool | WriteSymbolText (bitwriter &to, int symbolnum, bool &isendofstreamsymbol) |
Parse the next symbol from the input, and set symbolnum to the symbol id#,. | |
virtual string | LookupSymbolText (int symbolnum) |
Helper function to return the text string of a symbol number. | |
virtual void | ShowDebuggingInfo () |
show any debugging info on request (used by various derived classes) [optional] | |
virtual unsigned int | GetMemoryUsed () |
Report actual current memory usage (in bytes). | |
virtual unsigned int | GetDiskspaceUsed (bool fortempdecompressiononly) |
Report disk space (in bytes) that would be used when saving state to file (in bytes). | |
virtual std::string | GetParametersInformation () |
Reports information about any parameters available. | |
virtual void | SetDefaultParameters () |
Reset any parameters to their default values. | |
virtual bool | SetParameter (const std::string ¶metername, const std::string ¶metervalue) |
This function is called to set a variable value. | |
virtual bool | SaveState (bitwriter &to, bool fortempdecompressiononly) |
Save state of object to output stream (if appropriate). | |
virtual bool | LoadState (bitreader &from) |
Load state of object from input stream (if appropriate). | |
int | GetSetSymbolCount () |
int | LookupSymbolNumberFromText (string wordstring) |
Protected Types | |
typedef std::set< SubstringSymbol *, TSubstringSymbolStringGreater< SubstringSymbol * > > | TSymbolSetTYPE |
Typedef for stl set of symbols (used for fast alphabetical lookup). | |
typedef std::vector< SubstringSymbol * > | TSymbolVectorTYPE |
Typedef for stl vector of symbols (used for fast lookup by id#). | |
Protected Attributes | |
TSymbolSetTYPE | symbolset |
STL set of symbols (used for fast alphabetical lookup). | |
TSymbolVectorTYPE | symbolvector |
STL vector of symbols (used for fast lookup by id#). | |
TSymbolSetTYPE::iterator | symbolset_pos |
STL set iterator - this is a non-local variable to eliminate cost of building it on stack on every local function. | |
TSymbolVectorTYPE::iterator | symbolvector_pos |
Stl vector iterator - this is a non-local variable to eliminate cost of building it on stack on every local function. | |
char | inputbufferstr [SubStrHuff_BiggestSubStringLen] |
Current window buffer for parsing, filled as we read input characters. | |
int | inputbufferlen |
Length of input window buffer. | |
bool | senteos |
The parser needs to return an end-of-symbol character before finishing reading the stream, so we need to track whether we sent an EOS yet. | |
unsigned int | Parameter_MaxSymbols_DuringBuild |
The dictionary will never be allowed to get bigger than this size (smaller will also increase building speed) when this size is exceeded, we will eliminate all symbols with frequencies of smaller than some small #. | |
unsigned int | Parameter_MaxSymbols_Final |
After the entire training file is parsed, we will repeat pruning steps to reduce it to a *maximum* of this many entries the actual symbols kept may be smaller depending on other pruning parameters. | |
int | Parameter_MaxSubstringSize |
How many characters is largest (sub)string stored in dictionary. | |
bool | Parameter_OnlyCodeWholeWords |
Only store characters and whole words? | |
bool | Parameter_SpanWordBoundaries |
Store Substrings that span words (i.e. they contain multiple words). | |
bool | Parameter_CountCRsAsEOTs |
unsigned int | Parameter_PruneMinimumWeight |
Minimum count for a word or Substring in order to even consider leaving it in the dictionary. | |
int | Parameter_ParseMode |
We can either use a greedy search to find longest Substring starting with first char or we can use smarter but slower heuristics. | |
bool | Parameter_UseSmartLookup |
We can either use a smart search using stl find_lower_bound() to find longest Substring or we can search starting at biggest string backwards. | |
bool | Parameter_PruneReCalculations |
How many iterations should we recompress to test heuristic symbolset cost, if any this tests the currently built symbol set to 'reencode' the input stream and calculates actual usage counts and then reprunes. | |
int | symbolcount |
Keeps track of how many symbols have we created for fast replies. | |
int | endofstreamsymbolnum |
Keeps track of end of stream symbol number for fast replies. | |
bitreader * | current_bitreaderp |
Stores last bitreader , so that we can rewind when we want to reparse and check our heuristic symbolset costs. | |
size_t | start_bitreaderposition |
Stores last bitreader starting position before parsing, so that we can rewind when we want to reparse and check our heuristic symbolset costs. |
|
Process (train on) an input stream to update/create a symbol set from it.
Reimplemented from OctaneParser. Definition at line 281 of file substringparser.cpp. References current_bitreaderp, Parameter_MaxSubstringSize, Parameter_MaxSymbols_DuringBuild, start_bitreaderposition, and symbolset.
00282 { 00283 // scan a file and add its "tokens" into our symbol set 00284 string valuestring; 00285 string slidingwindowstring; 00286 unsigned char c; 00287 bool slidingwindowfull=false; 00288 string eotstring=string(""); 00289 00290 // clear any existing symbols 00291 FreeData_Symbols(); 00292 // create primitive characters 00293 AddPrimitiveCharacterSubstringSymbols(); 00294 00295 // save the current bitreader and bitreader position for reparsing during pruning 00296 current_bitreaderp = &from; 00297 start_bitreaderposition = from.tell_bit(); 00298 00299 // now parse the file 00300 while (!from.empty()) 00301 { 00302 // grab a character 00303 c=from.get_byte(); 00304 00305 // add the character to our window 00306 slidingwindowstring+=c; 00307 00308 // trim lefthand side if it is too big 00309 if (!slidingwindowfull) 00310 { 00311 // we keep a sliding window which is slightly bigger than the maxSubstring, to account for delimiter boundaries 00312 if ((int)(slidingwindowstring.length())>Parameter_MaxSubstringSize+5) 00313 { 00314 // set the flag saying from now on, we are full 00315 slidingwindowfull=true; 00316 } 00317 } 00318 if (slidingwindowfull) 00319 { 00320 // trim leftmost character 00321 slidingwindowstring.erase(0,1); 00322 } 00323 00324 // now add the appropriate Substrings of slidingwindowstring to our dictionary SymbolSet 00325 AddSubstringsFromSlidingWindowString(slidingwindowstring); 00326 00327 // when encoding single lines, each line must end with an end-of-stream (EOS) symbol 00328 // but if we train on a file, which is large and has only one EOS at end of file 00329 // then the freq. of EOS will be very low (1), so here we can say to count a pretend EOS 00330 // whenever we see a CR 00331 if (c==13 && Parameter_CountCRsAsEOTs) 00332 { 00333 // increment frequency for EOS 00334 UpdateValueStringInSymbolSet(eotstring,1); 00335 } 00336 00337 if (symbolset.size()>Parameter_MaxSymbols_DuringBuild) 00338 { 00339 // periodically we may have to prune away lowest frequency symbols in order to keep dictionary manageable 00340 // how many should we remove - we don't want to remove just one since that would result in many repetitive calls here 00341 // so we just pick some fraction to remove 00342 PruneSymbolSet_ReduceSizeTo(((unsigned int)(symbolset.size())/5)+1); 00343 } 00344 } 00345 00346 // now final pruning 00347 PruneSymbolSet(); 00348 00349 // and now build the vector for fast random access 00350 BuildSymbolVector(); 00351 00352 // return success 00353 return true; 00354 } |
|
Let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed. it is necessary because some parsers can read-ahead in input buffer, and so must rewind the bitstream. Reimplemented from OctaneParser. Definition at line 258 of file substringparser.cpp. References inputbufferlen, bitreader::seek_bit(), and bitreader::tell_bit().
00259 { 00260 // let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand 00261 // off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed. 00262 // it is necessary because some parsers can read-ahead in input buffer, and so must rewind the bitstream. 00263 // we have to implement this since we do a read-ahead. note this wouldn't be called in normal circumstances, 00264 // just if compressors wants us to stop prematurely and hand off the job to someone else. 00265 // return true on success 00266 00267 // the number of characters we have read ahead in buffer is inputbufferlen 00268 from.seek_bit(from.tell_bit()-(inputbufferlen*8)); 00269 return true; 00270 } |
|
Parse the next symbol from the input and return its #.
Implements OctaneParser. Definition at line 362 of file substringparser.cpp. References current_bitreaderp, bitreader::get_byte(), SubstringSymbol::get_symbolvectorpos(), inputbufferlen, inputbufferstr, Parameter_MaxSubstringSize, and senteos.
00363 { 00364 // grab an input stream symbol and set its INDEX (in symbol vector) for symbolnum 00365 // return false after EOS 00366 unsigned char c; 00367 int maxSubstringlen=Parameter_MaxSubstringSize; 00368 SubstringSymbol *symbolp; 00369 // note that inputbufferlen are preserved across calls, and inputbufferstr[] is preserved state info 00370 00371 // try to fill up the inputqueue to Parameter_MaxSubstringSize size, or as much as we got 00372 current_bitreaderp=&from; 00373 while (inputbufferlen<maxSubstringlen && !from.empty()) 00374 { 00375 // add the character 00376 c=from.get_byte(); 00377 inputbufferstr[inputbufferlen]=c; 00378 // increment character positions 00379 ++inputbufferlen; 00380 } 00381 00382 if (inputbufferlen==0) 00383 { 00384 // no more symbols left - BUT the question now is, do we return an EOS symbol, or false for no symbols left 00385 if (senteos) 00386 { 00387 // we already sent an EOS so from now on any requests for a symbol returns false saying no more symbols available 00388 return false; 00389 } 00390 else 00391 { 00392 // we are going to drop down to return the EOS signal, but we set flag so we don't do it again 00393 senteos=true; 00394 } 00395 } 00396 00397 // ok now we have a block of up to Parameter_MaxSubstringSize bytes from the left of the inputstr 00398 // now find the longest leftmost (prefix) Substring in our dictionary and encode it, and shift inputqueuestr to the left with remaining bytes 00399 // and return the new inputqueuestr with remaining bytes, and return length of remaining bytes. 00400 symbolp=FindNextSymbolToEncode(inputbufferstr,inputbufferlen); 00401 // symbolp=FindNextSubstringSymbolpFromInputQueueStr(inputbufferstr,inputbufferlen); 00402 00403 symbolnum=symbolp->get_symbolvectorpos(); 00404 inputbufferlen=SwallowSymbolFromInputQueueStr(symbolp,inputbufferstr,inputbufferlen,false); 00405 return true; 00406 } |
|
Parse the next symbol from the input, and set symbolnum to the symbol id#,.
Implements OctaneParser. Definition at line 409 of file substringparser.cpp. References endofstreamsymbolnum, SubstringSymbol::get_valuep(), symbolvector, and bitwriter::write().
00410 { 00411 // write the symbol indexed by symbolnum 00412 // sets isendofostreamsymbol to true or false depending on if the symbol written is the EOS symbol 00413 // return true on success 00414 00415 SubstringSymbol *symbolp=symbolvector[symbolnum]; 00416 int valuelength=(int)((symbolp->get_valuep())->length()); 00417 to.write((symbolp->get_valuep())->c_str(),valuelength); 00418 00419 // set EOS flag 00420 isendofstreamsymbol=(symbolnum==endofstreamsymbolnum); 00421 00422 // return success 00423 return true; 00424 } |
|
Helper function to return the text string of a symbol number.
Implements OctaneParser. Definition at line 235 of file substringparser.hpp. References GetDiskspaceUsed(), GetMemoryUsed(), GetParametersInformation(), LoadState(), SaveState(), SetDefaultParameters(), SetParameter(), ShowDebuggingInfo(), and symbolset.
00235 {assert(symbolnum<(int)(symbolvector.size()));return symbolvector[symbolnum]->get_value();}; |
|
This function is called to set a variable value.
Reimplemented from OctaneClass. Definition at line 178 of file substringparser.cpp. References Parameter_MaxSubstringSize, Parameter_MaxSymbols_DuringBuild, Parameter_MaxSymbols_Final, Parameter_OnlyCodeWholeWords, Parameter_ParseMode, Parameter_PruneMinimumWeight, Parameter_PruneReCalculations, Parameter_SpanWordBoundaries, Parameter_UseSmartLookup, and OctaneClass::ParseParameter(). Referenced by LookupSymbolText().
00179 { 00180 // generic function for user-interactive modification of parameters 00181 // (see SetDefaultDictionaryBuildingParameters implementation for descriptions) 00182 // return true if we know this variable and use it 00183 bool bretv=false; 00184 if (parametername=="maxbuildsymbols") 00185 bretv=ParseParameter(parametervalue,Parameter_MaxSymbols_DuringBuild); 00186 else if (parametername=="maxsymbols") 00187 bretv=ParseParameter(parametervalue,Parameter_MaxSymbols_Final); 00188 else if (parametername=="maxSubstring") 00189 bretv=ParseParameter(parametervalue,Parameter_MaxSubstringSize); 00190 else if (parametername=="onlywords") 00191 bretv=ParseParameter(parametervalue,Parameter_OnlyCodeWholeWords); 00192 else if (parametername=="spanwords") 00193 bretv=ParseParameter(parametervalue,Parameter_SpanWordBoundaries); 00194 else if (parametername=="crends") 00195 bretv=ParseParameter(parametervalue,Parameter_CountCRsAsEOTs); 00196 else if (parametername=="minweight") 00197 bretv=ParseParameter(parametervalue,Parameter_PruneMinimumWeight); 00198 else if (parametername=="smartlookup") 00199 bretv=ParseParameter(parametervalue,Parameter_UseSmartLookup); 00200 else if (parametername=="parsemode") 00201 bretv=ParseParameter(parametervalue,Parameter_ParseMode); 00202 else if (parametername=="recalculations") 00203 bretv=ParseParameter(parametervalue,Parameter_PruneReCalculations); 00204 return bretv; 00205 } |
|
Save state of object to output stream (if appropriate).
Reimplemented from OctaneClass. Definition at line 55 of file substringparser.cpp. Referenced by LookupSymbolText().
00056 { 00057 // save state 00058 // return true on success 00059 bool bretv; 00060 bretv=SaveParameters(to); 00061 if (bretv) 00062 bretv=SaveSymbols(to); 00063 return bretv; 00064 } |
|
Load state of object from input stream (if appropriate).
Reimplemented from OctaneClass. Definition at line 66 of file substringparser.cpp. Referenced by LookupSymbolText().
00067 { 00068 // load state 00069 // return true on success 00070 bool bretv; 00071 bretv=LoadParameters(from); 00072 if (bretv) 00073 bretv=LoadSymbols(from); 00074 return bretv; 00075 } |
|
The parser needs to return an end-of-symbol character before finishing reading the stream, so we need to track whether we sent an EOS yet.
Definition at line 175 of file substringparser.hpp. Referenced by ParseNextSymbolFromInput(), SubstringParser(), and SynchronizeStateForNewStream(). |