| Octane v1.01.20 - The Open Compression Toolkit for C++ | http://octane.sourceforge.net/ | 
#include <substringparser.hpp>
Inheritance diagram for SubstringParser:

It can be configured to build a symbol set using only whole words, or with substrings that are not affected by word boundary (max symbol length can be set). Various input parsing strategies can be used, from greedy (fastest) to longest first, to near optimal (slowest). Various heuristics for pruning the symbol set can be applied, which can be used to get the symbol set down to a desired size.
Definition at line 154 of file substringparser.hpp.
Public Member Functions | |
| SubstringParser () | |
| Constructor.  | |
| ~SubstringParser () | |
| Destructor.  | |
| virtual bool | CreateSymbolSetUsingStream (bitreader &from) | 
| Process (train on) an input stream to update/create a symbol set from it.   | |
| virtual bool | IsReadyToParse () | 
| are we ready to parse? i.e. has symbol set been built.  | |
| virtual bool | RewindAnyBufferedInput (bitreader &from) | 
| Let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed.   | |
| virtual void | SynchronizeStateForNewStream () | 
| Synchronize state for a new stream - this is called every time a new stream is parsed, to allow parser to get into 'start' state,.  | |
| virtual int | GetSymbolCount () | 
| Get a count of the number of symbols stored in the parser.  | |
| virtual bool | ParseNextSymbolFromInput (bitreader &from, int &symbolnum) | 
| Parse the next symbol from the input and return its #.   | |
| virtual bool | WriteSymbolText (bitwriter &to, int symbolnum, bool &isendofstreamsymbol) | 
| Parse the next symbol from the input, and set symbolnum to the symbol id#,.   | |
| virtual string | LookupSymbolText (int symbolnum) | 
| Helper function to return the text string of a symbol number.   | |
| virtual void | ShowDebuggingInfo () | 
| show any debugging info on request (used by various derived classes) [optional]  | |
| virtual unsigned int | GetMemoryUsed () | 
| Report actual current memory usage (in bytes).  | |
| virtual unsigned int | GetDiskspaceUsed (bool fortempdecompressiononly) | 
| Report disk space (in bytes) that would be used when saving state to file (in bytes).  | |
| virtual std::string | GetParametersInformation () | 
| Reports information about any parameters available.  | |
| virtual void | SetDefaultParameters () | 
| Reset any parameters to their default values.  | |
| virtual bool | SetParameter (const std::string ¶metername, const std::string ¶metervalue) | 
| This function is called to set a variable value.   | |
| virtual bool | SaveState (bitwriter &to, bool fortempdecompressiononly) | 
| Save state of object to output stream (if appropriate).   | |
| virtual bool | LoadState (bitreader &from) | 
| Load state of object from input stream (if appropriate).   | |
| int | GetSetSymbolCount () | 
| int | LookupSymbolNumberFromText (string wordstring) | 
Protected Types | |
| 
typedef std::set< SubstringSymbol *, TSubstringSymbolStringGreater< SubstringSymbol * > >  | TSymbolSetTYPE | 
| Typedef for stl set of symbols (used for fast alphabetical lookup).  | |
| typedef std::vector< SubstringSymbol * > | TSymbolVectorTYPE | 
| Typedef for stl vector of symbols (used for fast lookup by id#).  | |
Protected Attributes | |
| TSymbolSetTYPE | symbolset | 
| STL set of symbols (used for fast alphabetical lookup).  | |
| TSymbolVectorTYPE | symbolvector | 
| STL vector of symbols (used for fast lookup by id#).  | |
| TSymbolSetTYPE::iterator | symbolset_pos | 
| STL set iterator - this is a non-local variable to eliminate cost of building it on stack on every local function.  | |
| TSymbolVectorTYPE::iterator | symbolvector_pos | 
| Stl vector iterator - this is a non-local variable to eliminate cost of building it on stack on every local function.  | |
| char | inputbufferstr [SubStrHuff_BiggestSubStringLen] | 
| Current window buffer for parsing, filled as we read input characters.  | |
| int | inputbufferlen | 
| Length of input window buffer.  | |
| bool | senteos | 
| The parser needs to return an end-of-symbol character before finishing reading the stream, so we need to track whether we sent an EOS yet.   | |
| unsigned int | Parameter_MaxSymbols_DuringBuild | 
| The dictionary will never be allowed to get bigger than this size (smaller will also increase building speed) when this size is exceeded, we will eliminate all symbols with frequencies of smaller than some small #.  | |
| unsigned int | Parameter_MaxSymbols_Final | 
| After the entire training file is parsed, we will repeat pruning steps to reduce it to a *maximum* of this many entries the actual symbols kept may be smaller depending on other pruning parameters.  | |
| int | Parameter_MaxSubstringSize | 
| How many characters is largest (sub)string stored in dictionary.  | |
| bool | Parameter_OnlyCodeWholeWords | 
| Only store characters and whole words?  | |
| bool | Parameter_SpanWordBoundaries | 
| Store Substrings that span words (i.e. they contain multiple words).  | |
| bool | Parameter_CountCRsAsEOTs | 
| unsigned int | Parameter_PruneMinimumWeight | 
| Minimum count for a word or Substring in order to even consider leaving it in the dictionary.  | |
| int | Parameter_ParseMode | 
| We can either use a greedy search to find longest Substring starting with first char or we can use smarter but slower heuristics.  | |
| bool | Parameter_UseSmartLookup | 
| We can either use a smart search using stl find_lower_bound() to find longest Substring or we can search starting at biggest string backwards.  | |
| bool | Parameter_PruneReCalculations | 
| How many iterations should we recompress to test heuristic symbolset cost, if any this tests the currently built symbol set to 'reencode' the input stream and calculates actual usage counts and then reprunes.  | |
| int | symbolcount | 
| Keeps track of how many symbols have we created for fast replies.  | |
| int | endofstreamsymbolnum | 
| Keeps track of end of stream symbol number for fast replies.  | |
| bitreader * | current_bitreaderp | 
| Stores last bitreader , so that we can rewind when we want to reparse and check our heuristic symbolset costs.  | |
| size_t | start_bitreaderposition | 
| Stores last bitreader starting position before parsing, so that we can rewind when we want to reparse and check our heuristic symbolset costs.  | |
      
  | 
  
| 
 Process (train on) an input stream to update/create a symbol set from it. 
 
 Reimplemented from OctaneParser. Definition at line 281 of file substringparser.cpp. References current_bitreaderp, Parameter_MaxSubstringSize, Parameter_MaxSymbols_DuringBuild, start_bitreaderposition, and symbolset. 
 00282 {
00283         // scan a file and add its "tokens" into our symbol set
00284         string valuestring;
00285         string slidingwindowstring;
00286         unsigned char c;
00287         bool slidingwindowfull=false;
00288         string eotstring=string("");
00289         
00290         // clear any existing symbols
00291         FreeData_Symbols();
00292         // create primitive characters
00293         AddPrimitiveCharacterSubstringSymbols();
00294         
00295         // save the current bitreader and bitreader position for reparsing during pruning
00296         current_bitreaderp = &from;
00297         start_bitreaderposition = from.tell_bit();
00298 
00299         // now parse the file
00300         while (!from.empty())
00301                 {
00302                 // grab a character
00303                 c=from.get_byte();
00304 
00305                 // add the character to our window
00306                 slidingwindowstring+=c;
00307 
00308                 // trim lefthand side if it is too big
00309                 if (!slidingwindowfull)
00310                         {
00311                         // we keep a sliding window which is slightly bigger than the maxSubstring, to account for delimiter boundaries
00312                         if ((int)(slidingwindowstring.length())>Parameter_MaxSubstringSize+5)
00313                                 {
00314                                 // set the flag saying from now on, we are full
00315                                 slidingwindowfull=true;
00316                                 }
00317                         }
00318                 if (slidingwindowfull)
00319                         {
00320                         // trim leftmost character
00321                         slidingwindowstring.erase(0,1);
00322                         }
00323 
00324                 // now add the appropriate Substrings of slidingwindowstring to our dictionary SymbolSet
00325                 AddSubstringsFromSlidingWindowString(slidingwindowstring);
00326 
00327                 // when encoding single lines, each line must end with an end-of-stream (EOS) symbol
00328                 //  but if we train on a file, which is large and has only one EOS at end of file
00329                 //  then the freq. of EOS will be very low (1), so here we can say to count a pretend EOS
00330                 //  whenever we see a CR
00331                 if (c==13 && Parameter_CountCRsAsEOTs)
00332                         {
00333                         // increment frequency for EOS
00334                         UpdateValueStringInSymbolSet(eotstring,1);
00335                         }
00336 
00337                 if (symbolset.size()>Parameter_MaxSymbols_DuringBuild)
00338                         {
00339                         // periodically we may have to prune away lowest frequency symbols in order to keep dictionary manageable
00340                                 // how many should we remove - we don't want to remove just one since that would result in many repetitive calls here
00341                         //  so we just pick some fraction to remove
00342                         PruneSymbolSet_ReduceSizeTo(((unsigned int)(symbolset.size())/5)+1);
00343                         }
00344                 }
00345 
00346         // now final pruning
00347         PruneSymbolSet();
00348 
00349         // and now build the vector for fast random access
00350         BuildSymbolVector();
00351         
00352         // return success
00353         return true;
00354 }
 | 
  
      
  | 
  
| 
 Let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed. it is necessary because some parsers can read-ahead in input buffer, and so must rewind the bitstream. Reimplemented from OctaneParser. Definition at line 258 of file substringparser.cpp. References inputbufferlen, bitreader::seek_bit(), and bitreader::tell_bit(). 
 00259 {
00260         // let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand
00261         //  off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed.
00262         //  it is necessary because some parsers can read-ahead in input buffer, and so must rewind the bitstream.
00263         // we have to implement this since we do a read-ahead.  note this wouldn't be called in normal circumstances,
00264         //  just if compressors wants us to stop prematurely and hand off the job to someone else.
00265         // return true on success
00266 
00267         // the number of characters we have read ahead in buffer is inputbufferlen
00268         from.seek_bit(from.tell_bit()-(inputbufferlen*8));
00269         return true;
00270 }
 | 
  
      
  | 
  ||||||||||||
| 
 Parse the next symbol from the input and return its #. 
 
 
 Implements OctaneParser. Definition at line 362 of file substringparser.cpp. References current_bitreaderp, bitreader::get_byte(), SubstringSymbol::get_symbolvectorpos(), inputbufferlen, inputbufferstr, Parameter_MaxSubstringSize, and senteos. 
 00363 {
00364         // grab an input stream symbol and set its INDEX (in symbol vector) for symbolnum
00365         // return false after EOS
00366         unsigned char c;
00367         int maxSubstringlen=Parameter_MaxSubstringSize;
00368         SubstringSymbol *symbolp;
00369         // note that inputbufferlen are preserved across calls, and inputbufferstr[] is preserved state info
00370 
00371         // try to fill up the inputqueue to Parameter_MaxSubstringSize size, or as much as we got
00372         current_bitreaderp=&from;
00373         while (inputbufferlen<maxSubstringlen && !from.empty())
00374                 {
00375                 // add the character
00376                 c=from.get_byte();
00377                 inputbufferstr[inputbufferlen]=c;
00378                 // increment character positions
00379                 ++inputbufferlen;
00380                 }
00381         
00382         if (inputbufferlen==0)
00383                 {
00384                 // no more symbols left - BUT the question now is, do we return an EOS symbol, or false for no symbols left
00385                 if (senteos)
00386                         {
00387                         // we already sent an EOS so from now on any requests for a symbol returns false saying no more symbols available
00388                         return false;
00389                         }
00390                 else
00391                         {
00392                         // we are going to drop down to return the EOS signal, but we set flag so we don't do it again
00393                         senteos=true;
00394                         }
00395                 }
00396 
00397         // ok now we have a block of up to Parameter_MaxSubstringSize bytes from the left of the inputstr
00398         // now find the longest leftmost (prefix) Substring in our dictionary and encode it, and shift inputqueuestr to the left with remaining bytes
00399         //  and return the new inputqueuestr with remaining bytes, and return length of remaining bytes.
00400         symbolp=FindNextSymbolToEncode(inputbufferstr,inputbufferlen);
00401 //      symbolp=FindNextSubstringSymbolpFromInputQueueStr(inputbufferstr,inputbufferlen);
00402 
00403         symbolnum=symbolp->get_symbolvectorpos();
00404         inputbufferlen=SwallowSymbolFromInputQueueStr(symbolp,inputbufferstr,inputbufferlen,false);
00405         return true;
00406 }
 | 
  
      
  | 
  ||||||||||||||||
| 
 Parse the next symbol from the input, and set symbolnum to the symbol id#,. 
 
 Implements OctaneParser. Definition at line 409 of file substringparser.cpp. References endofstreamsymbolnum, SubstringSymbol::get_valuep(), symbolvector, and bitwriter::write(). 
 00410 {
00411         // write the symbol indexed by symbolnum
00412         // sets isendofostreamsymbol to true or false depending on if the symbol written is the EOS symbol
00413         // return true on success
00414 
00415         SubstringSymbol *symbolp=symbolvector[symbolnum];
00416         int valuelength=(int)((symbolp->get_valuep())->length());
00417         to.write((symbolp->get_valuep())->c_str(),valuelength);
00418 
00419         // set EOS flag
00420         isendofstreamsymbol=(symbolnum==endofstreamsymbolnum);
00421 
00422         // return success
00423         return true;
00424 }
 | 
  
      
  | 
  
| 
 Helper function to return the text string of a symbol number. 
 
 
 Implements OctaneParser. Definition at line 235 of file substringparser.hpp. References GetDiskspaceUsed(), GetMemoryUsed(), GetParametersInformation(), LoadState(), SaveState(), SetDefaultParameters(), SetParameter(), ShowDebuggingInfo(), and symbolset. 
 00235 {assert(symbolnum<(int)(symbolvector.size()));return symbolvector[symbolnum]->get_value();};
 | 
  
      
  | 
  ||||||||||||
| 
 This function is called to set a variable value. 
 
 
 Reimplemented from OctaneClass. Definition at line 178 of file substringparser.cpp. References Parameter_MaxSubstringSize, Parameter_MaxSymbols_DuringBuild, Parameter_MaxSymbols_Final, Parameter_OnlyCodeWholeWords, Parameter_ParseMode, Parameter_PruneMinimumWeight, Parameter_PruneReCalculations, Parameter_SpanWordBoundaries, Parameter_UseSmartLookup, and OctaneClass::ParseParameter(). Referenced by LookupSymbolText(). 
 00179 {
00180         // generic function for user-interactive modification of parameters
00181         // (see SetDefaultDictionaryBuildingParameters implementation for descriptions)
00182         // return true if we know this variable and use it
00183         bool bretv=false;
00184         if (parametername=="maxbuildsymbols")
00185                 bretv=ParseParameter(parametervalue,Parameter_MaxSymbols_DuringBuild);
00186         else if (parametername=="maxsymbols")
00187                 bretv=ParseParameter(parametervalue,Parameter_MaxSymbols_Final);
00188         else if (parametername=="maxSubstring")
00189                 bretv=ParseParameter(parametervalue,Parameter_MaxSubstringSize);
00190         else if (parametername=="onlywords")
00191                 bretv=ParseParameter(parametervalue,Parameter_OnlyCodeWholeWords);
00192         else if (parametername=="spanwords")
00193                 bretv=ParseParameter(parametervalue,Parameter_SpanWordBoundaries);
00194         else if (parametername=="crends")
00195                 bretv=ParseParameter(parametervalue,Parameter_CountCRsAsEOTs);
00196         else if (parametername=="minweight")
00197                 bretv=ParseParameter(parametervalue,Parameter_PruneMinimumWeight);
00198         else if (parametername=="smartlookup")
00199                 bretv=ParseParameter(parametervalue,Parameter_UseSmartLookup);
00200         else if (parametername=="parsemode")
00201                 bretv=ParseParameter(parametervalue,Parameter_ParseMode);
00202         else if (parametername=="recalculations")
00203                 bretv=ParseParameter(parametervalue,Parameter_PruneReCalculations);
00204         return bretv;
00205 }
 | 
  
      
  | 
  ||||||||||||
| 
 Save state of object to output stream (if appropriate). 
 
 Reimplemented from OctaneClass. Definition at line 55 of file substringparser.cpp. Referenced by LookupSymbolText(). 
 00056 {
00057         // save state
00058         // return true on success
00059         bool bretv;
00060         bretv=SaveParameters(to);
00061         if (bretv)
00062                 bretv=SaveSymbols(to);
00063         return bretv;
00064 }
 | 
  
      
  | 
  
| 
 Load state of object from input stream (if appropriate). 
 
 Reimplemented from OctaneClass. Definition at line 66 of file substringparser.cpp. Referenced by LookupSymbolText(). 
 00067 {
00068         // load state
00069         // return true on success
00070         bool bretv;
00071         bretv=LoadParameters(from);
00072         if (bretv)
00073                 bretv=LoadSymbols(from);
00074         return bretv;
00075 }
 | 
  
      
  | 
  
| 
 The parser needs to return an end-of-symbol character before finishing reading the stream, so we need to track whether we sent an EOS yet. 
 
 Definition at line 175 of file substringparser.hpp. Referenced by ParseNextSymbolFromInput(), SubstringParser(), and SynchronizeStateForNewStream().  |