Octane v1.01.20 - The Open Compression Toolkit for C++ http://octane.sourceforge.net/
Homepage | Main | Modules | Class Hierarchy | Compound List | File List | Compound Members | Related Pages

SubstringParser Class Reference
[Parsers]

#include <substringparser.hpp>

Inheritance diagram for SubstringParser:

OctaneParser OctaneClass List of all members.

Detailed Description

The SubstringParser class is a flexible parser, which uses arbitrarily long substrings of text as symbols.

It can be configured to build a symbol set using only whole words, or with substrings that are not affected by word boundary (max symbol length can be set). Various input parsing strategies can be used, from greedy (fastest) to longest first, to near optimal (slowest). Various heuristics for pruning the symbol set can be applied, which can be used to get the symbol set down to a desired size.

Definition at line 154 of file substringparser.hpp.

Public Member Functions

 SubstringParser ()
 Constructor.

 ~SubstringParser ()
 Destructor.

virtual bool CreateSymbolSetUsingStream (bitreader &from)
 Process (train on) an input stream to update/create a symbol set from it.

virtual bool IsReadyToParse ()
 are we ready to parse? i.e. has symbol set been built.

virtual bool RewindAnyBufferedInput (bitreader &from)
 Let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed.

virtual void SynchronizeStateForNewStream ()
 Synchronize state for a new stream - this is called every time a new stream is parsed, to allow parser to get into 'start' state,.

virtual int GetSymbolCount ()
 Get a count of the number of symbols stored in the parser.

virtual bool ParseNextSymbolFromInput (bitreader &from, int &symbolnum)
 Parse the next symbol from the input and return its #.

virtual bool WriteSymbolText (bitwriter &to, int symbolnum, bool &isendofstreamsymbol)
 Parse the next symbol from the input, and set symbolnum to the symbol id#,.

virtual string LookupSymbolText (int symbolnum)
 Helper function to return the text string of a symbol number.

virtual void ShowDebuggingInfo ()
 show any debugging info on request (used by various derived classes) [optional]

virtual unsigned int GetMemoryUsed ()
 Report actual current memory usage (in bytes).

virtual unsigned int GetDiskspaceUsed (bool fortempdecompressiononly)
 Report disk space (in bytes) that would be used when saving state to file (in bytes).

virtual std::string GetParametersInformation ()
 Reports information about any parameters available.

virtual void SetDefaultParameters ()
 Reset any parameters to their default values.

virtual bool SetParameter (const std::string &parametername, const std::string &parametervalue)
 This function is called to set a variable value.

virtual bool SaveState (bitwriter &to, bool fortempdecompressiononly)
 Save state of object to output stream (if appropriate).

virtual bool LoadState (bitreader &from)
 Load state of object from input stream (if appropriate).

int GetSetSymbolCount ()
int LookupSymbolNumberFromText (string wordstring)

Protected Types

typedef std::set< SubstringSymbol *,
TSubstringSymbolStringGreater<
SubstringSymbol * > > 
TSymbolSetTYPE
 Typedef for stl set of symbols (used for fast alphabetical lookup).

typedef std::vector< SubstringSymbol * > TSymbolVectorTYPE
 Typedef for stl vector of symbols (used for fast lookup by id#).


Protected Attributes

TSymbolSetTYPE symbolset
 STL set of symbols (used for fast alphabetical lookup).

TSymbolVectorTYPE symbolvector
 STL vector of symbols (used for fast lookup by id#).

TSymbolSetTYPE::iterator symbolset_pos
 STL set iterator - this is a non-local variable to eliminate cost of building it on stack on every local function.

TSymbolVectorTYPE::iterator symbolvector_pos
 Stl vector iterator - this is a non-local variable to eliminate cost of building it on stack on every local function.

char inputbufferstr [SubStrHuff_BiggestSubStringLen]
 Current window buffer for parsing, filled as we read input characters.

int inputbufferlen
 Length of input window buffer.

bool senteos
 The parser needs to return an end-of-symbol character before finishing reading the stream, so we need to track whether we sent an EOS yet.

unsigned int Parameter_MaxSymbols_DuringBuild
 The dictionary will never be allowed to get bigger than this size (smaller will also increase building speed) when this size is exceeded, we will eliminate all symbols with frequencies of smaller than some small #.

unsigned int Parameter_MaxSymbols_Final
 After the entire training file is parsed, we will repeat pruning steps to reduce it to a *maximum* of this many entries the actual symbols kept may be smaller depending on other pruning parameters.

int Parameter_MaxSubstringSize
 How many characters is largest (sub)string stored in dictionary.

bool Parameter_OnlyCodeWholeWords
 Only store characters and whole words?

bool Parameter_SpanWordBoundaries
 Store Substrings that span words (i.e. they contain multiple words).

bool Parameter_CountCRsAsEOTs
unsigned int Parameter_PruneMinimumWeight
 Minimum count for a word or Substring in order to even consider leaving it in the dictionary.

int Parameter_ParseMode
 We can either use a greedy search to find longest Substring starting with first char or we can use smarter but slower heuristics.

bool Parameter_UseSmartLookup
 We can either use a smart search using stl find_lower_bound() to find longest Substring or we can search starting at biggest string backwards.

bool Parameter_PruneReCalculations
 How many iterations should we recompress to test heuristic symbolset cost, if any this tests the currently built symbol set to 'reencode' the input stream and calculates actual usage counts and then reprunes.

int symbolcount
 Keeps track of how many symbols have we created for fast replies.

int endofstreamsymbolnum
 Keeps track of end of stream symbol number for fast replies.

bitreadercurrent_bitreaderp
 Stores last bitreader , so that we can rewind when we want to reparse and check our heuristic symbolset costs.

size_t start_bitreaderposition
 Stores last bitreader starting position before parsing, so that we can rewind when we want to reparse and check our heuristic symbolset costs.


Member Function Documentation

bool SubstringParser::CreateSymbolSetUsingStream bitreader from  )  [virtual]
 

Process (train on) an input stream to update/create a symbol set from it.

Returns:
true on success

Reimplemented from OctaneParser.

Definition at line 281 of file substringparser.cpp.

References current_bitreaderp, Parameter_MaxSubstringSize, Parameter_MaxSymbols_DuringBuild, start_bitreaderposition, and symbolset.

00282 {
00283         // scan a file and add its "tokens" into our symbol set
00284         string valuestring;
00285         string slidingwindowstring;
00286         unsigned char c;
00287         bool slidingwindowfull=false;
00288         string eotstring=string("");
00289         
00290         // clear any existing symbols
00291         FreeData_Symbols();
00292         // create primitive characters
00293         AddPrimitiveCharacterSubstringSymbols();
00294         
00295         // save the current bitreader and bitreader position for reparsing during pruning
00296         current_bitreaderp = &from;
00297         start_bitreaderposition = from.tell_bit();
00298 
00299         // now parse the file
00300         while (!from.empty())
00301                 {
00302                 // grab a character
00303                 c=from.get_byte();
00304 
00305                 // add the character to our window
00306                 slidingwindowstring+=c;
00307 
00308                 // trim lefthand side if it is too big
00309                 if (!slidingwindowfull)
00310                         {
00311                         // we keep a sliding window which is slightly bigger than the maxSubstring, to account for delimiter boundaries
00312                         if ((int)(slidingwindowstring.length())>Parameter_MaxSubstringSize+5)
00313                                 {
00314                                 // set the flag saying from now on, we are full
00315                                 slidingwindowfull=true;
00316                                 }
00317                         }
00318                 if (slidingwindowfull)
00319                         {
00320                         // trim leftmost character
00321                         slidingwindowstring.erase(0,1);
00322                         }
00323 
00324                 // now add the appropriate Substrings of slidingwindowstring to our dictionary SymbolSet
00325                 AddSubstringsFromSlidingWindowString(slidingwindowstring);
00326 
00327                 // when encoding single lines, each line must end with an end-of-stream (EOS) symbol
00328                 //  but if we train on a file, which is large and has only one EOS at end of file
00329                 //  then the freq. of EOS will be very low (1), so here we can say to count a pretend EOS
00330                 //  whenever we see a CR
00331                 if (c==13 && Parameter_CountCRsAsEOTs)
00332                         {
00333                         // increment frequency for EOS
00334                         UpdateValueStringInSymbolSet(eotstring,1);
00335                         }
00336 
00337                 if (symbolset.size()>Parameter_MaxSymbols_DuringBuild)
00338                         {
00339                         // periodically we may have to prune away lowest frequency symbols in order to keep dictionary manageable
00340                                 // how many should we remove - we don't want to remove just one since that would result in many repetitive calls here
00341                         //  so we just pick some fraction to remove
00342                         PruneSymbolSet_ReduceSizeTo(((unsigned int)(symbolset.size())/5)+1);
00343                         }
00344                 }
00345 
00346         // now final pruning
00347         PruneSymbolSet();
00348 
00349         // and now build the vector for fast random access
00350         BuildSymbolVector();
00351         
00352         // return success
00353         return true;
00354 }

bool SubstringParser::RewindAnyBufferedInput bitreader from  )  [virtual]
 

Let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed.

it is necessary because some parsers can read-ahead in input buffer, and so must rewind the bitstream.

Reimplemented from OctaneParser.

Definition at line 258 of file substringparser.cpp.

References inputbufferlen, bitreader::seek_bit(), and bitreader::tell_bit().

00259 {
00260         // let go of any buffered stream - this is an odd function that can be called be compressor if it wants to hand
00261         //  off the input stream to a new parser or otherwise access the input bitstream from after last symbol parsed.
00262         //  it is necessary because some parsers can read-ahead in input buffer, and so must rewind the bitstream.
00263         // we have to implement this since we do a read-ahead.  note this wouldn't be called in normal circumstances,
00264         //  just if compressors wants us to stop prematurely and hand off the job to someone else.
00265         // return true on success
00266 
00267         // the number of characters we have read ahead in buffer is inputbufferlen
00268         from.seek_bit(from.tell_bit()-(inputbufferlen*8));
00269         return true;
00270 }

bool SubstringParser::ParseNextSymbolFromInput bitreader from,
int &  symbolnum
[virtual]
 

Parse the next symbol from the input and return its #.

Note:
The end of stream situation must to be handled specially:

When a parser encounters the end of a stream, it *MUST* return a symbol signifying an end-of-stream symbol.

This end of stream symbol must be a unique symbol from the symbol set.

Returns:
false *after* the end of stream symbol is returned on prior call.

Implements OctaneParser.

Definition at line 362 of file substringparser.cpp.

References current_bitreaderp, bitreader::get_byte(), SubstringSymbol::get_symbolvectorpos(), inputbufferlen, inputbufferstr, Parameter_MaxSubstringSize, and senteos.

00363 {
00364         // grab an input stream symbol and set its INDEX (in symbol vector) for symbolnum
00365         // return false after EOS
00366         unsigned char c;
00367         int maxSubstringlen=Parameter_MaxSubstringSize;
00368         SubstringSymbol *symbolp;
00369         // note that inputbufferlen are preserved across calls, and inputbufferstr[] is preserved state info
00370 
00371         // try to fill up the inputqueue to Parameter_MaxSubstringSize size, or as much as we got
00372         current_bitreaderp=&from;
00373         while (inputbufferlen<maxSubstringlen && !from.empty())
00374                 {
00375                 // add the character
00376                 c=from.get_byte();
00377                 inputbufferstr[inputbufferlen]=c;
00378                 // increment character positions
00379                 ++inputbufferlen;
00380                 }
00381         
00382         if (inputbufferlen==0)
00383                 {
00384                 // no more symbols left - BUT the question now is, do we return an EOS symbol, or false for no symbols left
00385                 if (senteos)
00386                         {
00387                         // we already sent an EOS so from now on any requests for a symbol returns false saying no more symbols available
00388                         return false;
00389                         }
00390                 else
00391                         {
00392                         // we are going to drop down to return the EOS signal, but we set flag so we don't do it again
00393                         senteos=true;
00394                         }
00395                 }
00396 
00397         // ok now we have a block of up to Parameter_MaxSubstringSize bytes from the left of the inputstr
00398         // now find the longest leftmost (prefix) Substring in our dictionary and encode it, and shift inputqueuestr to the left with remaining bytes
00399         //  and return the new inputqueuestr with remaining bytes, and return length of remaining bytes.
00400         symbolp=FindNextSymbolToEncode(inputbufferstr,inputbufferlen);
00401 //      symbolp=FindNextSubstringSymbolpFromInputQueueStr(inputbufferstr,inputbufferlen);
00402 
00403         symbolnum=symbolp->get_symbolvectorpos();
00404         inputbufferlen=SwallowSymbolFromInputQueueStr(symbolp,inputbufferstr,inputbufferlen,false);
00405         return true;
00406 }

bool SubstringParser::WriteSymbolText bitwriter to,
int  symbolnum,
bool &  isendofstreamsymbol
[virtual]
 

Parse the next symbol from the input, and set symbolnum to the symbol id#,.

Returns:
false *after* end of stream (i.e. first response at end of stream should be the end-of-stream symbol).

Implements OctaneParser.

Definition at line 409 of file substringparser.cpp.

References endofstreamsymbolnum, SubstringSymbol::get_valuep(), symbolvector, and bitwriter::write().

00410 {
00411         // write the symbol indexed by symbolnum
00412         // sets isendofostreamsymbol to true or false depending on if the symbol written is the EOS symbol
00413         // return true on success
00414 
00415         SubstringSymbol *symbolp=symbolvector[symbolnum];
00416         int valuelength=(int)((symbolp->get_valuep())->length());
00417         to.write((symbolp->get_valuep())->c_str(),valuelength);
00418 
00419         // set EOS flag
00420         isendofstreamsymbol=(symbolnum==endofstreamsymbolnum);
00421 
00422         // return success
00423         return true;
00424 }

virtual string SubstringParser::LookupSymbolText int  symbolnum  )  [inline, virtual]
 

Helper function to return the text string of a symbol number.

Returns:
a string with the text of the symbol
Note:
this should be the empty string to signify end of stream symbol.

Implements OctaneParser.

Definition at line 235 of file substringparser.hpp.

References GetDiskspaceUsed(), GetMemoryUsed(), GetParametersInformation(), LoadState(), SaveState(), SetDefaultParameters(), SetParameter(), ShowDebuggingInfo(), and symbolset.

00235 {assert(symbolnum<(int)(symbolvector.size()));return symbolvector[symbolnum]->get_value();};

bool SubstringParser::SetParameter const std::string &  parametername,
const std::string &  parametervalue
[virtual]
 

This function is called to set a variable value.

Note:
this can be called with a variable not owned by this class, in which case the function should return false.
Returns:
true if the variable was found (whether it was set properly or not).

Reimplemented from OctaneClass.

Definition at line 178 of file substringparser.cpp.

References Parameter_MaxSubstringSize, Parameter_MaxSymbols_DuringBuild, Parameter_MaxSymbols_Final, Parameter_OnlyCodeWholeWords, Parameter_ParseMode, Parameter_PruneMinimumWeight, Parameter_PruneReCalculations, Parameter_SpanWordBoundaries, Parameter_UseSmartLookup, and OctaneClass::ParseParameter().

Referenced by LookupSymbolText().

00179 {
00180         // generic function for user-interactive modification of parameters
00181         // (see SetDefaultDictionaryBuildingParameters implementation for descriptions)
00182         // return true if we know this variable and use it
00183         bool bretv=false;
00184         if (parametername=="maxbuildsymbols")
00185                 bretv=ParseParameter(parametervalue,Parameter_MaxSymbols_DuringBuild);
00186         else if (parametername=="maxsymbols")
00187                 bretv=ParseParameter(parametervalue,Parameter_MaxSymbols_Final);
00188         else if (parametername=="maxSubstring")
00189                 bretv=ParseParameter(parametervalue,Parameter_MaxSubstringSize);
00190         else if (parametername=="onlywords")
00191                 bretv=ParseParameter(parametervalue,Parameter_OnlyCodeWholeWords);
00192         else if (parametername=="spanwords")
00193                 bretv=ParseParameter(parametervalue,Parameter_SpanWordBoundaries);
00194         else if (parametername=="crends")
00195                 bretv=ParseParameter(parametervalue,Parameter_CountCRsAsEOTs);
00196         else if (parametername=="minweight")
00197                 bretv=ParseParameter(parametervalue,Parameter_PruneMinimumWeight);
00198         else if (parametername=="smartlookup")
00199                 bretv=ParseParameter(parametervalue,Parameter_UseSmartLookup);
00200         else if (parametername=="parsemode")
00201                 bretv=ParseParameter(parametervalue,Parameter_ParseMode);
00202         else if (parametername=="recalculations")
00203                 bretv=ParseParameter(parametervalue,Parameter_PruneReCalculations);
00204         return bretv;
00205 }

bool SubstringParser::SaveState bitwriter to,
bool  fortempdecompressiononly
[virtual]
 

Save state of object to output stream (if appropriate).

Returns:
true on success (or if no information needs to be saved).

Reimplemented from OctaneClass.

Definition at line 55 of file substringparser.cpp.

Referenced by LookupSymbolText().

00056 {
00057         // save state
00058         // return true on success
00059         bool bretv;
00060         bretv=SaveParameters(to);
00061         if (bretv)
00062                 bretv=SaveSymbols(to);
00063         return bretv;
00064 }

bool SubstringParser::LoadState bitreader from  )  [virtual]
 

Load state of object from input stream (if appropriate).

Returns:
true on success (or if no information needs to be loaded).

Reimplemented from OctaneClass.

Definition at line 66 of file substringparser.cpp.

Referenced by LookupSymbolText().

00067 {
00068         // load state
00069         // return true on success
00070         bool bretv;
00071         bretv=LoadParameters(from);
00072         if (bretv)
00073                 bretv=LoadSymbols(from);
00074         return bretv;
00075 }


Member Data Documentation

bool SubstringParser::senteos [protected]
 

The parser needs to return an end-of-symbol character before finishing reading the stream, so we need to track whether we sent an EOS yet.

Todo:
this is an odd responsibility for the parser to have to deal with, so we should probably find a way to eliminate this.

Definition at line 175 of file substringparser.hpp.

Referenced by ParseNextSymbolFromInput(), SubstringParser(), and SynchronizeStateForNewStream().


The documentation for this class was generated from the following files:  
Generated on 20 May 2004 by doxygen 1.3.3