Class LM is a part of speech (POS) based statistical language model implemented as a c++ library. The language model is a kind typically used in speech recognition, optical character recognition (OCR text readers) or any system that needs to predict the next word in a series given some noisy input. This type of language model attempts to predict which word is most likely to occur after one or more "context" words. This allows speech reconizers, for exmaple, to choose between acoustically similar words in a given context. ( e.g. "hat" is more likey than "had" to follow the phrase "my cowboy", on the other hand "had" is more likely than "hat" to follow "would have" )
Typically this modeling is done by counting the number of times certain pairs or triplets of words occur in a large body of text and calculating probabilities directly from those counts ( so called "bigram" and "trigram" models ). In general, the more context words that are used the better the model. On the other hand the returns diminish with length as the number of context words gets larger ( For most text domains. Some domains like letter by letter spelling benefit greatly from large numbers of context words up to about 15. ), and memory use grows exponentially in a fully populated model. In practice a large number of probability cells are empty. However even when using 'sparse' data structures to represent this data memory still grows extremely quickly. For general dictation purposes few systems use models with more than 2 context words ( that is "trigram" models ).
Class LM can both improve memory use and performance by putting all the vocabulary words into "classes". The classes are generated by an entropy minization ( or mutual information maximization ) clustering algorithm based on the word by word bigram and trigram statistics. Interestingly these classes look very much like part of speech classes. Nouns in one cluster, verbs in another etc. Then bigram and trigram statistics are calculated for the classes. ( that we count the number of times word in one class follow words in other classes ) These probability tables are much much smaller than the complete word based trigram tables would be ( by a factor of 10,000 ). This is the difference between having to go to disk for the probabilies and having them all in memory. This means a huge improvement in speed of execution. The class LM is also more accurate in cases of undertraining.
The code presented here is for educational purposes only and may not be used in any publicly distributed software, either commercial or public domain. Individuals are free to use it for any purpose so long as the code - or its compiled form are not redistributed. The algorithms are all well known and free to be used for any purpose. Most of this software is copyright Rafael Baptista, except the memory management code which is a modified form of a memory manager by Chris Kingsley which came with perl 4.0.1.1.
28-Dec-94 1. theory a. paper b. grow c. classify d. blast e. lms i. mono ii. bigram iii. class bi iv. class tri v. lambda smoothing 2. implementation a. monogrmr b. bigrmr c. classer d. class ext e. lm -ii. perplexity -i. verify i. sclm ii. dclm iii. lslm iv. mlm v. blm vi. clm vii. tclm Section 1: Theory In their paper, "Structural Tags, Annealing and Automatic Word Classification", J. McMahon and F. J. Smith describe a method for automatically classifying words. Each word is assigned to a numbered class. All words start out in the same class. The words are then divided into two classes such that the average class mutual information is maximized: M(f) = Sum P( Ci, Cj ) * log( P( Ci, Cj ) / ( P( Ci ) * P( Cj ))) Ci,Cj where f represents a given classification. At each iteration, each class is split into two and the words in that class are redistributed in the two classes, maximizing the information again. M(f) may be maximized using the following simplified formula: M'(f) = Sum P( Ci, Cj ) * log P( Ci, Cj ) - 2 Sum P( Ci ) log P( Cj ) Ci,Cj Ci As the number of classes n and number of words to be classified w gets large the memory requirements grow as O(n^2 + w^2 + w). The amount of computation also grows at least as O( n * w ) but the stopping condition is indeterminate, and tends to really grow as w gets large. On our current SGI's 5000 words is the practical limit, after which the classification tend to take more than several days. A more efficient way to classify large numbers of words is to first classify only some large but practical number of words into classes, typically I have been classifying the top 2000 words of English into about 300 classes. After that each new word to be classified is just put into the class that best satisfies the M(f). The words already in the class are not allowed to move from class to class as words are added. With this method we can classify up to about 80,000 words practically. Once all of the words have been classified according to some large text, to combat overtraining and limit the number of classes we then calculate all of the class probabilities on a new, unseen text and test what happens to M(f) when we merge any two classes. Classes that can be merged without hurting M(f) more than some threshold are merged. After the process of removing classes and adding words to classes after the classification is done, it is likely that some words would now fit better in some class other than the one they are currently in so we can allow a final pass where we test each word against all the other classes. If there is a better ( lower entropy ) match, we move the word into that class. We continue to iterate over the words until we reach a local minimum ( moving single words no longer decreases entropy ). Typically this process ends after a few iterations, but in principle there is no way to predict how many iterations it will take. It is possible for this process to go on forever if there is one group of words that always 'runs away' from another group that is always 'chasing' them. Finally, various language models can be constructed using the transition probabilities of the derived classes. In the evaluation of this classification algorithm, four different language models have been constructed: a word monogram model, a word bigram model, a class bigram model, and a class trigram model. The various models have also been used in weighted combinations ( lambda smoothing ). The models can be combined by a weighted averaging of their output probabilities: P(t) = Sum ( Pe(t) * Le ) e where e is an index over the various language models and the set L is a set of co-efficients where Sum ( Le ) = 1 e These coefficients ( lambdas ) can be trained using the 'forward-backward' algorithm. Given a training text with t trials, a set of e experts ( models ) and a set of e lambdas, a new set of better lambdas can be derived by calculating a quantity ct for each expert ( ct is frequently referred to as the count ). For experts i and j: ct(i) = Sum (( Li Pi(t) ) / Sum ( Lj Pj(t) )) t j Each lambda is then updated: L(i) = ct(i) / t By iterating over this algorithm the lambdas will converge to a global optimum if the set of trials is kept constant at each iteration. These lambdas will of course be optimum for the training text. It is very likely that for a given unseen text different but alike ( to the extent that training and testing texts are alike ) lambdas will be the optimum. Therefore a run-time lambda 'tuning' algorithm that changes the lambdas during recognition would improve performance. Section 2: Implementation This section describes a C++ implementation of the above algorithms. The code has five major sections: 1. Text statistics containers ( monogram, and bigram counters, sorting algorithms etc. ) Contained in the files: monogrmr.cc monogrmr.h 2. Word classification algorithms, in the files: classer.cc classer.h 3. Language model interface and various language models, lambda smoothing and training, and LM statistical analysis tools. In the files: lm.cc lm.h 4. Support code, in the files: dblarray.h a templatized doubling array class name.h a string class data.h a simple text buffer container gmalloc.cc a fast malloc'er config.h config.cc a configuration file manager 5. The top level of various programs: makelm.cc makes LM's from raw text lmsmooth.cc given a set of LM's makes a single lambda smoothed model lmfilter.cc reads in an LM and a text file and returns test set perplexity results sizeof.cc dumps the size of various structures printlm.cc prints the LM statistics in ascii testmono.cc a simple test jig 2.1 Statistics container classes: 2.1.1 PMONO and MGE, a monogram statistic container. MGE stands for monogram entry. It has room in it for a word orthography, a count of the number of times a word has occurred and a 'frequency code' that represents the position of the word in a frequency ordered list. PMONO is basically a list of MGE's. PMONO makes no attempt to impose an ordering on the list. Its role is simply count the number of times each word occurs in text. Given a word orthography, PMONO provides fast access to the MGE's by keeping them in a series of hashed lists ( _monograms[] ). PMONO also keeps three counters, one keeps a sum of the all the counts of all the words ( _total_mono ), another keeps a count of the number of unique words seen so far ( _num_u_words ) and a third ( _tail_mono ) is explained below. At construction time PMONO contains no MGE's, just a series of empty lists and the two counters set to zero. Buffers of text are passed to PMONO via the 'void ProcessBuffer( char* )' method that counts the number of occurrences of each word in the text. In order for its statistics to agree with the bigram container class BIGRMR ( below ) '_total_mono' is incremented by one for each call to 'ProcessBuffer' to simulate the occurrence of an unknown word before and after the given buffer. PMONO also provides a 'void Add( PMONO& )' method to add the statistics accumulated in two PMONO's. Statistics are accumulated in the implicit argument. The explicit PMONO is unaffected. When the amount of text read is large, the number of words in the tail, that is words which accumulate only very small counts ( proper nouns, misspelled words, digit strings etc. ) can become quite large. These words will not play a role in the later language model building stage and can be purged in the PMONO. A method 'void Truncate( unsigned int num )' is provided that will keep only the 'num' words with the highest frequencies. The counts of the words that are removed are kept in '_tail_mono'. If the PMONO has been incorporated into a MONOGRMR ( below ) care should be taken not to cut too deeply or else the MONOGRMR object may be left with pointers into purged MGE's. Typically is a LM with say 20,000 words is desired, then once the monogram statistics are accumulated, the PMONO can be truncated down to say 30,000 unique words. It is a good idea to truncate well above the minimum required if more statistics will be accumulated because the ordering of the words will be different as more statistics available. It is important to avoid the situation where a word that was below the truncation threshold would have, with more statistics climbed up the ranks to become among the counted words in the LM. In this case that word would be unfairly penalized. 2.1.2 MONOGRMR - ordered monogram statistics container MONOGRMR provides a frequency ordering of the top n most frequent words in a PMONO. MONOGRMR contains an array of frequency ordered pointers to the MGE's in a PMONO ( _freq_array ), a counter ( _num_words ) that contains the number of frequency ordered words, and a pointer to the original PMONO ( _pmono ). MONOGRMR also contains two items ( _time_stamp and _registry ) explained below. At construction time, pointers to MGE's in the client PMONO are put in an array, the array is sorted and the top n MGE*'s are copied into _freq_array. All the words that are left over and the count in PMONO::_tail_mono are summed, and the sum is placed into a special 'word', "*OTHER*", with its own MGE which is not part of the client PMONO. The MGE's that make it into _freq_array are each assigned a '_freq_code' according to their placement on the list. "*OTHER*" is '_freq_code' 0 by convention, as are all the MGE's not in _freq_array. It is sometimes necessary for two MONOGRMR's grown with different statistical sources to need to have a system of freq_codes that agree. Of course one of MONOGRMR's will end up with fc's that are not truly frequency ordered, but if both MONOGRMR's are trained on similar large texts is is likely that the ordering differences will just be 'noise'. The 'void SetFCs( MONOGRMR& )' will reorder the fc's of the implicit argument to agree with those of the explicit argument. If the implicit argument does not have an MGE for a word that is in the new ordering, a new MGE will be created, with a count of zero, and a warning will be issued. Generally if this happens, the two MONOGRMR's whose fc's are being matched are different enough that real information is being lost by the reordering process. 2.1.3 BIGRMR and B_ENTRY - an ordered bigram statistic container class A bigram count is the number of times one word follows another in a given text. The count for "A B" is different than "B A". BIGRMR collects a matrix of bigram statistics for pairs of the top 'n' words. If the table is small enough, n < MATRIX_MAX, it is just kept in a matrix, otherwise a sparse matrix implementation is used. Once all the statistics have been collected BIGRMR can be converted into a 'static sparse matrix' which is much more compact, but does not allow the addition of new words into the matrix. In addition to the various fields required to implement the various matrix implementations, BIGRMR has, a pointer to the MONOGRMR that provides the ordering used to choose the top 'n' words ( _monogrmr ), and a count of the 'n' being used ( _num_words ). BIGRMR also has _time_stamp and _registry members explained below. BIGRMR is constructed using a MONOGRMR which provides the ordering and the number of words to collect statistics for. A table is initialized with zero counts, and the '_num_words' counter is set to zero. If the matrix is small enough a simple array of unsigned int's is allocated ( _bigrams ). If it is large enough then a matrix of DBLARRAY< B_ENTRY >'s is allocated ( _sparse_bi ). Each B_ENTRY represents a single bigram count ( int '_val' ). It also contains the most significant bits of the FC coordinates of the bigram ( '_x' and '_y' ). Bigrams are accessed using the two FC's of their members as coordinates into the matrix. In the sparse matrix implementation the least significant bits of the coordinates are used to access a particular DBLARRAY< B_ENTRY >. The the array is searched to the B_ENTRY that matches the unused portion of the coordinates. The fact the least significant bits are used to access the array itself tends to cause the DBLARRAY's to be allocated evenly, each array getting one member from the most populous submatrix, 3 from the next three most populous matrixes etc. Also in most cases the the arrays will be in frequency order with the most frequent item accessed first. To add a word into the sparse matrix, access is attempted, if the word is not found, then a new B_ENTRY is added to the appropriate list. This is why DBLARRAY's are used instead of simple arrays. Once the statistics are all accumulated and no new entries will be required this flexible sparse implementation can be converted to a static one with the 'void MakeStatic()' method. All of the bigram counts are copied into an array ( _st_sp_bigrams ), and all of the pairs of coordinates in the B_ENTRY's are copied into an array of char ( _minor_idx ). The odd entries are the '_x' coordinates, and even are the '_y'. An array of ints ( _major_idx ) has indexes into the starting and ending points of the old DBLARRAY's that have been copied. So the first item in '_major_idx' is 0 which means the contents of '_sparse_bi[0]' are now stored in offset 0 of '_minor_idx' and '_st_sp_bigrams'. The second entry tells where the first array ends, and the second begins. Access is much like the sparse matrix implementation above: the least significant bits are used to index '_st_sp_bigrams' and find out where the beginning and end of the sub-array is. Then we loop through the array attempting to match the most significant portion of the coordinates. If a match is found, we index '_st_sp_bigrams' at that offset. As in MONOGRMR, statistics are accumulated using a 'void ProcessBuffer( char* )' method. All pairs of words are counted. Pairs where one or both words are not above the frequency threshold are counted as statistics where one or both words are mapped to FC 0, or the "*OTHER*" word. In order to make all the statistics compatible we assume that for each call to 'ProcessBuffer', the first bigram is "*OTHER*" and the first word, and that the last bigram is the last word and "*OTHER*". In order to make the MONOGRMR and BIGRMR agree, an extra count is added to the MONOGRMR, as described above. BIGRMR also provides a 'void Verify()' method that checks that all the statistics between the MONOGRMR and BIGRMR and within the BIGRMR corroborate. So the sum of the column x of the bigram table is the same as the sum of the row x of the table and is the same as the monogram count of the word with fc x. This check works partly because of the normalizations described above. 2.1.4 Word hashing In order to count statistics on millions of words of text it is important to be able to convert from orthographies into a numerical representation, like a frequency code quickly. In order to do this, the MONOGRMR, which stores this information, relies on a hashing scheme where orthographies are quickly transformed into numbers which can be used to look up the FC. The hash numbers are not unique, but in general should be evenly distributed among the world of text strings, and should be computed fast. A macro ( HASH_WORD ) is used for this purpose which sums the odd and even characters a in text string in the low and high bytes of an unsigned short. 2.1.5 time stamps and registry The monogrmr and bigrmr data structures can be quite large. They are in turn used by higher level objects ( typically LM's ). Because they are large the higher level objects refer to them through pointers and many objects may point to the same MONOGRMR or BIGRMR. It is sometimes necessary to write these higher level structures to disk and it is convenient to keep a copy of the appropriate MONOGRMR of BIGRMR in each file. When the files are read back it would be a waste to allocate many versions of the same large data structures. At the time a MONOGRMR or BIGRMR is created the system clock time is polled and put into '_time_stamp'. This field is them used as a unique identifier of that object. '_registry' is a static array of pointers to objects of the appropriate class. So for example all objects of class MONOGRMR have a pointer to MONOGRMR::_registry and in turn MONOGRMR::_registry has a pointer to each object that has been read from a file. Objects that are created at run-time are not registered. When an object that contains a MONOGRMR or BIGRMR is read from a file, the read routine should check all the other objects in the '_registry' to see if the '_time_stamp's match. is so the client object should deallocate the newly read MONOGRMR or BIGRMR and refer to the one already in the registry. This way only one copy of each MONOGRMR or BIGRMR will be read in from a file. It is possible on a machine with a broken 'time_t time( time_t& )' system call, and very bad luck, to have two different object share the same '_time_stamp'. 2.2 Word Classification algorithms 2.2.1 CLASSIFIER Given a set of bigram and monogram counts, CLASSIFIER clusters words into classes maximizing the predictive power of the classes. The details of the algorithm are given in section one above. CLASSIFIER contains a pointer to a BIGRAM container ( _bigrmr ), an array of shorts that hold a class number for each word being classified ( _tags ), an array of pointers to arrays of ints that hold the counts of class bigrams ( _bi_counts ), and an array of ints with a monogram count for each class ( _mono_counts ). There are also two counters, one holds the number of words classified ( _num_words ), and the other contains the total number of classes defined so far by the CLASSIFIER ( _num_pos ). The bigram table is always a matrix of _num_pos * _num_pos. The monogram array is of length _num_pos and _tags is of length _num_words. CLASSIFIER is constructed from a BIGRMR. Initially there is only one class and all of the words defined in the BIGRMR are assigned to that class. So _num_pos = 1, the _tags array is initialized to every words having class number 0, the _bi_counts table consists of a single entry at '_bi_counts[0][0]' equal to the sum of the bigrams in the BIGRMR, and _mono_counts[0] is equal to the sum of the monograms in the MONOGRMR. Although conceptually the class monogram and bigram tables have only one entry, more space is allocated ahead of time to reduce the amount of malloc'ing. The symbol CLS_MAX defines the maximum number of classes that may be posited and is used to allocate buffers at construction time. The 'void Grow()' method is used to cause the CLASSIFIER to start splitting each class in two, and distributing the words from the parent class into the two children, maximizing the mutual information. Testing the effect of moving a word from one class to another is accomplished with the 'float Test( ... )' method. Test returns the change in entropy due to moving the given word. To actually move a word from one class to another the 'void Move( ... )' method is used. This function changes the '_tags' value for a word and updates the class monogram and bigram tables. The words are classified until splitting off new classes produces no decrease in entropy. At this point the words are well trained on the input text but may be overtrained with respect to similar unseen texts. To combat this the 'void Blast( float Thresh )' method is used. First the CLASSIFIER's statistics tables must be updated with statistics from another text, via a call to the 'void Set( BIGRMR& )' method. 'void Blast()' tests the effects of merging each pair of classes ( 'float MTest(...)' ), if the merge increases the entropy by less than thresh, then the two classes are merged into one, and the empty class is removed ( 'void Merge(...)' ). The amount of computation required to classify words grows very quickly as the number of words to be classified increases. It is sufficient to classify a small number of words and then assign the remaining words into established classes. This is done with the CLASS_EXT object described below. Once CLASS_EXT has assigned words to classes, these words may be incorporated into the CLASSIFIER with the 'void Extend( CLASS_EXT& )' method. This method simply extends the '_tags' array to accommodate the new words and then updates the class monogram and bigram tables according to the current '_bigrmr'. After manipulating the class distributions with 'void Blast( float )' and 'void Extend( CLASS_EXT& )', the class distributions may no longer be at a local minimum. The 'void Migrate( int )' method tries moving each word from one class to another until a local minimum is reached. After a number of classes have been merged it is possible for there to be some number of classes with no member words. These classes would be taking up space in the bigram table. The 'void Prune()' method removes array's from the bigram table that are unused. The 'void Compress()' method moves changes the class id numbers of classes so that the final set of classes has contiguous numbers. It does not move words from class to class, but rather just renames the classes. The 'void Optimize(...)' methods simply offer a top level function to apply some of the previously described functions. 2.2.2 CLASS_EXT, CLASS_EXT_BI, and CLASS_EXT_DATA CLASS_EXT contains an extended word classification that can be added to the word classification in a CLASSIFIER via the 'void CLASSIFIER::Extend(...)' method. The CLASS_EXT object has an array of shorts ( _ext_pos ) that hold the class id for each word that is part of the extended classification. The indexes into '_ext_pos' are the FC's of the words with respect to the MONOGRMR in the BIGRMR in the CLASSIFIER used to classify the words. The CLASS_EXT object also has three counters, 'num_pos' represents the number of classes used in the classification, '_num_words' represents the number of words classified by the CLASS_EXT, and '_word_offset' represents the offset to the word's FC that should be used to lookup the word's class in '_ext_pos'. '_word_offset' is equal to the number of words already classified in the given CLASSIFIER. The CLASS_EXT object is used only as a base class for other CLASS_EXT subtypes that do the actual classification. CLASS_EXT_DATA classifies words directly from input text, counting word bigrams and monograms directly. CLASS_EXT_BI classifies words from a BIGRMR. CLASS_EXT_BI is more efficient to use if the BIGRMR required already exists. 2.3 Language Models 2.3.1 The general LM interface LM is the base class from which all language models are derived. It defines a ( mostly abstract ) interface to the various LM's. There are several virtual methods: 'float LogProb( unsigned int )' Returns the probability of the given word according to the model and an internal context. 'unsigned int Id( char* )' Given an orthography, return a context id. For many models this is the FC of a word relative to an internal MONOGRMR. 'void SetContext( unsigned int )' Sets the internal context of an LM ( if it has a context ). 'int IsNotZero( unsigned int )' Returns true if at least one observation of the target and the context was used to create the LM. 'int Covered( unsigned int )' Returns true if the target is not mapped to *OTHER*. 'unsigned int NumWords()' Returns the number of distinct targets available. 'unsigned int NumPos()' For LM's with classes, returns the number of classes being used. 'unsigned int NumOther()' Returns the number of targets mapped to *OTHER*. 'LM_TYPE Type()' Returns the type of the LM. 'int ReadIn( FILE* )' Reads the LM in from a file. 'int WriteOut( FILE* )' Writes the LM out to a file. Reading in and writing out an LM through the generic LM interface requires calling special non-class functions. The 'LM* ReadInLM( FILE* )' reads an LM in from a file. The first byte read represents the LM_TYPE of the object that follows. This byte is then used to dispatch the correct LM specific 'ReadIn' method. Likewise the 'int WriteOutLM( FILE*, LM* )' function write out a byte representing the LM_TYPE and then dispatches the appropriate 'WriteOut' method. 2.3.1.1 SCLM, DCLM SCLM and DCLM are derived from LM. Many actual LM's ( like CLM, BLM etc. ) are in turn derived from one of these. These two classes manage the context of a model. SCLM is used for models with only a single context, like the word bigram language model ( BLM ), and DCLM is used for models with two contexts, like the class trigram language model ( TCLM ). Both classes implement the 'void SetContext( ... )' method and the 'int Covered( ... )' method. The contexts ( for SCLM '_pre' and for DCLM '_pre1' and '_pre2' ) are then available to the derived classes. 2.3.2 LSLM LSLM allows for the weighted combination of several LM's. The LM's are kept in an array of LM*'s ( '_models' ) and the coefficients are kept in an array of floats ( '_lambdas' ). A count of the number of LM's combined is kept in '_num_lms'. The LM at '_models[0]' is considered the primary lm and is used for all function calls where lambdas combination is not appropriate ( all functions except 'float LogProb( int )' ). Typically the most powerful language model should be the primary model ( BLM, CLM, and TCLM are usually good choices ). The lambdas should be computed by running 'void LamTune( ... )'. A text representative of the recognition task should be passed in. At the completion of the function, the lambdas will be tuned to within LAMBDA_TUNE_THRESH of the 'ideal' values for that text. LAMBDA_CYCLE_SIZE is the number of trials that will be read in between updates of the lambdas. In general there is no advantage to making LAMBDA_CYCLE_SIZE larger that the number of trial in the input text ( though it doesn't hurt ). the convergence is more steady as LAMBDA_CYCLE_SIZE gets larger. There is a lambda tuning algorithm that modifies the lambdas during recognition to tune them for short runs of text during recognition. So as the text matches a model better or worse during recognition, the lambdas will adapt. As of the writing of this document this system is not yet working. The value LAMBDA_TWEAK changes the amount by which the lambdas adapt at the end of each trial. A value of 0 turns off the adaptation mechanism. 2.3.3 MLM MLM is a monogram language model. Probabilities are produced according to word frequencies without regard to any context. This is a poor model of a natural language, but it is the a good model for most 'random' lists of words. 2.3.4 BLM BLM is a bigram language model. Probabilities are assigned according to the frequency of a word given its immediate predecessor. This is typically a very powerful model, but can be extremely large with a great deal of redundant information. In most real applications it will suffer from under-training. 2.3.5 CLM CLM is a class based bigram model. Probabilities are assigned according to the frequency of a word given the class that it is assigned to and the class of its predecessor. There are typically fewer than 300 classes and the model is very small. It gives most of the performance of the BLM model in much less space. This is partly due to the fact that it does not usually suffer from under-training. 2.3.6 TCLM TCLM is a class based trigram language model. It works just like CLM except the previous two contexts are used. Typically it is smaller than the BLM and slightly more powerful. Like BLM, it is difficult to fully train it. 2.3.7 XLM XLM is a constant LM. It returns a constant probability based on the number of targets it is told exist. In natural languages there are an infinite number of potential target words so this model gives 'bogus' probabilities ( they do not sum to one ). This model should be used carefully. It is sometimes useful in a lambda combination if the 'zeroes' of another model are under estimated. 2.4 Support classes 2.4.1 DBLARRAY DBLARRAY implements a doubling array templatized container class. It is a simple form of a dynamic array that will allocate a new buffer of double its current size when it runs out of allocation space. This implementation does not use pointers but rather copies the top level structure it is given into the array space. So for simple structures, there is no need to malloc new little buffers each time. The same buffer may be used and passed in and then re-used once DBLARRAY has copied the contents out. DBLARRAY_WO extends the DBLARRAY class to have 'ReadIn' and 'WriteOut' methods. This class requires that the template class also have a 'ReadIn' and 'WriteOut' method. DBLSTACK extends DBLARRAY to have 'Push' and 'Pop' methods so that the DBLARRAY may be used like a L.I.F.O. stack. UDBLARRAY imposes a uniqueness constraint, where two identical objects may not be put on the same array. Each call to 'Add' checks if the new object is already on the list and only adds objects that are not '==' to objects already on the list. The template class must define an '==' operation. 2.4.2 name NAME is a trivial string class that allocates and de-allocates its own memory. 2.4.3 data DATA is a trivial class that allows a file name to be easily converted into a buffer of text. The file is dumped complete into a buffer, and returned. 2.4.4 config CONFIG implements configuration file handling. A CONFIG object is created from a filename. The file is read and dumped into a buffer. The buffer is then parsed and two arrays are created that point to the variable names in one array and their values in another array. The values of given configuration variables may then be accessed with the appropriate 'GetParam()' method. The variable name and a pointer to the program variable to be set are passed in. If the variable name is matched in the config file, then the program variable is set according to the value in the configuration file and an index into the config file line that was used is returned. Otherwise -1 is returned. the returned index may be used to search out all occurrences of the same variable in a configuration file. 2.4.5 gmalloc gmalloc is a fast malloc'er. It allocates pools of memory in 2^n sized blocks and returns the smallest 2^n sized block that fulfills the user's request. This in general wastes about 1/3 of program memory, but is very fast. 2.5 Programs 2.5.1 makelm This program makes calculates various language models from input text data. The program works in two stages. In the first stage it accumulates statistics and classifies words. The second stage uses this data to create language models. Each stage in turn has several sub-sections. Roughly the first stage first collects monogram and bigram statistics for growing classes. These statistics are distilled from text files whose names are passed in after the -g option. Then words are classified, with 'CLASSIFIER::Grow()'. Then monograms and bigrams for merging classes are collected from files passed in after the -b option. Next classes are merged using this data, with 'CLASSIFIER::Blast()'. Finally more words are classified using the 'CLASSIFIER::Extend' method. There are three numerical options that control this process. '-n' specifies the number of word used for the initial classification. '-e' specifies the number of words to classify after the initial classes are set. If '-e' is <= '-n', then '-e' has no effect. '-d' specifies a threshold for merging the classes. '.00025' is a typical value. Larger values will cause deeper pruning. The second stage uses this data to make lm's. The '-k' option, specifies the LM's to make. All LM types except LSLM can be specified with -k. For LSLM, they must be specified with '-s'. '-l' specifies the file stem used in naming the various lm files. '-t' specifies a list of text files for which test set perplexity numbers will be generated. '-v' will cause the program to verify that the LM's produced give fair probabilities ( that sum to 1 ). Finally the first and second stages may be run independently. The '-w' option causes the first stage statistics to be written out. '-r' option causes the program to skip the first stage, read the statistics from a file, and start right on to the second stage. 2.5.2 lmsmooth lmsmooth reads in a set of LM's and lambdas and a training text and writes out a lambda smoothed lm with lambdas optimized for the training text. The program takes a single argument, the name of a configuration file. The names of language models to be used should be specified with the 'language_model' parameter. The lambda to use for the given lm should be specified with the 'lambda' parameter. The lambdas will be matched with lm's specified in the order that they appear in the file ( first lambda with first lm, second with second etc. ). The lambdas and LM's should match 1 to 1. The training text is specified with the 'lambda_training_text' parameter. lmsmooth will also calculate test set perplexities for texts specified with the 'test_data' parameter. Perplexities will be calculated automatically once with lambda adaptation turned off. Lambda adapted perplexities can be calculated by setting one or more 'lambda_tweak' parameters. Perplexities will be calculated for each of the test files at each 'lambda_tweak' setting. A constant LM will be smoothed into the new LM is 'use_constant_lm' is set to true. 2.5.3 lmfilter lmfilter simply loads an lm, given as the first argument, and produces test set perplexities for a given text, the second argument.