Class LM is a part of speech (POS) based statistical language model implemented as a c++ library. The language model is a kind typically used in speech recognition, optical character recognition (OCR text readers) or any system that needs to predict the next word in a series given some noisy input. This type of language model attempts to predict which word is most likely to occur after one or more "context" words. This allows speech reconizers, for exmaple, to choose between acoustically similar words in a given context. ( e.g. "hat" is more likey than "had" to follow the phrase "my cowboy", on the other hand "had" is more likely than "hat" to follow "would have" )

Typically this modeling is done by counting the number of times certain pairs or triplets of words occur in a large body of text and calculating probabilities directly from those counts ( so called "bigram" and "trigram" models ). In general, the more context words that are used the better the model. On the other hand the returns diminish with length as the number of context words gets larger ( For most text domains. Some domains like letter by letter spelling benefit greatly from large numbers of context words up to about 15. ), and memory use grows exponentially in a fully populated model. In practice a large number of probability cells are empty. However even when using 'sparse' data structures to represent this data memory still grows extremely quickly. For general dictation purposes few systems use models with more than 2 context words ( that is "trigram" models ).

Class LM can both improve memory use and performance by putting all the vocabulary words into "classes". The classes are generated by an entropy minization ( or mutual information maximization ) clustering algorithm based on the word by word bigram and trigram statistics. Interestingly these classes look very much like part of speech classes. Nouns in one cluster, verbs in another etc. Then bigram and trigram statistics are calculated for the classes. ( that we count the number of times word in one class follow words in other classes ) These probability tables are much much smaller than the complete word based trigram tables would be ( by a factor of 10,000 ). This is the difference between having to go to disk for the probabilies and having them all in memory. This means a huge improvement in speed of execution. The class LM is also more accurate in cases of undertraining.

The code presented here is for educational purposes only and may not be used in any publicly distributed software, either commercial or public domain. Individuals are free to use it for any purpose so long as the code - or its compiled form are not redistributed. The algorithms are all well known and free to be used for any purpose. Most of this software is copyright Rafael Baptista, except the memory management code which is a modified form of a memory manager by Chris Kingsley which came with perl 4.0.1.1.


28-Dec-94

1. theory
	a. paper
	b. grow
	c. classify	
	d. blast	
	e. lms
		i. mono
		ii. bigram
		iii. class bi
		iv. class tri
		v. lambda smoothing

2. implementation
	a. monogrmr
	b. bigrmr
	c. classer
	d. class ext
	e. lm
		-ii. perplexity
		-i. verify
		i. sclm
		ii. dclm
		iii. lslm
		iv. mlm
		v. blm
		vi. clm
		vii. tclm

Section 1: Theory

In their paper, "Structural Tags, Annealing and Automatic Word Classification",
J. McMahon and F. J. Smith describe a method for automatically classifying
words. Each word is assigned to a  numbered class. All words start out in
the same class. The words are then divided into two classes such that the
average class mutual information is maximized:

	M(f) = Sum   P( Ci, Cj ) * log( P( Ci, Cj ) / ( P( Ci ) * P( Cj )))
	       Ci,Cj

where f represents a given classification. At each iteration, each class is
split into two and the words in that class are redistributed in the two
classes, maximizing the information again. M(f) may be maximized using the
following simplified formula:

	M'(f) = Sum  P( Ci, Cj ) * log P( Ci, Cj ) - 2 Sum P( Ci ) log P( Cj )
		Ci,Cj				       Ci

As the number of classes n and number of words to be classified w gets large
the memory requirements grow as O(n^2 + w^2 + w). The amount of computation
also grows at least as O( n * w ) but the stopping condition is indeterminate,
and tends to really grow as w gets large. On our current SGI's 5000 words is
the practical limit, after which the classification tend to take more than
several days. A more efficient way to classify large numbers of words is
to first classify only some large but practical number of words into classes,
typically I have been classifying the top 2000 words of English into about 300
classes. After that each new word to be classified is just put into the class
that best satisfies the M(f). The words already in the class are not allowed
to move from class to class as words are added. With this method we can
classify up to about 80,000 words practically.

Once all of the words have been classified according to some large text,
to combat overtraining and limit the number of classes we then calculate all
of the class probabilities on a new, unseen text and test what happens to M(f)
when we merge any two classes. Classes that can be merged without hurting
M(f) more than some threshold are merged.

After the process of removing classes and adding words to classes after
the classification is done, it is likely that some words would now fit better
in some class other than the one they are currently in so we can allow a final
pass where we test each word against all the other classes. If there is a
better ( lower entropy ) match, we move the word into that class. We continue
to iterate over the words until we reach a local minimum ( moving single
words no longer decreases entropy ). Typically this process ends after a few
iterations, but in principle there is no way to predict how many iterations
it will take. It is possible for this process to go on forever if there is one
group of words that always 'runs away' from another group that is always
'chasing' them.

Finally, various language models can be constructed using the transition
probabilities of the derived classes. In the evaluation of this classification
algorithm, four different language models have been constructed: a word
monogram model, a word bigram model, a class bigram model, and
a class trigram model. The various models have also been used in weighted
combinations ( lambda smoothing ).

The models can be combined by a weighted averaging of their output
probabilities:

	P(t) = Sum ( Pe(t) * Le )
	 	e

where e is an index over the various language models and the set L is
a set of co-efficients where

	Sum ( Le ) = 1
	 e

These coefficients ( lambdas ) can be trained using the 'forward-backward'
algorithm. Given a training text with t trials, a set of e experts
( models ) and a set of e lambdas, a new set of better lambdas can be
derived by calculating a quantity ct for each expert ( ct is frequently
referred to as the count ). For experts i and j:

	ct(i) = Sum (( Li Pi(t) ) / Sum ( Lj Pj(t) ))
		 t		     j

Each lambda is then updated:

	L(i) = ct(i) / t

By iterating over this algorithm the lambdas will converge to a global
optimum if the set of trials is kept constant at each iteration. These
lambdas will of course be optimum for the training text. It is very likely
that for a given unseen text different but alike ( to the extent that training
and testing texts are alike ) lambdas will be the optimum.

Therefore a run-time lambda 'tuning' algorithm that changes the lambdas
during recognition would improve performance.

Section 2: Implementation

This section describes a C++ implementation of the above algorithms. The code
has five major sections:

	1. Text statistics containers ( monogram, and bigram counters,
	   sorting algorithms etc. ) Contained in the files:
		monogrmr.cc
		monogrmr.h
	2. Word classification algorithms, in the files:
		classer.cc
		classer.h
	3. Language model interface and various language models, lambda
	   smoothing and training, and LM statistical analysis tools. In
	   the files:
		lm.cc
		lm.h
	4. Support code, in the files:
		dblarray.h		a templatized doubling array class
		name.h			a string class
		data.h			a simple text buffer container
		gmalloc.cc		a fast malloc'er
		config.h
		config.cc		a configuration file manager
	5. The top level of various programs:
		makelm.cc		makes LM's from raw text
		lmsmooth.cc		given a set of LM's makes a single
					lambda smoothed model
		lmfilter.cc		reads in an LM and a text file and
					returns test set perplexity results
		sizeof.cc		dumps the size of various structures
		printlm.cc		prints the LM statistics in ascii
		testmono.cc		a simple test jig

2.1 Statistics container classes:
2.1.1 PMONO and MGE, a monogram statistic container.

MGE stands for monogram entry. It has room in it for a word orthography, a
count of the number of times a word has occurred and a 'frequency code'
that represents the position of the word in a frequency ordered list.

PMONO is basically a list of MGE's. PMONO makes no attempt to impose an
ordering on the list. Its role is simply count the number of times each word
occurs in text. Given a word orthography, PMONO provides fast access to the
MGE's by keeping them in a series of hashed lists ( _monograms[] ). PMONO
also keeps three counters, one keeps a sum of the all the counts of all the
words ( _total_mono ), another keeps a count of the number of unique words
seen so far ( _num_u_words ) and a third ( _tail_mono ) is explained below.

At construction time PMONO contains no MGE's, just a series of empty lists and
the two counters set to zero. Buffers of text are passed to PMONO via the
'void ProcessBuffer( char* )' method that counts the number of occurrences of
each word in the text. In order for its statistics to agree with the bigram
container class BIGRMR ( below ) '_total_mono' is incremented by one for each
call to 'ProcessBuffer' to simulate the occurrence of an unknown word before
and after the given buffer.

PMONO also provides a 'void Add( PMONO& )' method to add the statistics
accumulated in two PMONO's. Statistics are accumulated in the implicit
argument. The explicit PMONO is unaffected.

When the amount of text read is large, the number of words in the tail,
that is words which accumulate only very small counts ( proper nouns,
misspelled words, digit strings etc. ) can become quite large. These words
will not play a role in the later language model building stage and can be
purged in the PMONO. A method 'void Truncate( unsigned int num )' is provided
that will keep only the 'num' words with the highest frequencies. The counts of
the words that are removed are kept in '_tail_mono'. If the PMONO has been
incorporated into a MONOGRMR ( below ) care should be taken not to cut too
deeply or else the MONOGRMR object may be left with pointers into purged MGE's.
Typically is a LM with say 20,000 words is desired, then once the monogram
statistics are accumulated, the PMONO can be truncated down to say 30,000
unique words. It is a good idea to truncate well above the minimum required
if more statistics will be accumulated because the ordering of the words will
be different as more statistics available. It is important to avoid the
situation where a word that was below the truncation threshold would have,
with more statistics climbed up the ranks to become among the counted
words in the LM. In this case that word would be unfairly penalized.

2.1.2 MONOGRMR - ordered monogram statistics container

MONOGRMR provides a frequency ordering of the top n most frequent words in a
PMONO. MONOGRMR contains an array of frequency ordered pointers to the MGE's
in a PMONO ( _freq_array ), a counter ( _num_words ) that contains the number
of frequency ordered words, and a pointer to the original PMONO ( _pmono ).
MONOGRMR also contains two items ( _time_stamp and _registry ) explained below.

At construction time, pointers to MGE's in the client PMONO are put in an
array, the array is sorted and the top n MGE*'s are copied into _freq_array.
All the words that are left over and the count in PMONO::_tail_mono are summed,
and the sum is placed into a special 'word', "*OTHER*", with its own MGE which
is not part of the client PMONO. The MGE's that make it into _freq_array are
each assigned a '_freq_code' according to their placement on the list.
"*OTHER*" is '_freq_code' 0 by convention, as are all the MGE's not in
_freq_array.

It is sometimes necessary for two MONOGRMR's grown with different statistical
sources to need to have a system of freq_codes that agree. Of course one of
MONOGRMR's will end up with fc's that are not truly frequency ordered, but if
both MONOGRMR's are trained on similar large texts is is likely that the
ordering differences will just be 'noise'. The 'void SetFCs( MONOGRMR& )'
will reorder the fc's of the implicit argument to agree with those
of the explicit argument. If the implicit argument does not have an MGE for
a word that is in the new ordering, a new MGE will be created, with a count
of zero, and a warning will be issued. Generally if this happens, the two
MONOGRMR's whose fc's are being matched are different enough that real
information is being lost by the reordering process.

2.1.3 BIGRMR and B_ENTRY - an ordered bigram statistic container class

A bigram count is the number of times one word follows another in a given
text. The count for "A B" is different than "B A".

BIGRMR collects a matrix of bigram statistics for pairs of the top 'n'
words. If the table is small enough, n < MATRIX_MAX,  it is just kept in a
matrix, otherwise a sparse matrix implementation is used. Once all the
statistics have been collected BIGRMR can be converted into a 'static
sparse matrix' which is much more compact, but does not allow the addition
of new words into the matrix. In addition to the various fields required
to implement the various matrix implementations, BIGRMR has, a pointer to
the MONOGRMR that provides the ordering used to choose the top 'n' words
( _monogrmr ), and a count of the 'n' being used ( _num_words ). BIGRMR
also has _time_stamp and _registry members explained below.

BIGRMR is constructed using a MONOGRMR which provides the ordering and
the number of words to collect statistics for. A table is initialized
with zero counts, and the '_num_words' counter is set to zero. If the
matrix is small enough a simple array of unsigned int's is allocated
( _bigrams ). If it is large enough then a matrix of DBLARRAY< B_ENTRY >'s
is allocated ( _sparse_bi ).

Each B_ENTRY represents a single bigram count ( int '_val' ). It also contains
the most significant bits of the FC coordinates of the bigram
( '_x' and '_y' ).

Bigrams are accessed using the two FC's of their members as coordinates into
the matrix. In the sparse matrix implementation the least significant bits of
the coordinates are used to access a particular DBLARRAY< B_ENTRY >. The the
array is searched to the B_ENTRY that matches the unused portion of the
coordinates. The fact the least significant bits are used to access the array
itself tends to cause the DBLARRAY's to be allocated evenly, each array getting
one member from the most populous submatrix, 3 from the next three most
populous matrixes etc. Also in most cases the the arrays will be in frequency
order with the most frequent item accessed first.

To add a word into the sparse matrix, access is attempted, if the word is not
found, then a new B_ENTRY is added to the appropriate list. This is why
DBLARRAY's are used instead of simple arrays. Once the statistics are all
accumulated and no new entries will be required this flexible sparse
implementation can be converted to a static one with the 'void MakeStatic()'
method. All of the bigram counts are copied into an array ( _st_sp_bigrams ),
and all of the pairs of coordinates in the B_ENTRY's are copied into an
array of char ( _minor_idx ). The odd entries are the '_x' coordinates, and
even are the '_y'. An array of ints ( _major_idx ) has indexes into the
starting and ending points of the old DBLARRAY's that have been copied. So
the first item in '_major_idx' is 0 which means the contents of '_sparse_bi[0]'
are now stored in offset 0 of '_minor_idx' and '_st_sp_bigrams'. The second
entry tells where the first array ends, and the second begins. Access is much
like the sparse matrix implementation above: the least significant bits are
used to index '_st_sp_bigrams' and find out where the beginning and end of
the sub-array is. Then we loop through the array attempting to match the
most significant portion of the coordinates. If a match is found, we index
'_st_sp_bigrams' at that offset.

As in MONOGRMR, statistics are accumulated using a
'void ProcessBuffer( char* )' method. All pairs of words are counted. Pairs
where one or both words are not above the frequency threshold are counted as
statistics where one or both words are mapped to FC 0, or the "*OTHER*" word.
In order to make all the statistics compatible we assume that for each
call to 'ProcessBuffer', the first bigram is "*OTHER*" and the first word, and
that the last bigram is the last word and "*OTHER*". In order to make the
MONOGRMR and BIGRMR agree, an extra count is added to the MONOGRMR, as
described above.

BIGRMR also provides a 'void Verify()' method that checks that all the
statistics between the MONOGRMR and BIGRMR and within the BIGRMR corroborate.
So the sum of the column x of the bigram table is the same as the sum of
the row x of the table and is the same as the monogram count of the word
with fc x. This check works partly because of the normalizations described
above.

2.1.4 Word hashing

In order to count statistics on millions of words of text it is important
to be able to convert from orthographies into a numerical representation,
like a frequency code quickly. In order to do this, the MONOGRMR, which stores
this information, relies on a hashing scheme where orthographies are quickly
transformed into numbers which can be used to look up the FC. The hash numbers
are not unique, but in general should be evenly distributed among the world
of text strings, and should be computed fast. A macro ( HASH_WORD ) is used
for this purpose which sums the odd and even characters a in text string in
the low and high bytes of an unsigned short.

2.1.5 time stamps and registry

The monogrmr and bigrmr data structures can be quite large. They are in turn
used by higher level objects ( typically LM's ). Because they are large the
higher level objects refer to them through pointers and many objects may
point to the same MONOGRMR or BIGRMR. It is sometimes necessary to write
these higher level structures to disk and it is convenient to keep a copy of
the appropriate MONOGRMR of BIGRMR in each file. When the files are read back
it would be a waste to allocate many versions of the same large data
structures. At the time a MONOGRMR or BIGRMR is created the system clock
time is polled and put into '_time_stamp'. This field is them used as a unique
identifier of that object. '_registry' is a static array of pointers to
objects of the appropriate class. So for example all objects of class MONOGRMR
have a pointer to MONOGRMR::_registry and in turn MONOGRMR::_registry
has a pointer to each object that has been read from a file. Objects that
are created at run-time are not registered. When an object that contains
a MONOGRMR or BIGRMR is read from a file, the read routine should check all the
other objects in the '_registry' to see if the '_time_stamp's match.
is so the client object should deallocate the newly read MONOGRMR or BIGRMR
and refer to the one already in the registry. This way only one copy of each
MONOGRMR or BIGRMR will be read in from a file. It is possible on a machine
with a broken 'time_t time( time_t& )' system call, and very bad luck,
to have two different object share the same '_time_stamp'.

2.2 Word Classification algorithms

2.2.1 CLASSIFIER

Given a set of bigram and monogram counts, CLASSIFIER clusters words into
classes maximizing the predictive power of the classes. The details of the
algorithm are given in section one above. CLASSIFIER contains a pointer
to a BIGRAM container ( _bigrmr ), an array of shorts that hold a class
number for each word being classified ( _tags ), an array of pointers to
arrays of ints that hold the counts of class bigrams ( _bi_counts ), and
an array of ints with a monogram count for each class ( _mono_counts ). There
are also two counters, one holds the number of words classified ( _num_words ),
and the other contains the total number of classes defined so far by the
CLASSIFIER ( _num_pos ). The bigram table is always a matrix of
_num_pos * _num_pos. The monogram array is of length _num_pos and _tags is of
length _num_words.

CLASSIFIER is constructed from a BIGRMR. Initially there is only one class and
all of the words defined in the BIGRMR are assigned to that class. So
_num_pos = 1, the _tags array is initialized to every words having class
number 0, the _bi_counts table consists of a single entry at '_bi_counts[0][0]'
equal to the sum of the bigrams in the BIGRMR, and _mono_counts[0] is equal
to the sum of the monograms in the MONOGRMR. Although conceptually the class
monogram and bigram tables have only one entry, more space is allocated ahead
of time to reduce the amount of malloc'ing. The symbol CLS_MAX defines the
maximum number of classes that may be posited and is used to allocate
buffers at construction time.

The 'void Grow()' method is used to cause the CLASSIFIER to start splitting
each class in two, and distributing the words from the parent class into
the two children, maximizing the mutual information.

Testing the effect of moving a word from one class to another is accomplished
with the 'float Test( ... )' method. Test returns the change in entropy
due to moving the given word. To actually move a word from one class to another
the 'void Move( ... )' method is used. This function changes the '_tags'
value for a word and updates the class monogram and bigram tables.

The words are classified until splitting off new classes produces no decrease
in entropy. At this point the words are well trained on the input text but
may be overtrained with respect to similar unseen texts. To combat this
the 'void Blast( float Thresh )' method is used. First the CLASSIFIER's
statistics tables must be updated with statistics from another text, via a
call to the 'void Set( BIGRMR& )' method. 'void Blast()' tests the effects of
merging each pair of classes ( 'float MTest(...)' ), if the merge increases
the entropy by less than thresh, then the two classes are merged into one,
and the empty class is removed ( 'void Merge(...)' ).

The amount of computation required to classify words grows very quickly
as the number of words to be classified increases. It is sufficient to
classify a small number of words and then assign the remaining words into
established classes. This is done with the CLASS_EXT object described below.
Once CLASS_EXT has assigned words to classes, these words may be incorporated
into the CLASSIFIER with the 'void Extend( CLASS_EXT& )' method. This method
simply extends the '_tags' array to accommodate the new words and then updates
the class monogram and bigram tables according to the current '_bigrmr'.

After manipulating the class distributions with 'void Blast( float )' and
'void Extend( CLASS_EXT& )', the class distributions may no longer be at a
local minimum. The 'void Migrate( int )' method tries moving each word
from one class to another until a local minimum is reached.

After a number of classes have been merged it is possible for there to be
some number of classes with no member words. These classes would be taking up
space in the bigram table. The 'void Prune()' method removes array's from
the bigram table that are unused. The 'void Compress()' method moves changes
the class id numbers of classes so that the final set of classes has
contiguous numbers. It does not move words from class to class, but rather
just renames the classes.

The 'void Optimize(...)' methods simply offer a top level function to
apply some of the previously described functions.

2.2.2 CLASS_EXT, CLASS_EXT_BI, and CLASS_EXT_DATA

CLASS_EXT contains an extended word classification that can be added to the
word classification in a CLASSIFIER via the 'void CLASSIFIER::Extend(...)'
method. The CLASS_EXT object has an array of shorts ( _ext_pos ) that hold
the class id for each word that is part of the extended classification. The
indexes into '_ext_pos' are the FC's of the words with respect to the MONOGRMR
in the BIGRMR in the CLASSIFIER used to classify the words. The CLASS_EXT
object also has three counters, 'num_pos' represents the number of classes
used in the classification, '_num_words' represents the number of words
classified by the CLASS_EXT, and '_word_offset' represents the offset to the
word's FC that should be used to lookup the word's class in '_ext_pos'.
'_word_offset' is equal to the number of words already classified in the
given CLASSIFIER. The CLASS_EXT object is used only as a base class
for other CLASS_EXT subtypes that do the actual classification.

CLASS_EXT_DATA classifies words directly from input text, counting word
bigrams and monograms directly. CLASS_EXT_BI classifies words from a BIGRMR.
CLASS_EXT_BI is more efficient to use if the BIGRMR required already exists.

2.3 Language Models
2.3.1 The general LM interface

LM is the base class from which all language models are derived. It defines
a ( mostly abstract ) interface to the various LM's. There are several virtual
methods:
	'float LogProb( unsigned int )'		Returns the probability
						of the given word according to
						the model and an internal
						context.
	'unsigned int Id( char* )'		Given an orthography, return
						a context id. For many models
						this is the FC of a word
						relative to an internal
						MONOGRMR.
	'void SetContext( unsigned int )'	Sets the internal context of an
						LM ( if it has a context ).
	'int IsNotZero( unsigned int )'		Returns true if at least one
						observation of the target and
						the context was used to create
						the LM.
	'int Covered( unsigned int )'		Returns true if the target is
						not mapped to *OTHER*.
	'unsigned int NumWords()'		Returns the number of distinct
						targets available.
	'unsigned int NumPos()'			For LM's with classes, returns
						the number of classes being
						used.
	'unsigned int NumOther()'		Returns the number of targets
						mapped to *OTHER*.
	'LM_TYPE Type()'			Returns the type of the LM.
	'int ReadIn( FILE* )'			Reads the LM in from a file.
	'int WriteOut( FILE* )'			Writes the LM out to a file.

Reading in and writing out an LM through the generic LM interface requires
calling special non-class functions. The 'LM* ReadInLM( FILE* )' reads an
LM in from a file.  The first byte read represents the LM_TYPE of the
object that follows. This byte is then used to dispatch the correct LM
specific 'ReadIn' method. Likewise the 'int WriteOutLM( FILE*, LM* )'
function write out a byte representing the LM_TYPE and then dispatches the
appropriate 'WriteOut' method.

2.3.1.1 SCLM, DCLM

SCLM and DCLM are derived from LM. Many actual LM's ( like CLM, BLM etc. )
are in turn derived from one of these. These two classes manage the context
of a model. SCLM is used for models with only a single context, like the word
bigram language model ( BLM ), and DCLM is used for models with two contexts,
like the class trigram language model ( TCLM ). Both classes implement the
'void SetContext( ... )' method and the 'int Covered( ... )' method. The
contexts ( for SCLM '_pre' and for DCLM '_pre1' and '_pre2' ) are then
available to the derived classes.

2.3.2 LSLM

LSLM allows for the weighted combination of several LM's. The LM's are kept in
an array of LM*'s ( '_models' ) and the coefficients are kept in an array of
floats ( '_lambdas' ). A count of the number of LM's combined is kept in
'_num_lms'. The LM at '_models[0]' is considered the primary lm and is used
for all function calls where lambdas combination is not appropriate ( all
functions except 'float LogProb( int )' ). Typically the most powerful
language model should be the primary model ( BLM, CLM, and TCLM are usually
good choices ).

The lambdas should be computed by running 'void LamTune( ... )'. A
text representative of the recognition task should be passed in. At the
completion of the function, the lambdas will be tuned to within
LAMBDA_TUNE_THRESH of the 'ideal' values for that text. LAMBDA_CYCLE_SIZE
is the number of trials that will be read in between updates of the
lambdas. In general there is no advantage to making LAMBDA_CYCLE_SIZE larger
that the number of trial in the input text ( though it doesn't hurt ).
the convergence is more steady as LAMBDA_CYCLE_SIZE gets larger.

There is a lambda tuning algorithm that modifies the lambdas during recognition
to tune them for short runs of text during recognition. So as the text matches
a model better or worse during recognition, the lambdas will adapt. As of the
writing of this document this system is not yet working. The value LAMBDA_TWEAK
changes the amount by which the lambdas adapt at the end of each trial. A
value of 0 turns off the adaptation mechanism.

2.3.3 MLM

MLM is a monogram language model. Probabilities are produced according to word
frequencies without regard to any context. This is a poor model of a natural
language, but it is the a good model for most 'random' lists of words.

2.3.4 BLM

BLM is a bigram language model. Probabilities are assigned according to the
frequency of a word given its immediate predecessor. This is typically a
very powerful model, but can be extremely large with a great deal of
redundant information. In most real applications it will suffer from
under-training.

2.3.5 CLM

CLM is a class based bigram model. Probabilities are assigned according to
the frequency of a word given the class that it is assigned to and the class
of its predecessor. There are typically fewer than 300 classes and the model
is very small. It gives most of the performance of the BLM model in much less
space. This is partly due to the fact that it does not usually suffer from
under-training.

2.3.6 TCLM

TCLM is a class based trigram language model. It works just like CLM except
the previous two contexts are used. Typically it is smaller than the BLM
and slightly more powerful. Like BLM, it is difficult to fully train it.

2.3.7 XLM

XLM is a constant LM. It returns a constant probability based on the number
of targets it is told exist. In natural languages there are an infinite number
of potential target words so this model gives 'bogus' probabilities
( they do not sum to one ). This model should be used carefully. It is
sometimes useful in a lambda combination if the 'zeroes' of another model
are under estimated.

2.4 Support classes
2.4.1 DBLARRAY

DBLARRAY implements a doubling array templatized container class. It is
a simple form of a dynamic array that will allocate a new buffer of
double its current size when it runs out of allocation space. This
implementation does not use pointers but rather copies the top level
structure it is given into the array space. So for simple structures,
there is no need to malloc new little buffers each time. The same buffer
may be used and passed in and then re-used once DBLARRAY has copied the
contents out.

DBLARRAY_WO extends the DBLARRAY class to have 'ReadIn' and 'WriteOut'
methods. This class requires that the template class also have a
'ReadIn' and 'WriteOut' method.

DBLSTACK extends DBLARRAY to have 'Push' and 'Pop' methods so that the
DBLARRAY may be used like a L.I.F.O. stack.

UDBLARRAY imposes a uniqueness constraint, where two identical objects
may not be put on the same array. Each call to 'Add' checks if the new
object is already on the list and only adds objects that are not '=='
to objects already on the list. The template class must define an '=='
operation.

2.4.2 name

NAME is a trivial string class that allocates and de-allocates its own
memory.

2.4.3 data

DATA is a trivial class that allows a file name to be easily converted
into a buffer of text. The file is dumped complete into a buffer, and returned.

2.4.4 config

CONFIG implements configuration file handling. A CONFIG object is created
from a filename. The file is read and dumped into a buffer. The buffer
is then parsed and two arrays are created that point to the variable
names in one array and their values in another array.

The values of given configuration variables may then be accessed with the
appropriate 'GetParam()' method. The variable name and a pointer to
the program variable to be set are passed in. If the variable name is matched
in the config file, then the program variable is set according to the value
in the configuration file and an index into the config file line that was used
is returned. Otherwise -1 is returned. the returned index may be used to
search out all occurrences of the same variable in a configuration file.

2.4.5 gmalloc

gmalloc is a fast  malloc'er. It allocates pools of memory in 2^n sized blocks
and returns the smallest 2^n sized block that fulfills the user's request.
This in general wastes about 1/3 of program memory, but is very fast.

2.5 Programs
2.5.1 makelm

This program makes calculates various language models from input text data.
The program works in two stages. In the first stage it accumulates statistics
and classifies words. The second stage uses this data to create language
models. Each stage in turn has several sub-sections.

Roughly the first stage first collects monogram and bigram statistics
for growing classes. These statistics are distilled from text files
whose names are passed in after the -g option. Then words are classified,
with 'CLASSIFIER::Grow()'. Then monograms and bigrams for merging classes are
collected from files passed in after the -b option. Next classes are merged
using this data, with 'CLASSIFIER::Blast()'. Finally more words are classified
using the 'CLASSIFIER::Extend' method. There are three numerical options that
control this process. '-n' specifies the number of word used for the initial
classification. '-e' specifies the number of words to classify after the
initial classes are set. If '-e' is <= '-n', then '-e' has no effect. '-d'
specifies a threshold for merging the classes. '.00025' is a typical value.
Larger values will cause deeper pruning.

The second stage uses this data to make lm's. The '-k' option, specifies
the LM's to make. All LM types except LSLM can be specified with -k. For
LSLM, they must be specified with '-s'. '-l' specifies the file stem used in
naming the various lm files. '-t' specifies a list of text files for which
test set perplexity numbers will be generated. '-v' will cause the program to
verify that the LM's produced give fair probabilities ( that sum to 1 ).

Finally the first and second stages may be run independently. The '-w' option
causes the first stage statistics to be written out. '-r' option
causes the program to skip the first stage, read the statistics from a file,
and start right on to the second stage.

2.5.2 lmsmooth

lmsmooth reads in a set of LM's and lambdas and a training text and writes
out a lambda smoothed lm with lambdas optimized for the training text.
The program takes a single argument, the name of a configuration file.

The names of language models to be used should be specified with the
'language_model' parameter. The lambda to use for the given lm should be
specified with the 'lambda' parameter. The lambdas will be matched with
lm's specified in the order that they appear in the file ( first lambda
with first lm, second with second etc. ). The lambdas and LM's should match 1
to 1. The training text is specified with the 'lambda_training_text' parameter.

lmsmooth will also calculate test set perplexities for texts specified with
the 'test_data' parameter. Perplexities will be calculated automatically once
with lambda adaptation turned off. Lambda adapted perplexities can be
calculated by setting one or more 'lambda_tweak' parameters. Perplexities will
be calculated for each of the test files at each 'lambda_tweak' setting. A
constant LM will be smoothed into the new LM is 'use_constant_lm' is set to
true.

2.5.3 lmfilter

lmfilter simply loads an lm, given as the first argument, and produces
test set perplexities for a given text, the second argument.