Lexical Access for Speech Understanding using Minimum Message Length Encoding
Ian Thomas, Ingrid Zukerman, Jonathan Oliver, David Albrecht, Bhavani Raskutti
Department of Computer Science, Monash University, Clayton, Victoria, AUSTRALIA ({iant,ingrid,jono,dwa}@cs.monash.edu.au)
Artificial Intelligence Section, Telstra Research Laboratories, Clayton, Victoria, AUSTRALIA (b.raskutti@trl.oz.au)
Abstract
The Lexical Access Problem consists of determining the intended sequence of words corresponding to an input sequence of phonemes (basic speech sounds) that come from a low-level phoneme recognizer. In this paper we present an information-theoretic approach based on the Minimum Message Length Criterion for solving the Lexical Access Problem. We model sentences using phoneme realizations seen in training, and word and part-of-speech information obtained from text corpora. We show results on multiple-speaker, continuous, read speech and discuss a heuristic using equivalence classes of similar sounding words which speeds up the recognition process without significant deterioration in recognition accuracy.
INTRODUCTION
The Lexical Access Problem consists of determining the sequence of words that corresponds to an input sequence of phonemes (basic speech sounds). A lexical access component is a major part of a speech recognition system which discovers the sentences (composed of words from a lexicon) that correspond to speech signals. If the sequences of phonemes we are given correspond precisely to words in the lexicon, we can imagine lexical access as a table lookup process, i.e., we simply select the words in the lexicon that have the same canonical phoneme sequences as the sequences in the input. In reality, however, matching input phoneme sequences to words in a lexicon is more difficult, as these sequences may have extra or missing phonemes, and some phonemes may have been transcribed incorrectly. These insertions, deletions and substitutions may be due to (1) mis-recognition, through poor equipment, bad recording conditions, or poorly trained phoneme models; or (2) mis-pronunciation, where a speaker has said a word in a different way to the lexicon's canonical versions of that word. Mis-pronunciation is caused primarily by different dialects and accents. For example, the word "another" may be pronounced as several different sequences of phonemes [1]:
Lexicon entry: another = ax n ah dh axr
Data: er n ah dh axr; en ah dh axr; ax n ah dh er
Moreover, lexical access on entire sentences has the added uncertainty of knowing neither the number of words in the sentence nor the start and end points for each word in the sentence. Determining the boundaries of words in continuous speech is an extremely difficult task due to the above-mentioned insertions, deletions and substitutions, and because often there are no word boundaries in the speech signal as a result of co-articulation, the "slurring" of sounds in continuous speech.
The co-articulation effect can occur in any phonemes in an utterance, causing phonemes to be affected by their surrounding phonemes, which in turn leads to further mis-recognitions. In practical systems, lexical access is performed by attempting different partitions of the input phoneme string, and then looking for the optimal fitting of postulated words over the different phoneme subsequences resulting from these partitions (Myers and Rabiner; Lee and Rabiner).
In this paper, we describe a lexical access method for speech understanding based on the Minimum Message Length (MML) principle (Wallace and Freeman). This principle provides us with a uniform and incremental framework for applying information from different sources, such as a language model, lexicon and prosodic information, to the lexical access problem. We first discuss related research. We then describe a method for evaluating a sentence in an information-theoretic framework, and describe the search through the set of possible sentences corresponding to a given input. We conclude by discussing results obtained on a test set of sentences.
RELATED RESEARCH
In our study, we have as input a string of phoneme symbols and hypothesize the sentence that corresponds to it. These symbols would be the best phoneme candidates from the output of a phoneme recognizer that works directly on the speech signal (Grayden and Scordilis).
Withdraw all phony accusations at once.
withdraw | all | phony | accusations | at | once
w ih dh d r ao | ao l | f ow n y | ae k y uw z ey sh ix n z | ae t | w ah n s
w ix th d r aa | aa l | f ow nx y | ae k y ux z ey sh en epi z | q eh t | w ah n t s
Figure 1: A typical TIMIT sentence aligned with canonical phonemes from words in the lexicon.
[1] Phonemes are described using ARPAbet symbols, which are used in the TIMIT corpus (Fisher et al., 1986).
In most high-performance speech recognition systems, the lexical access operation is integrated with phoneme recognition and language modeling in the form of a graph search through, in effect, a massive Hidden Markov Model (HMM) for sentences, e.g., (Gauvain et al., 1995; Jeanrenaud et al.). In our research, we isolate two parts of this process, namely word hypothesis generation and language modeling, in order to study their effect on the lexical access problem. Efforts in lexical access have often followed the hypothesize-and-test paradigm, where the waveform corresponding to a word is partitioned and each segment labelled according to relatively reliable information extracted from the signal (such as whether the segment is voiced or unvoiced). Word candidates that fit a given string of labels undergo a more detailed and time-consuming analysis to determine the candidate that best matches the waveform (Fissore et al.). We use a similar method in our system. On a sentence level, the analysis of the sentence waveform to postulate individual words is often guided by detected word boundaries (Murveit et al.).
We use a mixture of an order two and order three Markov chain for the language model, but our word model is based on phonetic similarity, and can be generated directly from the training data without algorithms such as the Forward-Backward Algorithm (Baum). In addition, unlike other methods for encoding acoustic and language models, e.g., (Bechet et al.; Zue and Lamel), we take an information-theoretic approach to representing data. Our work is similar to recent work on handwriting recognition (Bouchaffra et al.), although we use N-grams to estimate the parameters of the language model, while they use informative Dirichlet priors for estimating the same parameters.
Our study was done on the TIMIT corpus (Fisher et al., 1986), which is a collection of American-English read sentences with correct time-aligned acoustic-phonetic and orthographic (word-aligned) transcriptions. The data set contains sentences spoken by speakers from different dialect divisions across the United States. Each speaker says five phonetically-compact sentences and three phonetically-diverse sentences to give a good coverage of the phonemes in the language. The sentences were recorded using a high-quality, headset-mounted microphone in a noise-isolated room, and speakers were instructed to read prompts in a "natural" voice. This training set generated the training words and a lexicon of distinct words. An example of a TIMIT sentence aligned with canonical phonemes from words in the lexicon is given in Figure 1. The first row shows the words of the utterance, the second row the phonemes from the lexicon, and the third row the actual phonemes spoken by one of the TIMIT speakers.
The probability of a sentence $W^*$ can be calculated as
$$P(W^*) = P(w_1 w_2 w_3 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1}).$$
However, it is not feasible to reliably calculate the conditional probabilities $P(w_j \mid w_1 \ldots w_{j-1})$ for all words and all partial sentence lengths, therefore we estimate such probabilities by using an N-gram:
$$P(w_j \mid w_1 w_2 \ldots w_{j-1}) \approx P(w_j \mid w_{j-N+1} \ldots w_{j-1}).$$
N-grams over part-of-speech symbols and words can be collected from training texts. One difficulty encountered is that although TIMIT is excellent for providing many different examples of words in many different contexts, it is poor training data for a language model. This is due to the relatively small number of different sentences in the training corpus, but more particularly because the sentences are designed to be diverse and unusual. A language model trained on unusual sentences is unlikely to be generally useful, and its usefulness is even more questionable in the recognition of sentences from an unusual test set, as is the case in TIMIT. These facts make TIMIT a poor model for common English. This problem was handled by using TIMIT for the mappings between words and phoneme strings, and using a set of classics texts (works such as Virgil's Aeneid and Emily Bronte's Wuthering Heights) to extract the word and part-of-speech N-grams required for the language model. This data provided many more co-locations of words and part-of-speech symbols, and more varied ones, than was possible with the TIMIT data, including considerably more distinct part-of-speech trigrams.
A small training set would cause more zero-frequency N-grams to be encountered during testing than a large training set, forcing us to use lower-context N-grams to estimate the probabilities of the components of higher-context N-grams. This zero-frequency effect should be reduced by using a larger training set, but such a training set forces us to store far more N-grams than may ever be encountered in the input to the system. Finally, using different training sets produces different probabilities for the N-grams; we would hope that the chosen training set approximates the "correct" probabilities for any sentence in general English.
4 METHOD
Our approach to the lexical selection problem is based on the Minimum Message Length (MML) criterion (Wallace and Freeman). According to this criterion we imagine sending to a receiver the shortest possible message that describes a sequence of input phonemes. Now, this message may be composed of the given phoneme sequence or of a sequence of words that correspond to this phoneme sequence, i.e., a sentence. We postulate that a message that encodes a sequence of phonemes as a sentence will be shorter than a message that encodes it directly. Further, we postulate that the message describing the intended sentence will be among the sentences of shortest message length (hopefully the shortest). Thus, in finding the sequence of words that yields the shortest message given an input phoneme sequence we will have solved the lexical access problem. To find this sequence of words, we perform a search through a set of likely sentences, evaluating each sentence to find its message length.
MINIMUM MESSAGE LENGTH ENCODING
We use a message of two parts to describe a sequence of phonemes: (1) a model description segment that describes the word sequence that the string of phonemes represents; and (2) an object description segment that describes, for each phoneme in the input string, the deviation from its corresponding phoneme in the set of phonemes predicted by the word sequence. For example, the difference between "dh ae r" (a possible realization of "there") and the input sequence "dh ax r" is the substitution of "ax" for "ae". The message can be thought of as an explanation of the data. The first segment is a theory about the phonemes based on the words postulated for the sentence, and the second a description of the actual phonemes in terms of the model phonemes in the postulated words. If the theory explains the data well, then the data description will be short. If the theory is poor, then the data description will be longer, causing a longer message length. The "best" theory is the one with the shortest total message length. A complicated theory segment will not necessarily cause a long total message length, nor will a short theory automatically cause a short message length. The final length of a message depends both on the length of the theory and on how well the theory describes the data. The Minimum Message Length criterion is derived from Bayes' Theorem: $P(H \& D) = P(H) \times P(D \mid H)$, where $H$ is the hypothesis and $D$ is the data. An optimal code for an event $E$ with probability $P(E)$ has message length $ML(E) = -\log_2 P(E)$. Hence, the message length for a hypothesis together with the data is $ML(H \& D) = ML(H) + ML(D \mid H)$, which corresponds to the two parts of the message.
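The two-part message length can be illustrated with a minimal sketch. The function name `message_length` and the probabilities below are hypothetical, chosen only to show that a longer theory can still give the shorter total message when it explains the data well.

```python
import math


def message_length(p_hypothesis: float, p_data_given_hypothesis: float) -> float:
    """Return ML(H&D) = ML(H) + ML(D|H) in bits, with ML(E) = -log2 P(E)."""
    for p in (p_hypothesis, p_data_given_hypothesis):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    theory_bits = -math.log2(p_hypothesis)           # first part: the theory
    data_bits = -math.log2(p_data_given_hypothesis)  # second part: data given theory
    return theory_bits + data_bits


# Hypothetical probabilities (assumptions, not values from the paper).
candidates = {
    "short theory, poor fit": (1 / 4, 1 / 1024),
    "longer theory, good fit": (1 / 64, 1 / 2),
}
lengths = {name: message_length(*probs) for name, probs in candidates.items()}
best = min(lengths, key=lengths.get)  # hypothesis with the shortest total message
```

Here the longer theory is selected because the bits it saves in the data description outweigh the extra bits spent on stating the theory.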
The minimization of $ML(H \& D)$ is the criterion for model selection. The relationship between MML and Bayesian posterior maximization is discussed in (Oliver and Baxter).
EVALUATION OF A SENTENCE
The description of a sequence of phonemes is made up of two main parts: (1) Language Model, which describes the words; and (2) Phoneme Realization Difference, which describes the phonemes corresponding to this model, with an error function that describes the difference between the phonemes in the language model and the actual phonemes.
Language Model
Here we have the task of encoding the actual words that make up our sentence. Since a "sensible" sounding sentence would be better than a sentence composed of unrelated words, we require an encoding that will describe sensible sentences in as few bits as possible. To express how sensible a combination of words in a sentence is, we take into account the syntactic role of the words in this combination as well as the actual usage of these words. The former is done by preferring frequent part-of-speech combinations, e.g., an article followed by a noun, to infrequent ones, e.g., an article followed by another article; and the latter by preferring common word combinations. Let $W^* = w_1, w_2, \ldots, w_n$ be a sentence where word $w_i$ has been instantiated with part-of-speech $PoS_i$ (a word could correspond to different part-of-speech symbols).
We take the probability of sentence $W^*$ as:
$$P(W^*) = \prod_{i=1}^{n} P(w_i, PoS_i \mid \text{words before position } i).$$
At present, we use trigrams for parts-of-speech and bigrams for words, yielding:
$$P(W^*) = \prod_{i=1}^{n} P(w_i, PoS_i \mid w_{i-1}, PoS_{i-1}, PoS_{i-2}) = \prod_{i=1}^{n} P(w_i \mid PoS_i, w_{i-1}, PoS_{i-1}, PoS_{i-2}) \times P(PoS_i \mid w_{i-1}, PoS_{i-1}, PoS_{i-2}).$$
To reduce the complexity of the problem we assume that, given the part-of-speech of word $w_i$ and given word $w_{i-1}$, word $w_i$ is conditionally independent of the previous parts-of-speech $PoS_{i-1}$ and $PoS_{i-2}$:
$$P(w_i \mid PoS_i, w_{i-1}, PoS_{i-1}, PoS_{i-2}) = P(w_i \mid PoS_i, w_{i-1}),$$
and that, given the previous parts-of-speech $PoS_{i-1}$ and $PoS_{i-2}$, the part-of-speech $PoS_i$ is conditionally independent of the previous word $w_{i-1}$:
$$P(PoS_i \mid w_{i-1}, PoS_{i-1}, PoS_{i-2}) = P(PoS_i \mid PoS_{i-1}, PoS_{i-2}).$$
Using these conditional independence assumptions, we get:
$$P(W^*) = \prod_{i=1}^{n} P(w_i \mid PoS_i, w_{i-1}) \times P(PoS_i \mid PoS_{i-1}, PoS_{i-2}).$$
The part-of-speech trigrams are estimated from frequencies in a training corpus according to the formula
$$P(PoS_i \mid PoS_{i-1}, PoS_{i-2}) = \frac{F(PoS_i, PoS_{i-1}, PoS_{i-2})}{F(PoS_{i-1}, PoS_{i-2})},$$
where $F(x)$ is the frequency of $x$ seen in the training data. A similar formula is used to estimate $P(w_i \mid PoS_i, w_{i-1})$.
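The relative-frequency estimate above can be sketched as follows. This is only an illustration: the helper names `train_trigrams` and `trigram_probability` and the toy tag sequence are assumptions, and the context frequency is counted over the trigrams that start with the two-tag context so that the estimates for a given context sum to one.

```python
from collections import Counter


def train_trigrams(tag_sequence):
    """Count PoS trigrams and the two-tag contexts they start with."""
    trigram_freq = Counter()
    context_freq = Counter()
    for i in range(len(tag_sequence) - 2):
        prev2, prev1, tag = tag_sequence[i:i + 3]
        trigram_freq[(prev2, prev1, tag)] += 1
        context_freq[(prev2, prev1)] += 1
    return trigram_freq, context_freq


def trigram_probability(trigram_freq, context_freq, prev2, prev1, tag):
    """Estimate P(tag | prev1, prev2) as F(trigram) / F(context).

    Arguments are given in reading order: prev2 precedes prev1, which
    precedes tag. Returns 0.0 when the context was never seen.
    """
    context = context_freq[(prev2, prev1)]
    if context == 0:
        return 0.0
    return trigram_freq[(prev2, prev1, tag)] / context


# Toy training data (hypothetical tag sequence).
tags = ["art", "adj", "adj", "noun", "verb", "art", "adj", "noun"]
trigram_freq, context_freq = train_trigrams(tags)
p_seen = trigram_probability(trigram_freq, context_freq, "art", "adj", "noun")
p_unseen = trigram_probability(trigram_freq, context_freq, "art", "adj", "verb")
```

An unseen trigram receives probability zero under this plain estimate, which would correspond to an infinite code length; some form of smoothing or back-off is therefore needed in practice.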
The larger the number of different symbols used in an N-gram and the larger the value for N, the less likely it is that a given set of training data will have instances of every N-gram. We handle the Zero Frequency Problem by using a back-off procedure to a lower context N-gram. This back-off is indicated by an escape code of low probability that is added to the code space of the N-gram (Witten and Bell). For example, if we have no instances of a particular part-of-speech trigram, we indicate this using the escape code, and then use a bigram. For words, we back off from bigrams to unigrams (there is assumed to be one instance of every word in the lexicon in the unigram set). To illustrate the above process, consider the following sentence fragment:
PoS symbols: article adjective adjective noun
Words: the quick brown fox
which is encoded as the product of the following probabilities:
$$P(\text{the} \mid \text{eos}, \text{the} = \text{art}) \; P(\text{quick} \mid \text{the}, \text{quick} = \text{adj}) \; P(\text{brown} \mid \text{quick}, \text{brown} = \text{adj}) \; P(\text{fox} \mid \text{brown}, \text{fox} = \text{noun})$$
$$P(\text{art} \mid \text{eos}_1, \text{eos}_2) \; P(\text{adj} \mid \text{art}, \text{eos}_1) \; P(\text{adj} \mid \text{adj}, \text{art}) \; P(\text{noun} \mid \text{adj}, \text{adj})$$
(All sentences start with eos1 (end of string) and eos2 part-of-speech tags, and an eos word.)
A search through all possible sentences with this evaluation scheme would tend to generate the most common possible sentences, without regard to how closely they match our actual phoneme input. The second part of our model tempers this effect by incorporating the actual phoneme data.
Phoneme Realization Difference
We have a set of words and a partition of the stream of phonemes into the words in the set. We have described the words; now we can make a hypothesis about the phonemes that correspond to these words, and we can measure how close reality is to our hypothesis. Initially, we use the training set to generate for each word in the lexicon a set of possible phoneme realizations together with the frequency of each realization. We use an edit distance algorithm (Sankoff and Kruskal, 1983) to work out the realization of a lexicon word which is closest to our segment of actual phonemes. This algorithm yields an optimal alignment of the input phonemes with those in lexicon words using weights for insertions, substitutions and deletions which were obtained from training data (Thomas et al., forthcoming). For example, phonetically similar substitutions are given a low cost (e.g., vowels for vowels).
As stated above, in the framework of the MML principle, the central idea is that the message sent to a receiver must contain enough information so that the receiver can reconstruct the input phonemes. We postulate that the shortest message will be composed of the lexicon word whose realization best matches the given input, the realization in question, and a record of the operations required to transform this realization into the actual input phonemes. Recall that the language model is being used to send the lexicon word. Hence, at this stage we only need to send the other two components. For example, suppose we wish to send the phoneme string "ax n dx q er", and we hypothesize that it corresponds to the word "another".
The following realizations of this word have been seen in training (each with an associated frequency):
Word | Phonemes
another1 | ix n ah dh axr
another2 | ax n ah dh er
another3 | q ax n ah dh axr
another4 | q ax n ah dh er
another5 | q ax n ah dh ax
another6 | ix nx ah dh uh
another7 | er n ah dh axr
In this example, the edit distance algorithm chooses the second realization, and it takes
$$-\log_2 \frac{\text{Frequency of another}_2}{\sum_i \text{Frequency of another}_i} = -\log_2(3/12)$$
bits to indicate to the receiver which phoneme realization we are going to use. However, the input phonemes do not correspond exactly to this realization, so we calculate an optimal alignment between our hypothesis phonemes and the input phonemes. For instance, for the second realization, the optimal alignment is
"another2": ax n ah dh - er
input: ax n - dx q er
where "dh" and "dx" are aligned because they are phonetically similar (they have a low substitution cost). Since we have already specified which of the phoneme realizations to use, we have already encoded the hypothesis phonemes (the top line). We must now send the actual phonemes (bottom line), which are sent as a sequence of insertions, deletions and substitutions. This is performed in two stages: (1) sending insertions, and (2) sending the rest of the operations.
Sending insertions.
Insertions are special because they increase the number of operations in the alignment, and therefore have to be specified using position indicators. To send insertions, we first specify how many insertions there are in the alignment, so that the length of the part of the message that describes which phonemes were inserted can be determined. This information takes $-\log_2 P(N \text{ insertions})$ bits to transmit (in our example only one insertion was performed, namely "q" between "dx" and "er").
The probability of having N insertions is estimated from the training corpus as follows: 10% of the database of realizations seen in training is removed; each realization in this subset is optimally aligned with the remaining realizations for the same word in the database; and the number of insertions in the closest alignment is recorded. This process is repeated by removing a different 10% of the database until all phoneme realizations in the original database have been used once. Next, we need to specify what the inserted phonemes are and the position of the insertions in the alignment. It takes
$$\log_2 \binom{L}{N} = \log_2 \frac{L!}{(L-N)!\,N!}$$
bits to encode the positions of N insertions in an alignment of length L. In our example, the alignment length is 6 (the number of phonemes in the chosen lexical realization (5) plus the number of insertions (1)), yielding $\log_2 6$ bits. We complete the insertion portion of the message by adding to it information about the actual insertion performed. However, since we have already allocated space in the alignment for the insertion, it has turned into a substitution, which can be handled in the next step of the message sending process (it is now a "-" in the phoneme realization of the lexicon word, which is being substituted by an input phoneme).
Sending the rest of the operations.
This is performed from left to right in the alignment. Since we know the top line of the optimal alignment, the bottom line is conveyed by sending a list of conditional probabilities of each input phoneme given the corresponding intended phoneme, which are computed from the training corpus, e.g., for "another2" the probabilities are $P(\text{ax} \mid \text{ax})$, $P(\text{n} \mid \text{n})$, $P(\text{-} \mid \text{ah})$, $P(\text{dx} \mid \text{dh})$, $P(\text{q} \mid \text{-})$ and $P(\text{er} \mid \text{er})$. These probabilities correspond to two exact matches, a deletion, a substitution, an insertion and an exact match.
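The cost of the "another" example can be assembled in a short sketch. The realization frequency (3 out of 12) and the alignment come from the text; the probability of one insertion and all conditional phoneme probabilities below are hypothetical placeholders, since the trained values are not given, and the function names are illustrative.

```python
import math


def realization_bits(freq: int, total_freq: int) -> float:
    """Bits to name the chosen realization: -log2(freq / total)."""
    return -math.log2(freq / total_freq)


def insertion_position_bits(alignment_length: int, n_insertions: int) -> float:
    """Bits to place N insertions in an alignment of length L: log2 C(L, N)."""
    return math.log2(math.comb(alignment_length, n_insertions))


def operation_bits(alignment, cond_prob) -> float:
    """Sum of -log2 P(observed | intended) over the alignment, left to right."""
    return sum(-math.log2(cond_prob[(observed, intended)])
               for intended, observed in alignment)


# (intended, observed) pairs for "another2" against the input "ax n dx q er".
alignment = [("ax", "ax"), ("n", "n"), ("ah", "-"),
             ("dh", "dx"), ("-", "q"), ("er", "er")]

# Hypothetical P(observed | intended), keyed as (observed, intended).
cond_prob = {("ax", "ax"): 0.5, ("n", "n"): 0.5, ("-", "ah"): 0.125,
             ("dx", "dh"): 0.25, ("q", "-"): 0.125, ("er", "er"): 0.5}
p_one_insertion = 0.25  # hypothetical P(N = 1 insertions)

n_insertions = sum(1 for intended, _ in alignment if intended == "-")
total_bits = (realization_bits(3, 12)             # which realization
              + -math.log2(p_one_insertion)       # how many insertions
              + insertion_position_bits(len(alignment), n_insertions)
              + operation_bits(alignment, cond_prob))
```

With these placeholder values the realization choice costs 2 bits and the insertion position costs log2 6 bits, matching the figures in the worked example; the remaining terms depend entirely on the assumed probabilities.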
Exact matches are common and thus take only a few bits to encode. In this manner, an input that closely matches one of the phoneme realizations will be encoded in fewer bits than one that is very different.
Summary of Sentence Evaluation
In summary, given a sentence $W^* = w_1, w_2, \ldots, w_n$ where a word $w_i$ has part-of-speech $PoS_i$, and a partition of the input phoneme sequence into $n$ segments $phs_1, \ldots, phs_n$, we are trying to minimize the following:
$$\text{code\_length}(W^*) = \sum_{i=1}^{n} \left[ \text{code\_length}(w_i, PoS_i) + \text{code\_length}(phs_i, w_i) \right],$$
where $\text{code\_length}(w_i, PoS_i)$ is the number of bits required to send $w_i$ and $PoS_i$ (estimated from the language model), and $\text{code\_length}(phs_i, w_i)$ is the number of bits required to send $phs_i$ given $w_i$ (estimated from the phoneme realization difference). We write $\text{code\_length}(phs_i, w_i)$ instead of $\text{code\_length}(phs_i \mid w_i)$ because it is a computational function call.
$$\text{code\_length}(w_i, PoS_i) = -\log_2 \left[ P(w_i \mid PoS_i, w_{i-1}) \times P(PoS_i \mid PoS_{i-1}, PoS_{i-2}) \right]$$
$$\text{code\_length}(phs_i, w_i) = \min_{\text{realizations } j \text{ of } w_i} \left[ \text{code\_length}(phReal_{w_i,j}) + \text{code\_length}(\text{insertions}_{phReal_{w_i,j}}) + \text{code\_length}(\text{substitutions}_{phReal_{w_i,j}}) \right],$$
where $phReal_{w_i,j}$ is the $j$th realization of lexicon word $w_i$,
$$\text{code\_length}(phReal_{w_i,j}) = -\log_2 \frac{freq_{w_i,j}}{\sum_j freq_{w_i,j}}$$
$$\text{code\_length}(\text{insertions}_{phReal_{w_i,j}}) = \log_2 \binom{\text{length of alignment}}{\text{number of insertions}}$$
$$\text{code\_length}(\text{substitutions}_{phReal_{w_i,j}}) = -\sum_{\text{phonemes in alignment}} \log_2 P(\text{input phoneme} \mid \text{intended phoneme}).$$
The argument $j$ that yields $\text{code\_length}(phs_i, w_i)$ is the realization $j$ of lexicon word $w_i$ which yields the shortest encoding for the phoneme segment $phs_i$.
We have described a way to evaluate the quality of a sequence of word hypotheses that describe our input phonemes. But how do we search through the possible space of words and word boundaries? Evaluating all possible sets of words with all possible sets of word boundaries would be computationally prohibitive. Thus, we follow an "optimistic selection" principle to eliminate partial sentences that are deemed unpromising during processing. This principle is implemented as follows.
Each time a word is added to the current candidate partial sentences, those that take many more bits to encode than the best partial sentence at this stage of processing are eliminated from further consideration. This is because the partial sentences which take more bits to encode are unlikely to overcome this disadvantage and proceed to become the overall best when the whole sentence is eventually generated. In our implementation we use a modified version of the Level-building algorithm, which proposes word boundaries and expands all partial sentence hypotheses a word at a time (Myers and Rabiner). At the end of each "level" (new word) we prune the partial sentence hypotheses using a beam threshold (Lowerre and Reddy). This allows us to have a strict control over the total number of partial hypotheses during the search. The phoneme slots that are generated by the algorithm are filled with suitably sized words, so that a slot with a few phonemes is not filled with a long word and a slot with many phonemes is not filled with a short word. The resulting words are then assigned different part-of-speech tags, and evaluated with each tag as described above. To evaluate a particular word, optimal alignments are carried out between its phoneme realizations and the input phonemes. This process is quick due to the small number of phonemes in most words.
Table 1: Breakdown of the message length for two sentences from the TIMIT test set.
Best sentence:
word | PoS | PoS bits | word bits | phDiff bits | total bits
the | dt | 4.22 | 0.94 | 3.18 | 8.34
bank | nnp | 5.45 | 8.74 | 18.86 | 33.05
low | nnp | 2.76 | 13.40 | 4.80 | 20.96
Correct sentence (surviving entries only):
word | PoS | PoS bits
the | dt | 4.22
bungalow | nn |
pleasantly | rb |
near | in |
the | dt |
shore | nn |
Table 2: Phoneme realizations seen in training, for best and correct words of the first example of Table 1.
Correct sentence:
word | phoneme realizations | input phonemes
the | dh ax; dh ix; dh iy | dh ax
bungalow | b ah ng g ax l ow | b ah ng g el ow
was | w ax z; w ix z; w ah z | w ah z
pleasantly | p l eh z en t l iy | p l ah z en l ix
situated | s ih ch uw ey t ix d | s ix ch ax-h w ix dx ih d
near | n ih axr; n ih r; n ih er | n ih
the | dh ax; dh ix; dh iy | dh ix
shore | sh ao r | sh ao r
Best sentence:
word | phoneme realizations
bank | b ae ng k; b ay ng k
low | l ow
However, the evaluation of all the words in the lexicon (comprising thousands of words) for a given phoneme string segment is a time consuming process. To reduce the number of word candidates to be evaluated at each point, we generate a short-list of likely possibilities. This is achieved by first encoding the phoneme realizations of words obtained from training as broad sound group sequences (e.g., the input sequence "dh ax r", which is a realization of "there", may be encoded as "stop vowel glide") and classifying each of these sequences into Equivalence Classes, such that each class contains phoneme realizations that are similar to each other, but each class as a whole is different from every other. The number and composition of these classes are optimized using the MML principle (Thomas et al., 1996) on the current TIMIT data. During the search, a new string of input phonemes is placed in the "best" class according to a similarity measure which compares the sequences of broad sound groups corresponding to the input phonemes to those in each class. The words that correspond to the phoneme realizations in the chosen class become the short-listed candidates to be evaluated as described above. A disadvantage of using the short-list is that the input phonemes may be put in a class that does not contain an instance of the correct word, and therefore the correct word will not be evaluated. Thus, the use of the short-list increases speed of recognition at the cost of losing some accuracy.
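The short-listing idea can be sketched in simplified form. Everything specific here is an assumption made for illustration: the `BROAD_CLASS` mapping only follows the "stop vowel glide" example in the text, each equivalence class is taken to be one exact broad-group sequence (the paper optimizes the classes with MML instead), and the similarity measure is the ratio from Python's `difflib.SequenceMatcher`, which stands in for the unspecified measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical phoneme-to-broad-group mapping, consistent with the example
# encoding of "dh ax r" as "stop vowel glide".
BROAD_CLASS = {
    "dh": "stop", "b": "stop", "k": "stop",
    "ax": "vowel", "ae": "vowel", "ow": "vowel",
    "r": "glide", "l": "glide",
    "ng": "nasal",
}


def encode(phonemes):
    """Map a phoneme sequence to its broad sound group sequence."""
    return tuple(BROAD_CLASS[p] for p in phonemes)


def build_classes(realizations):
    """Group words by the broad-group sequence of each realization."""
    classes = defaultdict(set)
    for word, phonemes in realizations:
        classes[encode(phonemes)].add(word)
    return classes


def short_list(input_phonemes, classes):
    """Return the words in the class most similar to the input."""
    target = encode(input_phonemes)
    best = max(classes,
               key=lambda seq: SequenceMatcher(None, seq, target).ratio())
    return classes[best]


realizations = [
    ("there", ["dh", "ax", "r"]),
    ("the", ["dh", "ax"]),
    ("bank", ["b", "ae", "ng", "k"]),
    ("low", ["l", "ow"]),
]
classes = build_classes(realizations)
candidates = short_list(["dh", "ae", "r"], classes)
```

Only the words in the selected class are then passed to the full evaluation, which is where the speed gain comes from; a word whose realizations all fall in other classes is never evaluated, which is the accuracy cost noted above.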
5 RESULTS AND DISCUSSION
Table 1 shows the evaluation of a sample sentence from TIMIT's test set. We show the best sentence hypothesis after search, and also the sentence with the correct words and word boundaries for comparison. The table also shows the breakdown of the message into the number of bits required to encode the part-of-speech N-grams, the word N-grams given the part-of-speech symbol and the phoneme differences. A relatively large part-of-speech cost indicates an N-gram of low probability, but that cost may be offset by a relatively small cost to encode the word given that part-of-speech symbol or the phonemes given our chosen word. When the correct sentence requires fewer bits than the sentence found through search (as is the case with the example in Table 1), this indicates that the search was unable to find the optimal (correct) solution: the correct hypothesis was either not suggested or pruned out early. The converse suggests that the sentence actually found was a better hypothesis (a closer match between the phonemes, or a more likely word or phoneme sequence according to the language model) than the correct sentence. This could be due to the correct sentence being unusual and badly mismatched to data from training. An unusual sentence needs to have a good match between input phonemes and phoneme realizations of lexicon words in order to compensate for the low probability given by the language model. Table 2 shows the sample sentence of Table 1 in further detail, showing the input phoneme sequence, and some of the phoneme realizations of words for both the sentence found through search and the correct sentence. This example illustrates that it can be very difficult to distinguish between words that are phonetically similar given possible errors in the input; "bungalow" matches the input phonemes better than "bank low". However, the search did not suggest "bungalow" as a possible word candidate because it did not belong to the short-list generated for the input phonemes.
The left-hand side of Table 3 shows the results of the recognition of words in the standard core TIMIT test set, which contains different sentences from different speakers. None of the sentences were previously seen in training and none of the speakers were used in the training set.
Table 3: Average word error rates per sentence for differing levels of distortion, using word encoding. Columns: distortion threshold (10, 20, 30, 40); average number of words; then, for Part-of-Speech and Word Encoding (left) and for Word Encoding (right), the average number of insertions, deletions and substitutions, and the average word error rate.
Figure 2: Error rate for bits/phoneme of sentences. (Axes: bits/phoneme and word error rate.)
Table 3 shows the average number of words in the input sentences, and the average number of word insertions, deletions and non-exact substitutions in the alignment of the words in each of these sentences and the corresponding correct sentence. This is followed by the average word error rate, which is the sum of word insertions, deletions and non-exact substitutions divided by the true number of words. We further categorize these results according to the Distortion Rate.
This is an estimate of how closely an input sequence of phonemes matches the sequence of phonemes for the corresponding correct sentence. It is calculated by first extracting the phoneme sequences that make up the correct words in the input sentence according to the correct word boundaries, and then finding the optimal alignment of each of the phoneme realizations of each correct word with the input phoneme sequence for that word. The number of insertions, deletions and substitutions in the closest alignment of a phoneme realization and an input phoneme sequence is recorded for each of the words in the sentence. The distortion rate for a sentence is the ratio of the number of non-exact phoneme matches (insertions, deletions and non-exact substitutions) to the total number of phonemes in the input phoneme sequence. For example, the distortion rate for the sample sentence in Table 2 is calculated as the number of non-exact operations (2 from "bungalow", plus those from "pleasantly", "situated" and "near") divided by the number of phonemes in the input sequence.
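The distortion rate can be sketched as below. This is a simplification under a stated assumption: unit costs are used for insertions, deletions and substitutions, whereas the alignments in the paper use weights obtained from training data. The function names are illustrative, and the two-word example uses realizations and input phonemes taken from Table 2.

```python
def edit_distance(intended, observed):
    """Unit-cost edit distance: the number of non-exact operations
    (insertions, deletions, substitutions) in an optimal alignment."""
    prev = list(range(len(observed) + 1))
    for i, x in enumerate(intended, 1):
        cur = [i]
        for j, y in enumerate(observed, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def distortion_rate(words):
    """words: list of (realizations, input_phonemes), one entry per correct word.

    For each word the closest realization is used; the rate is the total
    number of non-exact operations divided by the number of input phonemes.
    """
    non_exact = sum(min(edit_distance(r, inp) for r in realizations)
                    for realizations, inp in words)
    total_input = sum(len(inp) for _, inp in words)
    return non_exact / total_input


# Two words from the sample sentence ("the bungalow ...").
sample = [
    (["dh ax".split(), "dh ix".split(), "dh iy".split()], "dh ax".split()),
    (["b ah ng g ax l ow".split()], "b ah ng g el ow".split()),
]
bungalow_ops = edit_distance("b ah ng g ax l ow".split(),
                             "b ah ng g el ow".split())
rate = distortion_rate(sample)
```

For "bungalow" the unit-cost alignment gives 2 non-exact operations, in agreement with the count quoted in the text; with trained weights the chosen alignment, and hence the count, could differ for other words.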
In the table, each line shows the average word error rate for all sentences which have a lower distortion rate than the value in the first column. In order to determine the effect that part-of-speech symbols have on the results, we analyzed the same set of sentences using a language model that contains word bigrams only (right-hand side of the table).
The language model was calculated using the formula
$$P(W^*) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}).$$
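As an illustration of the word-bigram formula above, the sketch below estimates bigram probabilities from a toy corpus and reports the cost of a sentence as the negative base-2 logarithm of its probability, i.e. a length in bits. Several details are assumptions rather than the paper's method: the start symbol `<s>` standing in for the word before the first one, the add-one smoothing, and all function names.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

START = "<s>"  # assumed stand-in for the context of the first word


def train_bigrams(sentences: Iterable[Sequence[str]]) -> Tuple[Counter, Counter, int]:
    """Return (context counts, bigram counts, vocabulary size)."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab = set()
    for words in sentences:
        padded: List[str] = [START] + list(words)
        vocab.update(words)
        contexts.update(padded[:-1])              # every word that has a successor
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams, len(vocab)


def bigram_prob(prev: str, word: str, contexts: Counter,
                bigrams: Counter, vocab_size: int) -> float:
    """P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


def sentence_bits(words: Sequence[str], contexts: Counter,
                  bigrams: Counter, vocab_size: int) -> float:
    """-log2 of the product of bigram probabilities over the sentence."""
    bits = 0.0
    prev = START
    for word in words:
        bits -= math.log2(bigram_prob(prev, word, contexts, bigrams, vocab_size))
        prev = word
    return bits
```

Under this sketch, a word order seen in training costs fewer bits than an unseen ordering of the same words, which is the property a shortest-message search relies on.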
Comparing the left and right sides of the table, we can see the improvement in recognition accuracy when part-of-speech information is incorporated in the language model.
The figure plots the number of bits per phoneme against the word error rate for each of the sentences. As the number of bits per phoneme increases, so too does the error rate. This would allow a recognizer to assess the likely accuracy of a found sentence: a high number of bits per phoneme would suggest a high error rate.
The current recognition accuracy could be improved by careful weighting of the probabilities, so that infrequent words still have a chance of being accepted if their phoneme realization is close to the input phonemes, or by using a more advanced method for modeling co-articulation between words than the one currently used (not described here due to space limitations). Other features of the signal, such as phoneme duration, could also be included as a further source of information.
Finally, the input to our system is a set of hand-marked phonemes, with no indication of the confidence of the markings. In a realistic system, the input would be a lattice of phoneme candidates and their probabilities. Such information could be factored into the encoding of a sentence; for example, if we have a high degree of confidence in a particular phoneme, then this might reduce the probability of non-matching substitutions involving this phoneme in the optimal alignment. Receiving a phoneme lattice as input would also allow us to compare our results with those obtained by other systems, e.g., (Rabiner and Juang, 1993).
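The observation that a high number of bits per phoneme suggests a high error rate could be used as a simple confidence check. The sketch below is one hedged way to do so: the function names, the tuple layout and the threshold value passed by the caller are all hypothetical, and a usable threshold would have to be chosen from held-out data.

```python
from typing import List, Tuple


def bits_per_phoneme(message_bits: float, num_phonemes: int) -> float:
    """Message length of a found sentence divided by its number of input phonemes."""
    if num_phonemes <= 0:
        raise ValueError("a sentence must contain at least one phoneme")
    return message_bits / num_phonemes


def flag_doubtful(hypotheses: List[Tuple[str, float, int]],
                  threshold: float) -> List[str]:
    """Return the sentences whose bits per phoneme exceed the caller's threshold.
    Each hypothesis is (sentence, message length in bits, number of phonemes)."""
    return [sentence
            for sentence, bits, count in hypotheses
            if bits_per_phoneme(bits, count) > threshold]
```

A recognizer could pass flagged sentences to a slower, more careful evaluation or ask the speaker to repeat them.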
CONCLUSION
We have shown the utility of Minimum Message Length Encoding for modelling spoken sentences. As with most corpus-based learning schemes, the quality of the training data is important. Given the diverse speakers and sentences involved, the recognition results are encouraging. Part-of-speech data generally reduced the number of bits required to send phoneme information. More importantly, it produced a shorter encoding of word hypotheses that were sensible in English, resulting in fewer word errors.
There was some concern over the lack of training data for the language model. This problem was reduced by using a larger body of classic texts, which supplied most common word N-grams and, more importantly, yielded a set of part-of-speech N-grams commonly used in English.
The use of short-lists of word candidates was shown to improve the recognition speed of the system substantially, by quickly finding likely candidates to investigate more carefully, with minimal effect on recognition accuracy.
We are considering performance enhancements to the search and to the evaluation of sentences, which is a major bottleneck of the system. Specifically, we are investigating a method for exploiting the optimistic selection principle during the evaluation of single words as well as partial sentence hypotheses. This allows a number of word evaluations to be carried out in lockstep, and allows unlikely word candidates to be eliminated during evaluation.
Acknowledgments
This research was supported in part by grant N016/099 from the Australian Telecommunications and Electronics Research Board, and ARC Fellowship F39340111.
References
Baum, L.E., An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes. Inequalities 3,
F., Meloni, H., and Gillies, P., Knowledge-based Lexical Filtering: The Lexical Module of the SPEX System. In Proc. of the Fifth Australian International Conference on Speech Science and Technology,
V., Srihari, R.K., and Srihari, S.N., Integrating Signal and Language Context to Improve Handwritten Phrase Recognition: Alternative Approaches. In Proc. of the Sixth International Workshop on AI & Statistics,
Proc. of the DARPA Speech Recognition Workshop, Report No. SAIC-86/1546.
Fissore, L., Micca, G., and Pieraccini, R., Strategies for Lexical Access to Very Large Vocabularies. Speech Communication 7, 355-366, 1988.
Gauvain, J.L., Lamel, L., and Adda-Decker, M., Developments in Continuous Speech Dictation using the ARPA WSJ Task. In ICASSP-Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing,
ICASSP-Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing,
J., Ng, K., Siu, M., and Gish, H., Reducing Word Error on Conversational Speech from the Switchboard Corpus. In ICASSP-Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing,
IEEE Transactions on Acoustics, Speech and Signal Processing 37(11),
Trends in Speech Recognition, Lea, W.A. (Ed.), Prentice-Hall, 340-360, 1980.
Murveit, H., Weintraub, M., Cohen, M., Bernstein, J., and Rudnicky, A., Lexical Access with Lattice Input. In ICASSP-Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing,
IEEE Transactions on Acoustics, Speech and Signal Processing
Rabiner, L., and Juang, B., Fundamentals of Speech Recognition, Prentice Hall, 1993.
Sankoff, D. and Kruskal, J.B., Addison Wesley, London, 1983.
Thomas, I.E., Zukerman, I., and Raskutti, B., Accounting for Pronunciation of Phonemes in Corpora. In Proc. of the Second Conference of the Pacific Association of Computational Linguistics, forthcoming.
Thomas, I.E., Zukerman, I., Oliver, J.J., and Raskutti, B., Lexical Access using Minimum Message Length Encoding. In PRICAI'96-Proc. of the Fourth Pacific Rim International Conference on Artificial Intelligence, Cairns, Australia, Springer-Verlag Berlin, 229-240, 1996.
Wallace, C.S. and Freeman, P.R., Estimation and Inference by Compact Coding. Journal of the Royal Statistical Society (Series B)
IEEE Transactions on Information Theory
V.W. and Lamel, L.M., An Expert Spectrogram Reader: A Knowledge-Based Approach to Speech Recognition. In ICASSP-Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing,
ICASSP-Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing,