Evolutionary optimization of contexts for phonetic correction in speech recognition systems
Rafael Viana-Cámara, Diego Campos-Sobrino, Mario Campos-Soberanis
SoldAI Research, Calle 22 No. 202-O, García Ginerés, 97070 Mérida, México
{rviana,dcampos,mcampos}@soldai.com

Abstract
Automatic Speech Recognition (ASR) is an area of growing academic and commercial interest due to the high demand for applications that use it to provide a natural communication method. It is common for general purpose ASR systems to fail in applications that use a domain-specific language. Various strategies have been used to reduce the error, such as providing a context that modifies the language model and post-processing correction methods. This article explores the use of an evolutionary process to generate an optimized context for a specific application domain, as well as different correction techniques based on phonetic distance metrics. The results show the viability of a genetic algorithm as a tool for context optimization, which, added to a post-processing correction based on phonetic representations, can reduce the errors on the recognized speech.
Keywords:
Speech recognition, Phonetic distance, Genetic algorithms.
1. Introduction
Automatic speech recognition (ASR) systems are of great relevance in academic and business environments due to the ease of interaction they offer. There has been a growing interest in investigating these systems, which have migrated from probabilistic models to deep neural network systems [5] that have become the standard for professional audio-to-text transformation applications. Deep neural network systems for audio-to-text transformation often use an acoustic model to perform recognition at a first level and later pass the result to language models for correction [9]. Commercial services generally operate as a black box, making it difficult for the user to modify the language models.

While ASRs generally perform well, they often run into problems when used to recognize specific language domains, so post-processing techniques become relevant [4]. Many of the post-processing and correction tasks of these systems use a context, understood as a set of words, phrases, and expressions related to the particular domain that is to be recognized. Some systems have mechanisms to receive a context with which they improve the recognition of certain words and phrases. However, in many cases, this is not enough to significantly improve their performance.

Two particularly interesting topics are the generation of contexts and the phonetic representation used for correction. Research has been carried out in this regard; however, there is a lack of experimentation on the joint operation of optimal context generation and phonetic representation.

This article presents a method for generating contexts using genetic algorithms to correct the output of the Google speech-to-text processing system. Next, a comparison of different critical strategies of the error correction process is carried out: representation of the sentence to be corrected, candidate selection, and comparison metrics.

The article is structured as follows: Section 2 describes the background of the problem and related work; Section 3 presents the methodology used for the investigation; Section 4 describes the experimental work carried out, the results of which are shown in Section 5; finally, Section 6 provides the conclusions along with some ideas for future work.
2. Background
The error correction algorithms in ASR systems have been approached from different perspectives, including phonetics. Kondrak [11] proposes an algorithm to calculate a metric of phonetic similarity between segments using multivalued articulatory phonetic characteristics. The Kondrak algorithm combines sets of edit operations with local and semi-global alignment models to calculate a set of near-optimal alignments.

Pucher et al. [16] present word confusion matrices using different measures of phonetic distance. The metrics presented are based on the minimum edit distance between phonetic transcriptions and the distances between hidden Markov models. Their research shows a correlation between edit distance and word confusion in ASR systems, so these types of corrections become useful for rectifying recognition errors.

In [2] the problem of using the edit distance to compare strings in languages like Korean, where characters represent syllables instead of letters, is highlighted. This is reflected in the fact that substituting one syllable for another yields the same cost regardless of the difference between its letters. The traditional solution uses hybrid metrics between characters and syllables; however, the authors argue that this approach does not satisfactorily solve the problem, so they propose an edit distance based on phonemes as a solution.

Droppo and Acero [9] use the phonetic edit distance to incorporate a third element of correction into ASR systems. They incorporate this distance to learn the relative probability of phonetic recognition strings, given an expected pronunciation. This strategy considers the context of the transcripts, changing the probability of correction depending on the words before and after.

Bassil and Semaan [4] employ a post-processing strategy for error correction in ASR systems. The presented method for detecting word errors uses a candidate generation algorithm and a context-sensitive error correction algorithm. The authors report a significant reduction in system errors.

In [7] phonetic correction strategies are used to correct the errors generated by an ASR system. In the cited work, the system's transcript is transformed into a representation in International Phonetic Alphabet (IPA) format. A sliding window algorithm is used to select candidate phrases for correction according to the words provided in the context and the distance to their phonetic representation in IPA format. The authors report an improvement in 30% of the phrases recognized by the Google service.

An important component of the phonetic correction algorithm is the context used to construct the candidate phrases, so solutions capable of finding optimal configurations among vast search spaces are needed.

Genetic algorithms are stochastic search algorithms based on biological evolution principles; they emulate the evolutionary process through genetic operators applying recombination, mutation, and natural selection to a population [13,18]. They have been applied to solve complex combinatorial problems, and the results show that they constitute a powerful and efficient strategy when used correctly [18]. These types of algorithms have been used to analyze a large number of problems, including the knapsack problem [18], process scheduling problems, the traveling salesperson [13], search for functions for symbolic regression [1], Gaussian kernel functions for sentiment analysis [17], among others.

When using genetic algorithms to solve a problem, possible solutions are expressed as a chain of symbols called a chromosome, where each symbol is a gene. From an initial generation of individuals, the processes of selection, mutation, recombination, and evaluation are iteratively executed, combining the individuals' genes to produce new variations. Each individual is evaluated according to a function called fitness that describes how well it solves the problem.
3. Methodology
The correction algorithm uses the components of a context (a set of words and phrases belonging to the application domain) to detect possible errors in the recognition and correct the transcript of an automatic speech recognition system. It comprises three main elements: phonetic representation, candidate phrase generation for correction, and an edit distance metric. As a measure of evaluation of the results, the Word Error Rate (WER) metric was used, which is defined as follows:

WER = (S + D + I) / N    (1)

where S is the number of substitutions, D the number of deletions, and I the number of insertions required to transform the hypothetical phrase into the actual phrase, and N is the number of words in the actual phrase.

Phonetic transcription is a system of graphic symbols that represent the sounds of human speech. It is used as a convention to avoid the peculiarities of each written language and to represent languages without a written tradition [10]. We use as phonetic representations: plain text, IPA, Double Metaphone (DM), and a variant of Double Metaphone with vowels (DMV).

The IPA is a phonetic notation system based on the Latin alphabet, used as a standardized representation of the sounds of spoken language [6,19]. Metaphone is a phonetic algorithm that indexes words by their pronunciation in the English language [15]. The DM algorithm is an improved version of the Metaphone algorithm, which returns a representation of the sound of the letters in the string when the text is spoken, omitting the vowels. DM has often been used to represent the English language; however, vowel sounds are important in Spanish because they allow the Spanish speaker to link words that end in consonant groups [8], so we developed a variant of DM which adds back the vowels that are removed by the original algorithm.
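As a rough illustration of the DMV idea, the sketch below keeps the Double Metaphone consonant code and re-attaches the vowels of the original word. It assumes the `doublemetaphone` function from the PyPI `metaphone` package; the vowel re-insertion rule is our approximation for illustration, not necessarily the exact variant used in the experiments.

```python
# Sketch of a "Double Metaphone with vowels" (DMV) representation.
# Assumes: pip install metaphone  (provides doublemetaphone)
from metaphone import doublemetaphone

VOWELS = set("aeiou")

def dmv(word: str) -> str:
    """Keep the DM consonant code and append the vowel sequence of the word,
    since vowels carry phonetic weight in Spanish."""
    primary, _secondary = doublemetaphone(word)
    vowels = "".join(ch for ch in word.lower() if ch in VOWELS)
    return primary + vowels.upper()

print(dmv("reconocimiento"))  # DM consonant code followed by the vowels EOOIIEO
```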
During the phonetic correction process, the search for candidate phrases generates segments of the input string that are contrasted, by means of a distance metric, with the words and phrases in the context. A candidate phrase is one that is similar to one of the phrases in the context and that may contain an error in the ASR transcript. The experimentation used the pivotal window and the incremental comparison algorithm (based on the phrase's size in letters or syllables) as algorithms for generating candidate phrases.

In [7] the sliding window strategy is presented, where a set S_j is generated with a window v = 1. The selection of sub-phrases is done using a pivot p_j, and the set S_j of candidate sentences is formed by the sub-phrases {p_j, p_{j-1} p_j, p_j p_{j+1}, p_{j-1} p_j p_{j+1}}.

An incremental sub-phrase search method was implemented for this article, which is described below. Let C = {c_1, ..., c_n} be the set of n context-specific phrases and T = {t_1, ..., t_m} the original transcript divided into m words. The goal is to build a set R = {(s_1, c_1), ..., (s_l, c_l)} formed by pairs (s_i, c_i) such that c_i is an element of the context capable of substituting the segment s_i = t_j ... t_k for some j ≤ k in T. Algorithm 1 uses a strategy in which, for each word t_i in the transcript T, the sub-phrase s to be evaluated is extended word by word while there remain elements of the context comparable to it in size (in letters or syllables) according to the threshold u. The possible substitutions of s by c are added to the set R as long as their distance is less than u. The algorithm's complexity is O(nm), where n is the number of elements in the context and m the size of the transcript in words.

Algorithm 1
Candidate incremental search algorithm
Input:
The context C = {c_1, ..., c_n}, the transcript T = {t_1, ..., t_m}, a distance threshold u, a distance metric function d(a, b).
Output: a set R = {(s_1, c_1), ..., (s_l, c_l)} of candidate substitutions.
1: Calculate the maximum phrase size L_M in C
2: Initialize the set R = {}
3: for i ← 1 ... m do
4:   s ← t_i
5:   j ← i
6:   while j ≤ m and length(s) ≤ L_M − u do
7:     for all c ∈ C such that length(s)(1 − u) ≤ length(c) ≤ length(s)/(1 − u) do
8:       if d(s, c) < u then
9:         Add the pair (s, c) to the set R
10:      end if
11:    end for
12:    s ← s + t_j
13:    j ← j + 1
14:  end while
15: end for
16: return R

The edit distance is used to quantify the difference between two text strings in terms of the number of operations required to transform one string into the other. This work experiments with the Levenshtein and Damerau-Levenshtein distance metrics, and with Optimal String Alignment (OSA).

The Levenshtein distance between two character strings is the number of insertions, deletions, and substitutions required to transform one character string into the other [12]. The Damerau-Levenshtein distance can be intuitively defined as an extension of the Levenshtein distance that adds the transposition of two adjacent characters [3] as a valid operation. OSA is a restrictive variation of the Damerau-Levenshtein distance, where the transpose operation can only be performed once per character [14], which makes it less computationally expensive.
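The following self-contained sketch puts these pieces together: a length-normalized Levenshtein distance and a rough Python version of Algorithm 1 for the LET variant (lengths measured in characters). The function names, the simplified stopping condition, and the length window are ours, chosen for illustration rather than taken from the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def norm_distance(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1] so it can be compared against a threshold u."""
    longest = max(len(a), len(b)) or 1
    return levenshtein(a, b) / longest

def incremental_candidates(context, transcript_words, u=0.4, dist=norm_distance):
    """Rough sketch of Algorithm 1 (LET variant): grow a sub-phrase word by word
    and pair it with context phrases of comparable length whose distance is below u."""
    max_len = max(len(c) for c in context)
    R = []
    for i in range(len(transcript_words)):
        s = transcript_words[i]
        j = i + 1
        while len(s) <= max_len:
            for c in context:
                if len(s) * (1 - u) <= len(c) <= len(s) / (1 - u) and dist(s, c) < u:
                    R.append((s, c))
            if j >= len(transcript_words):
                break
            s = s + " " + transcript_words[j]
            j += 1
    return R

context = ["saldo de mi cuenta", "transferencia bancaria"]
words = "quiero saber el salto de mi cuenta".split()
print(incremental_candidates(context, words, u=0.4))
# should include the pair ('salto de mi cuenta', 'saldo de mi cuenta')
```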
For the generation of contexts, it was decided to use a genetic algorithm constructed from the transcripts of the sentences to be corrected. Each individual represents a possible context. For the construction of the individuals, all the words considered individually and the combinations of two words (bigrams) present in the target sentences of the original audios were taken into account. Individuals were defined by a chromosome where each gene takes the value 1 if the word or bigram is in the context and 0 otherwise.
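As an illustrative sketch of this binary encoding (assuming whitespace tokenization; the helper names are ours), a chromosome can be decoded into a candidate context as follows:

```python
import random

def build_vocabulary(target_sentences):
    """Collect each distinct word and each adjacent bigram from the target sentences."""
    vocab, seen = [], set()
    for sentence in target_sentences:
        words = sentence.split()
        for unit in words + [" ".join(pair) for pair in zip(words, words[1:])]:
            if unit not in seen:
                seen.add(unit)
                vocab.append(unit)
    return vocab

def decode(chromosome, vocab):
    """A chromosome is a 0/1 list over the vocabulary; the genes set to 1 form the context."""
    return [unit for gene, unit in zip(chromosome, vocab) if gene == 1]

vocab = build_vocabulary(["saldo de mi cuenta", "quiero una transferencia"])
chromosome = [random.randint(0, 1) for _ in vocab]   # a random individual
print(decode(chromosome, vocab))
```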
Algorithm 2
Context optimization algorithm
Input:
Population size N, number of generations G, tournament size T_s, crossover probability C_p, and mutation probability M_p.
Output:
The evolved population p and the average error per generation, errors.
1: p ← GenerateInitialPopulation(N), g ← 0
2: while g < G do
3:   errors[g] ← Evaluate(p)
4:   p_t ← Select(p, T_s)
5:   p ← CrossMutate(p_t, C_p, M_p)
6:   g ← g + 1
7: end while
8: return p, errors

In this way, each individual represents a context, which is a potential parameter of the correction algorithm. To evaluate individuals, the correction algorithm was run with the best combination found in [7] on each of the 451 sentences. The total WER obtained with each context analyzed was returned as its measure of fitness. A simple genetic algorithm, described in Algorithm 2, was used, where the function
GenerateInitialPopulation(N) produces N individuals randomly. Evaluate(p) is a function that calculates each individual's WER, assigns it as its measure of fitness, and returns the average WER of the population. Selection was made with a simple tournament strategy. The function CrossMutate(p_t, C_p, M_p) performs genetic recombination between the individuals of the population using the random crossing point technique shown in Algorithm 3. Subsequently, the individual-by-individual and gene-by-gene mutation process was carried out according to the mutation probability value, which was reduced every ten generations to reduce fluctuation and stabilize the error as it approached a minimum.

Algorithm 3
Crossing operation between individuals
Input:
Individuals I_1 and I_2 to be combined, a crossing point c_i, and a chromosome size c_size.
Output:
The offspring of the input individuals, defined as H_1 and H_2 (g^(k) denotes a gene of individual I_k).
1: H_1 ← g^(1)_1 g^(1)_2 ... g^(1)_{c_i} g^(2)_{c_i+1} g^(2)_{c_i+2} ... g^(2)_{c_size}
2: H_2 ← g^(2)_1 g^(2)_2 ... g^(2)_{c_i} g^(1)_{c_i+1} g^(1)_{c_i+2} ... g^(1)_{c_size}
3: return H_1, H_2
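A minimal sketch of the single-point crossover of Algorithm 3 together with the per-gene mutation step, assuming the 0/1 encoding described above:

```python
import random

def crossover(parent1, parent2, crossing_point=None):
    """Single random-point crossover: children swap gene segments after the crossing point."""
    if crossing_point is None:
        crossing_point = random.randrange(1, len(parent1))
    child1 = parent1[:crossing_point] + parent2[crossing_point:]
    child2 = parent2[:crossing_point] + parent1[crossing_point:]
    return child1, child2

def mutate(chromosome, mutation_prob):
    """Flip each gene independently with probability mutation_prob."""
    return [1 - g if random.random() < mutation_prob else g for g in chromosome]

p1 = [0, 1, 1, 0, 1]
p2 = [1, 0, 0, 1, 0]
c1, c2 = crossover(p1, p2, crossing_point=2)
print(c1, c2)            # [0, 1, 0, 1, 0] [1, 0, 1, 0, 1]
print(mutate(c1, 0.05))  # c1 with roughly 5% of its genes flipped
```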
4. Experiments
The experiments in the present work were carried out using 451 phrases transcribed by the Google speech recognition system. This same corpus was used in [7], where details of the corpus collection process and format can also be found.

To compare the variants of each element of the algorithm, a total of 72 experiments were carried out with all the combinations of the methods presented in Table 1.

Table 1: Method variants for each element of the algorithm.
Representation   Phrases generation   Distance metric       Google STT
Simple text      WIN                  Levenshtein           Basic
IPA              LET                  OSA                   Contextual
DM               SYL                  Damerau-Levenshtein
DMV
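The 72 experiments appear to correspond to the Cartesian product of the four columns of Table 1 (4 × 3 × 3 × 2 = 72), which can be enumerated as follows:

```python
from itertools import product

representations = ["Simple text", "IPA", "DM", "DMV"]
candidate_generation = ["WIN", "LET", "SYL"]
distance_metrics = ["Levenshtein", "OSA", "Damerau-Levenshtein"]
stt_modes = ["Basic", "Contextual"]

# Every combination of representation, candidate generation, metric and STT mode.
configurations = list(product(representations, candidate_generation,
                              distance_metrics, stt_modes))
print(len(configurations))  # 4 * 3 * 3 * 2 = 72
```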
The text of the phrase recognized by the STT system and the context's phrases were represented in different ways for processing. For all text representations, it was necessary to carry out a normalization process to remove some punctuation symbols and characters. This normalized version was used directly in the form of simple text or transformed into one of the analyzed phonetic representations: International Phonetic Alphabet (IPA), Double Metaphone (DM), or Double Metaphone with vowels (DMV).

The generation of candidate sentences was tested using the pivot window method (WIN) and the incremental comparison according to the number of characters (LET) or syllables (SYL).

The Levenshtein distance and its variants OSA and Damerau-Levenshtein were used as edit distance metrics. Each combination was tested using as input data the 451 transcripts obtained with the basic Google method and, subsequently, the transcripts resulting from sending the context reported in [7] to the Google service.

For each of the 72 configurations, different confidence thresholds of the edit metrics were tested in increments of 0.05 up to a maximum of 0.6. The evaluation of each experimental setup was carried out using the globally accumulated WER metric, calculated from the number of edits required to transform the hypothetical transcript into the correct sentence for each example.
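A sketch of this evaluation, with a word-level edit distance for the accumulated WER and a hypothetical correct_transcript(transcript, context, u) standing in for the phonetic corrector:

```python
def word_edit_ops(hyp: str, ref: str) -> int:
    """Word-level Levenshtein distance: the S + D + I operations needed to turn hyp into ref."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        curr = [i]
        for j, rw in enumerate(r, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (hw != rw))) # substitution
        prev = curr
    return prev[-1]

def global_wer(pairs):
    """Globally accumulated WER: total edit operations over total reference words."""
    total_edits = sum(word_edit_ops(hyp, ref) for hyp, ref in pairs)
    total_words = sum(len(ref.split()) for _, ref in pairs)
    return total_edits / total_words

def threshold_sweep(transcripts, references, context, correct_transcript):
    """Evaluate one configuration for thresholds 0.05, 0.10, ..., 0.60."""
    results = {}
    for step in range(1, 13):
        u = round(step * 0.05, 2)
        corrected = [correct_transcript(t, context, u) for t in transcripts]
        results[u] = global_wer(list(zip(corrected, references)))
    return results
```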
The experimentation described in the previous section was carried out with an empirically generated context based on a priori knowledge of the domain phrases where transcription errors were observed. In order to optimize the context, a genetic algorithm was run whose parameters were calibrated by conducting 30 executions on a reduced version of the problem using a chromosome of size 50. The best results were obtained with a population of 50 individuals, 100 generations, a 95% probability of crossover, and a 5% probability of mutation.
Once the parameters had been calibrated, the context was optimized using a chromosome size of 355. Each gene represents one of the words or bigrams present in the transcripts of the audios used in the experimentation. The mutation factor was reduced by 20% every 10 generations. Individuals were evolved for 100 generations.

As the fitness function, the total WER obtained when executing the correction algorithm on the simple Google transcription, using the individual defined by the chromosome as context, was used. The phonetic corrector was run using the IPA representation, pivot window selection, Levenshtein distance, and a threshold of 0.4, which was the best configuration reported in [7].

Five evolutionary processes of 100 generations were executed, where the population was initialized with the 25 best individuals from the previous round and 25 randomly generated individuals, in order to explore different evolutionary variants. The experimentation was done with an Intel i7 processor, 8 GB of RAM, and a Debian GNU/Linux operating system. The algorithm was implemented in Python 3 and ran five times for a total of 70 hours.
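A sketch of this fitness evaluation and of the mutation-decay schedule; correct_transcript and total_wer are placeholders for the corrector and the accumulated-WER computation, and the multiplicative 0.8 factor is our reading of "reduced by 20% every ten generations":

```python
def decode(chromosome, vocab):
    """Genes set to 1 select the words/bigrams that form the candidate context."""
    return [unit for gene, unit in zip(chromosome, vocab) if gene == 1]

def fitness(chromosome, vocab, transcripts, references, correct_transcript, total_wer):
    """Fitness of an individual: accumulated WER of the corrected transcripts when the
    decoded context is fed to the phonetic corrector (lower is better). The corrector is
    assumed to run with IPA, pivot window, Levenshtein distance and threshold u = 0.4."""
    context = decode(chromosome, vocab)
    corrected = [correct_transcript(t, context, 0.4) for t in transcripts]
    return total_wer(list(zip(corrected, references)))

def mutation_probability(generation, initial=0.05, every=10, factor=0.8):
    """Mutation factor reduced every ten generations to stabilize the error (decay assumed multiplicative)."""
    return initial * factor ** (generation // every)
```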
In this phase of the experimentation, tests were carried out to measure the effects of context on speech recognition and subsequent correction. The previously generated audio files were sent back to the recognition system in its basic mode and with the genetically generated context. This gave us a direct comparison of the effect of using the optimized context with respect to the transcript obtained without sending a context; in addition, it gave us two baselines on which to apply the correction process to the new transcripts.

To compare the results of the correction process, the algorithm was applied to the new transcripts produced by both versions of the Google recognizer. The four configuration variants that presented the best results in the experimentation described in Section 4.1 were used. Taking as input the transcripts obtained by the two modalities of the Google service, two experiments were executed for each one of them. In the first, the experimental context used in [7], which we denote as C_m, was used; in the second experiment, the new genetically generated context C_g was used.
5. Results
Fig. 1 (a) shows the variation in the average WER obtained in the experiments, grouped by representation mode, with different distance thresholds. The horizontal lines correspond to the WER obtained with the transcription of the basic STT (33.7%) and the contextual STT (31.1%). The effect of reducing the WER is observed when transforming the plain text into IPA, especially around the threshold value 0.4, where a minimum average WER of 27.8% is reached. On the other hand, the representation in DM produced results similar to the other methods for small values of the threshold.

Fig. 1: Average WER for different text representations (a) and candidate generation methods (b)
In Fig. 1 (b), the average WER obtained by the different experimental configurations, grouped by candidate phrase generation algorithm, is shown. The graph shows a better performance of the incremental comparison variants, either by size in letters or in syllables, compared to the pivot window. The minimum average WER (29.4%) is reached with a threshold of 0.4 for the LET method. The SYL version shows very similar results; however, its computational cost is higher.

The best results were obtained with the configuration using the IPA representation and the LET candidate selection method. The WER obtained from the basic transcript decreased from 33.7% to 28.1%, and for the contextual transcript it decreased from 31.1% to 27.3%. This configuration presented a global reduction in the relative WER of 19.3%. Concerning the three distance metrics evaluated, the difference in the results was practically nil.

The results obtained from the experimentation with the genetic algorithms for the optimization of the context, using the experimental configuration described in Section 4.2, indicate that an average error of 26.5% was obtained in the fifth execution of the experiment, which started with an average error of 31.2% and decreased to an average of 24.9% in generation 100.

Fig. 2 shows the average errors of the 100 generations in the fifth run of the experiment. The best context found with this strategy contained 64 unigrams and 117 bigrams, with a total WER of 24.7%.

Fig. 2: Graph of the average error per generation of the genetic algorithm
Fig. 3: Results using C_m as input to the correction algorithm: (a) pivot window, basic STT; (b) character size, basic STT; (c) pivot window, contextual STT; (d) character size, contextual STT

In the final phase of the experimentation, described in Section 4.3, two baselines are obtained on which to execute the correction process. The WER obtained when comparing the actual spoken sentences with the basic transcription was 32.0%, while when incorporating the genetically generated context the WER was considerably reduced to 23.2%. This result shows the impact that an optimized context has on the language model used by Google, reducing the relative WER by 27.3%.

Fig. 3 shows the results of the correction algorithm using the IPA representation with the context C_m. Starting from the basic STT, the minimum WER is obtained with a threshold of 0.4 and the LET selection process. With this configuration, the total WER is reduced from 32.0% to 26.6%, representing a reduction of 16.9% in the relative WER. When starting the correction process from the transcription of the contextual STT, the WER decreased from 23.2% to 21.0%, a 9.5% improvement in the relative WER.

Similarly, Fig. 4 shows the results using the genetically generated context C_g as input to the correction algorithm. Using this context, a reduction of the minimum WER is observed when using the WIN candidate generation procedure; however, it does not seem to produce good results with the LET method. The absolute minimum WER for the basic STT is 25.3%, a reduction of 21.0% in the relative WER. Starting from the contextual STT, a minimum of 20.1% is reached, representing a reduction of 13.6% in relative WER.

Fig. 4: Results using C_g as input to the correction algorithm: (a) pivot window, basic STT; (b) character size, basic STT; (c) pivot window, contextual STT; (d) character size, contextual STT
6. Conclusions and future work
From the results obtained in the experimentation, the usefulness of the phonetic correction algorithm in reducing errors in the Google transcription, both in its basic and contextual versions, is shown. It is observed that the best configuration for the algorithm is obtained using IPA as phonetic representation and incremental selection by letters, managing to reduce the relative WER by 19.0%.

Similarly, we can state that genetic algorithms are an efficient alternative for generating contexts, since they managed to reduce the WER of the basic Google transcription from 32.0% to 23.2%. The context was shown to have a crucial effect on the performance of the algorithm.

The best results were obtained from the combination of the phonetic correction with the evolutionary optimization of the context, achieving a reduction of the absolute WER of 11.9 percentage points by decreasing it from 32.0% to 20.1%, representing an improvement in relative WER of 37.2%.

The fact that both the phonetic correction algorithm and the evolutionary context optimization are independent of the system used for transcription and of the application domain means that the presented strategy can be extended to different ASR systems and application domains.

The algorithms presented throughout this article can take advantage of a priori knowledge of the application domain to mitigate the cold start problem. If initial transcripts are not available, a context generated with human knowledge of the domain can be used, as in [7], which can then be complemented with genetic algorithms as information is collected about interactions with actual users of the system.

Among future research lines, it is necessary to validate the results with corpora from different application domains, and to experiment with weighted edit costs that consider phonetic characteristics of Spanish and properties of the original audio such as noise, duration, and signal energy, among others. Another line of research is the comparison with deep learning algorithms, since the problem of error correction in ASR systems can be considered a translation from erroneous transcripts to correct transcripts, so machine translation algorithms can be helpful.
References
1. Anjum, A., Sun, F., Wang, L., Orchard, J.: A novel continuous representation of genetic programmings using recurrent neural networks for symbolic regression. CoRR abs/1904.03368 (2019), http://arxiv.org/abs/1904.03368
2. Bae, B., Kang, S.S., Hwang, B.Y.: Edit distance calculation by phonetic rules and word-length normalization. Advances in Computer Science (1), 315–319 (2012)
3. Bard, G.V.: Spelling-error tolerant, order-independent pass-phrases via the Damerau-Levenshtein string-edit distance metric. In: Proceedings of the Fifth Australasian Symposium on ACSW Frontiers - Volume 68, pp. 117–124. ACSW '07, Australian Computer Society, Inc., Darlinghurst, Australia (2007), http://dl.acm.org/citation.cfm?id=1274531.1274545
4. Bassil, Y., Semaan, P.: ASR context-sensitive error correction based on Microsoft n-gram dataset. CoRR abs/1203.5262 (2012)
5. Becerra, A., de la Rosa, J.I., González, E.: A case study of speech recognition in Spanish: From conventional to deep approach. In: 2016 IEEE ANDESCON, pp. 1–4 (Oct 2016). https://doi.org/10.1109/ANDESCON.2016.7836212
6. MacMahon, M.K.C.: Phonetic notation. The World's Writing Systems, pp. 821–846 (1996)
7. Campos Sobrino, D., Campos Soberanis, M.A., Martínez Chin, I., Uc Cetina, V.: Corrección de errores del reconocedor de voz de Google usando métricas de distancia fonética. Research in Computing Science (In Press) (04 2018)
8. Chela-Flores, B.: Consideraciones teórico-metodológicas sobre la adquisición de consonantes posnucleares del inglés. RLA. Revista de lingüística teórica y aplicada (12 2006). https://doi.org/10.4067/S0718-48832006000200002
9. Droppo, J., Acero, A.: Context dependent phonetic string edit distance for automatic speech recognition. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4358–4361 (March 2010). https://doi.org/10.1109/ICASSP.2010.5495652
10. Hualde, J.: The Sounds of Spanish. Cambridge University Press (2005)
11. Kondrak, G.: Phonetic alignment and similarity. Computers and the Humanities (3), 273–291 (Aug 2003). https://doi.org/10.1023/A:1025071200644
12. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady (8), 707–710 (Feb 1966); Doklady Akademii Nauk SSSR, V163 No4, 845–848 (1965)
13. Luo, J., Baz, D.E.: A survey on parallel genetic algorithms for shop scheduling problems. CoRR abs/1904.04031 (2019), http://arxiv.org/abs/1904.04031
14. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys (1), 31–88 (Mar 2001). https://doi.org/10.1145/375360.375365
15. Philips, L.: Hanging on the metaphone. Computer Language Magazine (12), 39–44 (December 1990)
16. Pucher, M., Türk, A., Ajmera, J., Fecher, N.: Phonetic distance measures for speech recognition vocabulary and grammar optimization. In: 3rd Congress of the Alps Adria Acoustics Association (2007)
17. Roman, I., Mendiburu, A., Santana, R., Lozano, J.A.: Sentiment analysis with genetically evolved Gaussian kernels. CoRR abs/1904.00977 (2019), http://arxiv.org/abs/1904.00977
18. Shah, S.: Genetic algorithm for a class of knapsack problems. CoRR abs/1903.03494 (2019), http://arxiv.org/abs/1903.03494