Hybrid phonetic-neural model for correction in speech recognition systems
Rafael Viana-Cámara, Mario Campos-Soberanis, Diego Campos-Sobrino
SoldAI Research, Calle 22 No. 202-O, García Ginerés, 97070 Mérida, México
{rviana,mcampos,dcampos}@soldai.com

Abstract.
Automatic speech recognition (ASR) is a relevant area in multiple settings because it provides a natural communication mechanism between applications and users. ASRs often fail in environments that use language specific to particular application domains. Some strategies have been explored to reduce errors in closed ASRs through post-processing, particularly automatic spell checking and deep learning approaches. In this article, we explore using a deep neural network to refine the results of a phonetic correction algorithm applied to a telesales audio database. The results exhibit a reduction in the word error rate (WER), both in the original transcription and in the phonetic correction, which shows the viability of deep learning models together with post-processing correction strategies to reduce errors made by closed ASRs in specific language domains.
Keywords: Speech recognition, Phonetic correction, Deep neural networks.
1 Introduction

Although automatic speech recognition (ASR) systems have matured to the point of having some quality commercial implementations, the high error rate they present in specific domains prevents this technology from being widely adopted [2]. The preceding has led to ASR correction being extensively studied in the specialized literature. Traditional ASRs are made up of three relatively independent modules: an acoustic model, a dictionary model, and a language model [12]. In recent times, end-to-end deep learning models have also gained momentum, in which the modular division of a traditional system is not clear [4]. ASRs in commercial contexts are often distributed as black boxes where users have little or no control over the language recognition model, preventing them from optimizing it using their own audio data. That situation makes post-correction models the paradigm used to deal with errors produced by general-purpose ASRs [3]. In specialized language environments where out-of-vocabulary (OOV) terms are frequently found, contextual word recognition is of utmost importance, and the degree of customization of the models depends on the ASR's capabilities to adapt to the context. Different methodologies have been tried to perform post-processing correction of closed ASRs, including language models and phonetic correction.

This article presents a method for post-processing correction in ASR systems applied to specific domains, using a Long Short-Term Memory (LSTM) neural network that receives as input attributes the output of a phonetic correction process, the original ASR transcription, and the hyperparameters of the correction algorithm.
Next, the contribution of the neural correction is highlighted for the generation of a hybrid algorithm that considers both the phonetic correction and its post-correction, which results in an effective strategy to reduce the error in speech recognition.

The article is structured as follows: Section 2 describes the background to the problem and related work; Section 3 presents the research methodology; Section 4 describes the experimental work carried out, with its results presented in Section 5. Finally, conclusions and lines of experimentation for future work are provided in Section 6.
2 Background and related work

The post-correction problem in ASR has been approached from different perspectives. In general, we can talk about three different types of errors that occur in audio recognition: substitution, where a word in the original speech is transcribed as a different word; deletion, in which a word from the original speech is not present in the transcript; and insertion, where a word that does not appear in the original speech appears in the transcription [2].

There have been several research efforts aimed at correcting ASR errors using post-processing techniques; in particular, a significant number of these initiatives involve user feedback mechanisms to learn error patterns [2]. Among the strategies to learn these error patterns, reducing ASR post-correction to a spelling-correction problem has been considered. The article [15] proposes a transformer-based spell-checking model to automatically correct errors, especially substitutions, made by a Mandarin speech recognition system based on
Connectionist Temporal Classification (CTC). The project was carried out using recognition results generated by the CTC-based systems as input and the ground-truth transcripts as output to train a transformer with an encoder-decoder architecture, which is very similar to machine translation. Results obtained on a 20,000-hour Mandarin speech recognition task show that the spell-checking model proposed in the article can achieve a Character Error Rate (CER) of 3.41%. This result corresponds to a relative improvement of 22.9% and 53.2% compared to the baseline systems that use CTC decoding with and without a language model, respectively.

A versatile post-processing technique based on phonetic distance is presented in [13]. This article integrates domain knowledge with open-domain ASR results, leading to better performance. In particular, the presented technique is able to use domain restrictions with various degrees of domain knowledge, ranging from pure vocabulary restrictions through grammars or n-grams to restrictions on acceptable expressions.

A model of ASR as a noisy transformation channel is presented by Shivakumar et al. [12], where a correction system is proposed that is capable of learning from the aggregated errors of all the independent ASR modules and trying to correct them. The proposed system uses long-term context by means of a neural network language model and can better choose between the possible transcriptions generated by the ASR and reintroduce previously pruned or unseen phrases (those outside the vocabulary). It provides corrections under low-throughput ASR conditions without degrading any accurate transcripts; such corrections may include out-of-domain and mismatched transcripts. The system discussed in the article provides consistent improvements over the baseline ASR, even when it is optimized through rescoring with the recurrent neural network (RNN) language model.
The results demonstrate that any ASR enhancement can be exploited independently and that the proposed system can still provide benefits in highly optimized recognition systems. The benefit of the neural network language model is evidenced by the use of 5-grams, allowing a relative improvement of 1.9% over baseline-1.

In the article [10], the distortion in name spelling due to the speech recognizer is modeled as the effect of a noisy channel. It follows the IBM translation models framework, where the model is trained using a parallel text with subtitles and automatic speech recognition output. Tests are also performed with a method based on string edit distance. The effectiveness of the models is evaluated on a name query retrieval task. The methods presented in the article result in a 60% improvement in F-score.

A noise-robust word embedding model is proposed in [8]. It outperforms commonly used models like fastText [7] and Word2vec [9] in different tasks. Extensions of modern models are proposed for three subsequent tasks, namely text classification, named entity recognition, and aspect extraction; these extensions show an improvement in robustness to noise over existing solutions for different NLP tasks.

In [1], phonetic correction strategies are used to correct errors generated by an ASR system. The cited work converts the ASR transcription to a representation in the International Phonetic Alphabet (IPA) format. The authors use a sliding-window algorithm to select candidate sentences for correction, with a candidate selection strategy for contextual words. The domain-specific words are provided by a manually generated context and the edit distance between their phonetic representations in IPA format. The authors report an improvement in 30% of the phrases recognized by Google's ASR service.

In [14], an extension of the previous work is presented, experimenting with the optimization of the context generated employing genetic algorithms.
The authors show the performance of variants of the phonetic correction algorithm using different methods of representation and selection of candidates, as well as different word contexts genetically evolved from the real transcripts of the audios. According to the authors, the phonetic correction algorithm's best performance was observed using IPA as the phonetic representation and an incremental selection by letters, achieving a relative WER improvement of 19%.

The present work explores a neural approach that rectifies the corrections suggested by a configurable phonetic correction algorithm. Various settings of the checker were experimented with, using different phonetic representations of the transcriptions and modifying other parameters. The corrections proposed by this algorithm are evaluated using a classifier generated by an LSTM neural network with a binary output that indicates whether the correction offered by the phonetic correction algorithm should be applied. The classifier receives as parameters the original ASR transcript, the correction suggestion offered by the algorithm, and its hyperparameters, calculating a binary output. The previous is done to reduce the number of erroneous corrections made by the algorithm, allowing to improve the quality of the correction in black-box ASR approaches without the need to access the acoustic or language models generated by the original ASR.

3 Methodology

A corrective algorithm based on the phonetic representation of transcripts generated by the
Google speech recognition system was used. As a source for the transcripts, audios collected from a beverage telesales system currently in production with Mexican users were employed. The actual transcripts of the examples were used as a corpus to generate examples with the original ASR transcript, as well as the proposed correction, labeled in binary form, where 1 represents that the proposed correction should be made and 0 indicates the opposite. For labeling, the WER of the ASR's hypothetical transcript and the WER of the proposed correction were calculated. In both cases, the WER was computed with respect to the real transcript generated by a human, and it was considered that the correction should be made when the WER of the corrected version is less than the WER of the ASR transcript. The database was augmented with transcription variants produced by the phonetic checker when used with different parameters. This augmented database was used to train a classifier generated by an LSTM neural network whose objective is to produce a binary output that indicates if the proposed correction is recommended.
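The labeling procedure above can be sketched in a few lines. The following is an illustrative Python implementation of word-level WER (Levenshtein distance over tokens) and the binary labeling rule, not the authors' exact code; the example phrases are hypothetical.

```python
# Word error rate (WER) via word-level Levenshtein distance, and the
# binary labeling rule: label 1 when the proposed correction lowers the
# WER against the human reference transcript.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

def label(reference, asr_hypothesis, proposed_correction):
    """1 if the correction should be applied, 0 otherwise."""
    return 1 if wer(reference, proposed_correction) < wer(reference, asr_hypothesis) else 0
```

For instance, with reference "quiero dos coca colas" and ASR hypothesis "quiero dos casas solas" (two substitutions, WER 0.5), a candidate correction equal to the reference receives label 1.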
The sample audios were collected during calls to the telesales system attended by a smart agent. In these calls, users issued phrases ordering various products in different sizes and presentations, as well as natural expressions typical of a sales interaction, e.g., confirmations or prices. As part of the process, the transcription of the user's voice to text is required for subsequent analysis by the system; for this task, Google's ASR service is used. The actual transcription of each phrase was carried out by human agents and served as a baseline to evaluate the hypothetical transcripts of the ASR using the Word Error Rate (WER) metric, which is considered the standard for ASR [2].

3.2 Preprocessing
A text normalization pre-processing step was necessary to minimize the effect of lexicographic differences and facilitate the phonetic comparison between the ASR's hypothetical transcripts and the actual utterances. The pre-processing included cleaning symbols and punctuation marks, converting the text to lowercase, converting numbers to text, and expanding abbreviations.

The initial cleaning stage aims to eliminate existing noise in transcripts and reduce characters to letters and digits. For their part, the last two stages of pre-processing have the effect of expanding the text to an explicit form that facilitates its phonetic conversion, which helps the checker's performance.
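A minimal sketch of this normalization pipeline follows; the number and abbreviation tables are illustrative stand-ins, since the actual lists used are not given in the paper.

```python
import re

# Normalization sketch: strip symbols and punctuation, lowercase,
# expand digits to words, expand abbreviations. The NUMBERS and
# ABBREVIATIONS tables below are illustrative, not the authors' lists.

NUMBERS = {"1": "uno", "2": "dos", "3": "tres"}
ABBREVIATIONS = {"lt": "litro", "ml": "mililitros"}

def normalize(text):
    text = text.lower()
    # drop punctuation but keep letters (including accented ones) and digits
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = []
    for tok in text.split():
        tok = NUMBERS.get(tok, ABBREVIATIONS.get(tok, tok))
        tokens.append(tok)
    return " ".join(tokens)
```

For example, `normalize("¡Quiero 2 Coca-Colas de 1 lt!")` yields "quiero dos coca colas de uno litro", an explicit form that is easier to convert to a phonetic representation.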
For the development of this research, the phonetic correction algorithm (PhoCo) described in [1] was used, which consists of transforming the transcribed text to a phonetic representation and comparing segments of it with phonetic representations of common words and phrases in the application domain for possible replacement. These words and phrases are called the context. The comparison is made using a Levenshtein distance similarity threshold that determines whether a correction is suggested or not. Phonetic transcription is a system of graphic symbols representing the sounds of human speech. It is used as a convention to avoid the peculiarities of each written language and to represent languages without a written tradition [6]. Among the phonetic representations used are the International Phonetic Alphabet (IPA) and a version of Worldbet (Wbet) [5] adapted to Mexican Spanish. In the same way, the algorithm allows the use of different candidate selection strategies. For this article, the sliding-window (Win) and incremental selection by characters (Let) configurations were used, as described in [14].

A neural network was used to discover error patterns in the phonetic correction. The network receives as input the original ASR transcription and the candidate correction phrase provided by the PhoCo, together with the algorithm's hyperparameters. The neural network output is a binary number that indicates whether the proposed correction should be made. Neural networks, particularly recurrent ones, have been used effectively in text-pattern discovery and classification tasks, so it was decided to model the phonetic correction algorithm's rectification process using a neural network. The neural network architecture was designed to strengthen the detection of word patterns and the monitoring of dependencies in the short and long term, for which a composite topology was generated as follows:

– A layer of embeddings of size 128
– One LSTM layer of 60 hidden units
– A Max pooling layer
– A dense layer of 50 hidden units
– A dense sigmoid activation layer of 1 unit

The architecture used is illustrated in Fig. 1, which shows the processing of the different layers of the network until producing a binary output, by means of a single neuron with sigmoid activation.
Fig. 1. Neural classifier model
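Since the paper does not report the vocabulary size, a back-of-envelope parameter count for this topology can be made under an assumed vocabulary of 10,000 tokens.

```python
# Rough parameter count for the topology listed above, assuming a
# hypothetical vocabulary of 10,000 tokens (not reported in the paper).

VOCAB = 10_000
EMB = 128     # embedding dimension
LSTM_H = 60   # LSTM hidden units
DENSE = 50    # dense layer units

embedding = VOCAB * EMB
# an LSTM has 4 gates, each with input, recurrent, and bias weights
lstm = 4 * (EMB * LSTM_H + LSTM_H * LSTM_H + LSTM_H)
# max pooling over time has no parameters; it outputs a LSTM_H-dim vector
dense = LSTM_H * DENSE + DENSE
output = DENSE * 1 + 1

total = embedding + lstm + dense + output  # dominated by the embeddings
```

Under this assumption the embeddings dominate (1,280,000 of roughly 1.33M parameters), while the LSTM contributes 45,360 and the dense layers about 3,100, a compact model for the augmented database described later.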
First, an input layer receives the dictionary-indexed representation of the hypothetical phrase from the ASR, as well as the correction suggestion, and a numerical value that indicates the threshold used by the PhoCo to produce its candidate correction. These inputs are passed to an embeddings layer, which adds a dense representation of the words that captures syntactic and semantic properties, which have proven useful in a large number of Natural Language Processing (NLP) tasks [11]. Next, the dense representations are sent to an LSTM layer, which has important properties in long-term dependency management thanks to its internal update and forget gates, which are extremely useful in detecting sequential text patterns. The Max pooling layer works like a simplified attention mechanism, sampling the dependencies and entities with the highest activation from the LSTM, promoting the detection of important characteristics in different positions in the text, which helps to reduce the amount of data needed to train the model. The result is then passed through a fully connected dense layer of 50 neurons with ReLU activations to calculate functions composed of the most relevant features sampled from the LSTM. Finally, it is passed to a single-neuron output layer with a sigmoid activation function, as recommended for binary classification. A binary cross-entropy loss function was used, and the Adam optimization strategy was chosen to adjust the learning rate adaptively.

3.5 Hybrid phonetic-neural algorithm
The hybrid algorithm was performed by applying the neural correction described in section 4.3 to the phonetic correction algorithm presented in section 4.2. This process's central idea is to provide a control mechanism for the possible erroneous substitutions that the phonetic correction algorithm could carry out. This approach allows more aggressive correction strategies to be adopted by setting the threshold of the standard phonetic correction algorithm to a higher value and controlling possible correction errors (false positives). The algorithm consists of performing the phonetic correction in the standard way and then evaluating the candidate correction, together with the original ASR transcription and the phonetic algorithm's hyperparameters, in the neural classifier. If the neural classifier predicts a value greater than 0.5, the correction is carried out; otherwise, the ASR transcription is used.
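The hybrid decision rule can be sketched as follows. This is an illustrative simplification: plain text stands in for the phonetic representation (the real PhoCo compares IPA or Worldbet strings), `difflib` similarity stands in for normalized Levenshtein similarity, and `classifier` is a placeholder for the trained LSTM model.

```python
from difflib import SequenceMatcher

# Sliding-window corrector gated by a classifier, sketching the hybrid
# algorithm: PhoCo proposes a candidate, the classifier accepts or rejects it.

def similarity(a, b):
    """Normalized string similarity (1.0 = identical)."""
    return SequenceMatcher(None, a, b).ratio()

def phoco_correct(transcript, context, threshold):
    """Replace the window most similar to a context phrase, if similar enough."""
    words = transcript.split()
    best = (0.0, None, None)  # (similarity, window span, context phrase)
    for phrase in context:
        n = len(phrase.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            s = similarity(window, phrase)
            if s > best[0]:
                best = (s, (i, i + n), phrase)
    s, span, phrase = best
    if span is not None and s >= 1.0 - threshold:
        i, j = span
        words[i:j] = phrase.split()
    return " ".join(words)

def hybrid_correct(transcript, context, threshold, classifier):
    """Apply PhoCo's candidate only when the classifier predicts above 0.5."""
    candidate = phoco_correct(transcript, context, threshold)
    # the classifier receives the ASR transcript, the candidate, and the
    # PhoCo threshold, and outputs a probability that the correction helps
    return candidate if classifier(transcript, candidate, threshold) > 0.5 else transcript
```

With context `["coca cola"]` and threshold 0.3, the hypothesis "quiero una coca sola" yields the candidate "quiero una coca cola"; whether it is applied then depends solely on the classifier's prediction.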
4 Experiments

This section shows the methods used for the neural classifier training, the experimentation with the classical version of the phonetic correction algorithm, and the hybrid version using the neural classifier's output as the deciding factor to accept the proposed phonetic correction. The implemented mechanisms are illustrated as described in section 3 of the document.
A total of 320 audio files were used as the data source for the experimentation. For each audio, two transcripts were generated using Google's ASR, with and without context, and those were stored in a database also containing the manually made transcription. Thus, the database contains two hypothetical ASR phrases generated for each audio and their actual transcription to evaluate the system. Next, different correction hypotheses were made for each audio example using various PhoCo configurations. The threshold parameter was varied between 0.0 and 0.6 with a step of 0.05; the type of representation as IPA, plain text, and Wbet; and the selection method as sliding window or incremental by characters. In this way, 144 possible corrections were generated for each audio, producing an augmented database of 46,080 examples to train the neural classifier. The settings used are described in [14]. A binary label was added, set to 1 when the proposed correction's WER is less than the WER of the ASR hypothesis and 0 otherwise. Records set to 1 indicate that the proposed correction positively affects the WER.
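The reported totals are consistent with the following configuration grid; the exact threshold endpoints are an assumption, but 12 threshold values, 3 representations, 2 selection methods, and 2 ASR hypotheses per audio reproduce the 144 corrections per audio and 46,080 total examples.

```python
from itertools import product

# Configuration grid consistent with the reported database sizes.
# The threshold endpoints (0.05 .. 0.60) are an assumption inferred
# from the totals, not stated explicitly in the paper.

thresholds = [round(0.05 * k, 2) for k in range(1, 13)]  # 0.05 .. 0.60
representations = ["ipa", "plain", "wbet"]
selection = ["window", "letters"]
asr_hypotheses = ["with_context", "without_context"]

grid = list(product(thresholds, representations, selection, asr_hypotheses))
per_audio = len(grid)     # 144 correction configurations per audio
total = 320 * per_audio   # 46,080 training examples
```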
Each ASR-produced transcript in the training data was used as a source for a corrective post-processing procedure based on phonetic text transcription. Said correction method was used with different variants and parameters. Multiple results were obtained for each example transcript and recorded in the training database, augmented with the strategy presented in section 4.1.

The threshold parameter was varied using a grid search technique in the range from 0 to 0.6 in steps of 0.05. For the representation mode, three variants were used: IPA, plain text, and Wbet. These variations in the phonetic checker parameters gave rise to variations in the results, which were accumulated in the database.
For the neural classifier training, the augmented database described in section 4.1 was divided into random partitions of 80% for training, 10% for validation, and 10% for testing. The training set was used to generate different neural network models, observing metrics of accuracy, precision, and recall on the training and validation sets, as well as the area under the curve (AUC) of the Receiver Operating Characteristic (ROC). This metric balances the rates of true and false positives and provides a performance criterion for rating systems. Different models were iterated using dropout regularization with different probability parameters. Once the best model on the validation set was obtained, it was evaluated on the test dataset to report the metrics of accuracy, precision, recall, and F-score presented in section 5.1. The models were implemented using TensorFlow 2.0 and Keras, running on a Debian GNU/Linux 10 (buster) x86_64 operating system equipped with an 11 GB Nvidia GTX 1080 Ti GPU.

The experimentation with the neural phonetic algorithm was carried out once the neural classifier had been trained. The individual WER of the ASR sentences, the phonetic correction candidates, and the neural phonetic model output were thoroughly examined for all the database examples. The average WER of the sentences was then analyzed for each of the different thresholds used to generate the phonetic correction. In the results presented in section 5.2, the respective mean WER is reported, along with the relative WER reductions evaluated against the original transcript.
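The 80/10/10 partition can be sketched as follows; the paper's reported counts of 36,863 and 4,607 suggest slightly different rounding than this sketch produces.

```python
import random

# Sketch of the random 80/10/10 train/validation/test split described
# above. Shuffling before partitioning gives random, disjoint sets.

def split_dataset(examples, seed=0):
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n = len(examples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = [examples[i] for i in idx[:n_train]]
    val = [examples[i] for i in idx[n_train:n_train + n_val]]
    test = [examples[i] for i in idx[n_train + n_val:]]
    return train, val, test
```

Applied to the 46,080 augmented examples, this yields partitions of 36,864 / 4,608 / 4,608 examples.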
5 Results

This section shows the neural classifier training results, as well as the comparisons between the classic and hybrid versions of the phonetic correction algorithm, illustrating the average WER values obtained from the ASR transcription, the phonetic correction, and the phonetic-neural correction.

5.1 Neural classifier
The deep neural network was trained for two epochs with a mini-batch size of 64, using 36,863 examples obtained with the procedures described in sections 4.1 and 4.3.

In Fig. 2, the graphs of the loss function and the accuracy of the model after each batch's training are shown. The loss function shows some irregularities due to the particularities of the different batches; however, a consistent decrease in the error can be seen. In particular, a sharp drop is noted around batch 550, until it stabilizes near the value 0.1034. A similar behavior occurs with the neural network's accuracy, which shows sustained growth, with an abrupt jump around batch 550, stabilizing near 0.9646.
Fig. 2. Loss function (a) and accuracy (b) in neural network training
Once the best neural model obtained from the different iteration phases had been trained, its evaluation was carried out by visualizing the area under the ROC curve covered by the model when it makes predictions on the validation and test sets. This is illustrated in Fig. 3, where it can be seen that satisfactory results were obtained, covering 99% of the area.
Fig. 3. Area under the ROC curve for the validation (a) and test (b) sets
With the model trained, accuracy, precision, recall, and F-score were calculated using the test set results for the different classes (0 and 1), as well as the average made with the macro-average strategy. High values were obtained for all the metrics, exceeding 95% in each of them. The test set consisted of 10% of the total data, translating into 4,607 test examples. The values obtained for each evaluation metric of the neural network are shown in Table 1, where the macro-average F-score of 98% is particularly striking, this being an indicator of high efficiency for the neural classifier model.

Table 1. Evaluation metrics on the test data set.

Class          Accuracy  Recall  F score  Support
Macro average  0.98      0.98    0.98     4607
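The macro average in Table 1 is the unweighted mean of the per-class metrics. A minimal sketch with illustrative counts (not the paper's actual confusion matrix):

```python
# Precision, recall, and F-score from per-class counts, with macro
# averaging (unweighted mean over classes) as used in Table 1.
# The counts below are illustrative, not the paper's confusion matrix.

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def macro_average(per_class):
    """Unweighted mean of per-class (precision, recall, F) triples."""
    k = len(per_class)
    return tuple(sum(m[i] for m in per_class) / k for i in range(3))

# hypothetical per-class counts for classes 0 and 1
class0 = prf(tp=900, fp=20, fn=20)
class1 = prf(tp=3650, fp=17, fn=17)
macro = macro_average([class0, class1])
```

Unlike a support-weighted average, the macro average treats both classes equally, which matters here because class 1 (apply the correction) is far more frequent than class 0.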
Results of the experimentation described in section 4.3 are presented below. The WER averages for the different thresholds are computed over the totality of the 46,080 examples, with each threshold value used in 3,840 examples. Table 2 shows the average WER for the different thresholds and the relative reduction of the WER for the phonetic-neural hybrid algorithm. The baseline obtained using the ASR presented a WER of 0.338, so the relative reductions are computed taking that value as a reference.

Table 2. Average WER and relative WER of the phonetic corrector (PhoCo) and the hybrid model in relation to the WER of Google's ASR.

Threshold  PhoCo WER  Hybrid WER  WER rel. Google  WER rel. PhoCo
Average    0.247      0.217       36.0%            9.7%

From the results presented, it is observed that in configurations with small thresholds (0.05 and 0.10), the hybrid algorithm does not improve the WER relative to the original phonetic algorithm; at those settings, the use of the neural classifier is not a good strategy to carry out the final correction. However, from a threshold of 0.15 onwards, it shows a consistent improvement over the original phonetic algorithm, which increases notably as the threshold value grows, reaching a maximum relative WER reduction of 37.9% with respect to the standard phonetic version.

The WER relative to the hypothesis provided by Google's ASR shows a consistent reduction, reaching a maximum reduction of 43.9% with the PhoCo threshold set at 0.45. The hybrid algorithm shows consistent reductions in relative WER with respect to both the ASR and the straight phonetic transcription, exhibiting average improvements of 36% and 9.7%, respectively. Similarly, the hybrid model managed to obtain the minimum WER with the threshold set at 0.45, reducing the WER to 0.19, which, compared to the average WER of Google's ASR, represents an improvement of 14.8 percentage points in absolute WER and of 43.9% in relative terms.
6 Conclusions

From the results obtained in the experimentation, the usefulness of the hybrid phonetic-neural correction algorithm to reduce errors in Google's transcriptions is shown. It is observed that the hybrid algorithm manages to reduce the relative WER by up to 43.9%.

A consistent improvement of the phonetic-neural correction algorithm is shown over both the ASR transcription and the simple phonetic correction algorithm. An average reduction of 9.7% in the WER of the simple phonetic algorithm was observed.

Deep neural networks were an excellent strategy for modeling language patterns in specific domains, exhibiting an F-score of 0.98 and 99% area under the ROC curve. The neural classifier's contributions are more noticeable for higher phonetic correction threshold values, allowing more aggressive settings for this correction algorithm. Even in schemes where the simple phonetic algorithm reduces its performance due to false positive examples, the posterior use of the neural classifier is useful to maintain a lower WER compared to Google's ASR. Those results can be seen in Table 2.

The phonetic checker is a viable strategy for correcting errors in commercial ASRs, reaching a relative WER improvement of 40.7% with a threshold of 0.40. With the application of the neural classifier and the hybrid algorithm, it is possible to further reduce the WER using a 0.45 PhoCo threshold, achieving an improvement in the relative WER of 43.9%. These improvements are relevant in commercial-use ASRs, where even higher degrees of precision are needed.

Since the correction architecture is independent of the system used for transcription and of the application domain, the described strategy can be extended to different ASR systems and application domains. However, it is necessary to train a neural classifier for each of the different domains, so this approach cannot be used for knowledge transfer.

The results show that it is possible to implement a phonetic-neural hybrid strategy for ASR post-correction in near real-time. Since both the phonetic correction algorithm and the neural classifier are computational models susceptible to scaling, web services integration techniques can be used to perform post-correction in existing commercial ASR systems.

Among future research lines are the validation of the results with corpora from different application domains and experimentation with different phonetic correction parameters, including the context and the incorporation of original audio characteristics. Another foreseeable research line is the comparison with end-to-end deep learning algorithms, where a deep neural model generates the ASR correction directly.
Acknowledgments

To Carlos Rodrigo Castillo Sánchez, for his valuable contribution in providing the infrastructure for this article's experimentation.
References
1. Campos-Sobrino, D., Campos-Soberanis, M., Martínez-Chin, I., Uc-Cetina, V.: Corrección de errores del reconocedor de voz de Google usando métricas de distancia fonética. Research in Computing Science, 57–70 (2019)
2. Errattahi, R., El Hannani, A., Ouahmane, H.: Automatic speech recognition errors detection and correction: A review. Procedia Computer Science, 32–37 (2018). https://doi.org/10.1016/j.procs.2018.03.005, 1st International Conference on Natural Language and Speech Processing
3. Feld, M., Momtazi, S., Freigang, F., Klakow, D., Müller, C.: Mobile texting: Can post-ASR correction solve the issues? An experimental study on gain vs. costs. International Conference on Intelligent User Interfaces, Proceedings IUI (05 2012). https://doi.org/10.1145/2166966.2166974
4. He, Y., Sainath, T.N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., Liang, Q., Bhatia, D., Shangguan, Y., Li, B., Pundak, G., Sim, K.C., Bagby, T., Chang, S., Rao, K., Gruenstein, A.: Streaming end-to-end speech recognition for mobile devices. CoRR abs/1811.06621 (2018), http://arxiv.org/abs/1811.06621
5. Hieronymus, J.L.: ASCII phonetic symbols for the world's languages: Worldbet. Technical report, Bell Labs (1993)
6. Hualde, J.: The sounds of Spanish. Cambridge University Press (2005)
7. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016)
8. Malykh, V.: Robust to noise models in natural language processing tasks. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. pp. 10–16. Association for Computational Linguistics, Florence, Italy (Jul 2019). https://doi.org/10.18653/v1/P19-2002
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)
10. Raghavan, H., Allan, J.: Matching inconsistently spelled names in automatic speech recognizer output for information retrieval (01 2005). https://doi.org/10.3115/1220575.1220632
11. Ruder, S.: A survey of cross-lingual embedding models. CoRR abs/1706.04902 (2017), http://arxiv.org/abs/1706.04902
12. Shivakumar, P.G., Li, H., Knight, K., Georgiou, P.G.: Learning from past mistakes: Improving automatic speech recognition output via noisy-clean phrase context modeling. CoRR abs/1802.02607 (2018), http://arxiv.org/abs/1802.02607
13. Twiefel, J., Baumann, T., Heinrich, S., Wermter, S.: Improving domain-independent cloud-based speech recognition with domain-dependent phonetic post-processing. vol. 2, pp. 1529–1535 (07 2014)
14. Viana-Cámara, R., Campos-Sobrino, D., Campos-Soberanis, M.: Optimización evolutiva de contextos para la corrección fonética en sistemas de reconocimiento del habla. Research in Computing Science, 293–306 (2019)
15. Zhang, S., Lei, M., Yan, Z.: Automatic spelling correction with transformer for CTC-based end-to-end speech recognition. ArXiv abs/1904.10045