[PDF] Arabic Speech Recognition by End-to-End, Modular Systems and Human

Abstract

Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to human transcribers, which led researchers to debate if the machine has reached human performance. Previous work focused on the English language and modular hidden Markov model-deep neural network (HMM-DNN) systems. In this paper, we perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition (HSR) on the Arabic language and its dialects. For the HSR, we evaluate linguist performance and lay-native speaker performance on a new dataset collected as a part of this study. For ASR the end-to-end work led to 12.5%, 27.5%, 33.8% WER; a new performance milestone for the MGB2, MGB3, and MGB5 challenges respectively. Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.5% on average.

Full PDF

AArabic Speech Recognition by End-to-End, ModularSystems and Human

Amir Hussein a,b, , Shinji Watanabe c , Ahmed Ali a a HBKU, Qatar Computing Research Institute Doha, Qatar b Kanari AI, Pasadena, California c Carnegie Mellon University

Abstract

Recent advances in automatic speech recognition (ASR) have achieved accu-racy levels comparable to human transcribers, which led researchers to debateif the machine has reached human performance. Previous work focused onthe English language and modular hidden Markov model-deep neural network(HMM-DNN) systems. In this paper, we perform a comprehensive benchmark-ing for end-to-end transformer ASR, modular HMM-DNN ASR, and humanspeech recognition (HSR) on the Arabic language and its dialects. For theHSR, we evaluate linguist performance and lay-native speaker performance ona new dataset collected as a part of this study. For ASR the end-to-end workled to . %, . % , . % WER; a new performance milestone for the MGB ,MGB , and MGB challenges respectively. Our results suggest that human per-formance in the Arabic language is still considerably better than the machinewith an absolute WER gap of . % on average. Keywords:

Dialectal Arabic, End-to-end speech recognition, Human speechrecognition, Modern Standard Arabic, Transformer

1. Introduction

Automatic Speech Recognition has shown fast progress recently, thanks toadvancements in Deep Neural Network (DNN) which has brought remarkable

Email addresses: [email protected] (Amir Hussein), [email protected] (Shinji Watanabe), [email protected] (Ahmed Ali)

Preprint submitted to Journal of Computer Speech and Language January 21, 2021 mprovements in reaching human-level performance [1]. Traditional ASR sys-tems employ a modular design, with different modules for acoustic modeling,pronunciation lexicon, and language modeling are trained separately. More re-cently, end-to-end (E2E) models are trained to convert acoustic features to texttranscriptions directly, potentially optimizing all parts for the end task [2]. Withthe performance of ASR systems reaching closer to that of a human, several ef-forts embarked to benchmark the performance of state-of-the-art ASR systemsagainst professional transcribers [3, 4]. In [1] the researches showed that E2EASR can achieve competitive performance in simple speech recognition taskslike reading newspaper. After that, a study by Microsoft [3], suggested thatthe ASR systems have already reached the level of the professional human tran-scriber in more difficult ASR tasks like conversational speech. On the otherhand, a study by IBM [4] suggested that human parity in conversational speechis still considerably better. The aforementioned studies were conducted on En-glish language and were not explored in morphology complex languages likeArabic. The Arabic language is the largest Semitic language with a high degreeof affixations and derivations, which result in a huge increase in the number ofword forms. Remarkably, the 400 million speakers (estimated in 2020) Arabicnative speakers use Dialectal Arabic (DA) as their way of communication in theday-to-day speech. DA does not have standard orthographic rules. It can beargued that a language is a dialect with an army and navy [5]. If we take thisperspective into consideration, we can describe the different Arabic dialects asdifferent languages. However, Arabs in general perceive dialects as a deterio-ration from the classical Arabic, almost using all the same Arabic letters. Anobjective comparison of the varieties of Arabic dialects could potentially lead tothe conclusion that Arabic dialects are historically related, but not synchron-ically, and are mutually unintelligible languages like English and Dutch. Thismakes Arabic an excellent choice to highlight the challenges of speech recogni-tion in the wild.Till today, the state-of-the-art ASR in the Arabic language comes from mod-ular Hidden Markov Model Deep Neural Network (HMM-DNN) modular sys-2ems. Lately, the best ASR results on the Modern Standard Arabic (MSA) datawere reported by the Aalto University team [6], with WER of 13.2% on MGB2test set using a complex modular system. There are three major challengeswhen developing speech recognition models for the Arabic language:• Arabic is a consonantal language with most of the available text is non-diacritized. As a result, it is challenging to determine the location of thevowels, which can convey different meanings.• Existence of different Arabic dialects with limited labeled data. Eachdialect is a native Arabic language that is spoken, but not written, as itdoes not have standardized orthographic rules.• Arabic morphological complexity with a high degree of affixation andderivation that makes it challenging to estimate probabilities for the lan-guage model and increases the out-of-vocabulary (OOV) rate.Several attempts have been made to address each of the aforementionedchallenges. To address the nondiacritized words ambiguity, researchers in [7]utilized sequence-to-sequence deep learning model, inspired by Neural MachineTranslation, to restore the missing diacritics. The proposed approach achievednew state-of-the-art with word error rate of 4.49%. To address the challenge oflimited dialect speech data, several transfer learning approaches were proposed[8, 9] that utilize similarity between MSA and dialectal Arabic (DA) speech.Finally, to deal with morphological complexity and OOV, the character-levellanguage model (LM) was suggested by [10]. However, the major limitationwith character level LM is that it is difficult to capture the contexts of the wordin a sentence. Furthermore, the previous approaches mainly used HMM-DNNmodels that have several limitations including model complexity, conditionalindependence assumption and requirement of linguistic resources.The core contribution of this study lies in a new comprehensive analysiscomparing the E2E transformer ASR, the modular HMM-DNN ASR, and theHSR. To avoid biases in our analysis, we collected a new evaluation set of 33ours containing news reports and conversational speech with both MSA andDA. To better understand human performance, we hired expert linguists andeducated native speakers to perform HSR task on the new hidden set. To thebest of our knowledge, this is the first work that compares head to head singleE2E, modular, and HSR performance. The main question that we address inthis work is whether there are major qualitative differences between the HSRand the state-of-the-art machine results in the Arabic ASR as shown in Figure 1.In this work, we develop the first E2E transformer ASR for the Arabic language.Furthermore, we provide the best practices for finetuning the transformer ASRfor dialectal Arabic . The proposed approach advances the Arabic ASR by Figure 1: High-level illustration of the core study conducted in this paper. The source code has been made publicly available on Espnet Github repositoryhttps://github.com/espnet/espnet/tree/master/egs/mgb2/asr1 .In summary, the key contributions of our work include:• A comprehensive assessment of the human performance on Arabic speechrecognition system, analyzing the type of errors and correlation with themachine.• A new milestone of the Arabic speech recognition performance with E2Etransformer architecture for MSA and DA tasks.• As a part of HSR versus machine study, we provide a new hidden test setcombining both MSA and DA to avoid being biased in our analysis, asprevious test sets have been made public.An additional contribution of our work is developing a voice activity detection(VAD) pipeline for the E2E transformer to address the problem of very longspeech segments, which is a typical situation in a practical ASR setup. The de-veloped pipeline combines the speech detection precision of InaSpeechSegmenter[13], with the maximum length of the segmentation feature of energy-basedVAD[14]. Finally, we plan to provide a new benchmark, including the state-of-the-art recipe, and pre-trained models and make them publicly available for thecommunity. In case of acceptance, the code will be made publicly available as part of Espnet officialrecipe

2. Related Work

This section highlights prior work in ASR approaches, which can be groupedinto modular HMM-DNN systems and E2E. In addition, we present a subsectionabout the studies that conducted comparisons between human and machineperformance on speech transcription.

For a long time, the HMM with Gaussian Mixture Model (GMM) has beenconsidered the mainstream model for large vocabulary continuous speech recog-nition (LVCSR) achieving best recognition results. In [15], authors proposedthe first HMM-DNN hybrid approach, where GMM was replaced with the deepneural network model. It achieved significant performance gains compared tothe HMM-GMM legacy system in the LVCSR task. After that, several DNN ar-chitectures were explored for the acoustic modeling, including Recurrent NeuralNetworks (RNNs), Bidirectional RNNs (BDRNNs), and deep conditional ran-dom fields, which showed prominent performance improvement [16, 17]. In [18],the authors explored using dropout to improve generalization in DNN training.They reported that combining a time delay neural network (TDNN) with longshort term memory (LSTM) layers outperformed bidirectional LSTM (BLSTM)acoustic modeling. In [12], researchers targeted addressing MSA task using sim-ilar architecture to [18], combining three AM models trained with Lattice-freeMaximum Mutual Information (LF-MMI) objective function. They combinedTDNN, LSTM, and BLSTM using the Minimum Bayes Risk (MBR) decoding6riterion and achieved first place in the MGB2 Arabic Broadcast Media Recog-nition challenge with a WER of 14.2%. After that, in MGB-3 challenge [19] ateam from Aalto University combined over 30 systems using MBR, includingtwo acoustic models (TDNN-BLSTM) and a variety of language models (char-acter, sub-word, and word-based) [6]. Aalto team achieved a WER of 13.2% onMGB2 and 37.5% on MGB3, which is to the best of our knowledge, the state-of-the-art results. The main disadvantage of the modular HMM-DNN system isits complexity consisting of various components, including the acoustic model,language model, and pronunciation model. Each component of the system isoptimized independently with a different objective function, which usually leadsto a local optimum. The HMM-DNN systems often use conditional indepen-dence assumptions (Markov assumptions), which does not match the real-worlddata. In addition, HMM-DNN systems require linguistic resources, such ashandcrafted pronunciation dictionary that is subject to human error.

E2E deep learning models were introduced to simplify the complexity ofHMM-DNN modular models into a single deep network architecture and ad-dress the aforementioned limitations. The main issue with E2E sequence tosequence models was data alignment [20]. Several approaches were introducedto address the data alignment problem, including connectionist temporal classi-fication (CTC) based model [2] and attention-based sequence to sequence model[21, 22]. The E2E CTC approach uses Markov assumptions to efficiently solvesequential problems with dynamic programming. However, the main limitationof CTC is the assumption that the label outputs are conditionally independentof each other. On the other hand, the attention-based E2E model uses an at-tention mechanism to perform alignment between acoustic frames and label se-quence without the independence assumption, yet it can result in nonsequentialalignments. In [23], the authors proposed a multitask CTC/attention approach,which effectively utilizes the advantages of both architectures in training anddecoding. The proposed multitask CTC/attention approach is based on long7hort term memory (LSTM) sequence to sequence model. In the literature, wefound only two works that are related to E2E models for Arabic speech data[10, 24]. In [10] the authors proposed an E2E model based on Bidirectional Re-current Neural Networks (BRNN) with CTC. However, the results in [10] wereonly reported on the MGB2 developments set with no results on the test set. In[24], the researchers studied the learned internal representations in DeepSpeech2E2E [1] on phonemes and graphemes classification tasks. The transformer-basedarchitecture was first introduced as a neural machine translation system [25] toreplace the recurrence with the self-attention mechanism. In [26, 27], authorsshowed superiority performance of transformer-based models compared to state-of-the-art recurrent networks in ASR task. Besides, the E2E transformer-basedASR showed close performance to the state-of-the-art HMM-DNN systems [28].In this paper, we develop an E2E transformer-based ASR model with a multi-task CTC/attention objective function for Arabic speech data.

Advances in ASR technology produced major improvements reaching therange of human performance. In two papers by Microsoft [3, 29], authors sug-gested that the machine has already reached the human performance in En-glish ASR. Microsoft reported that their improved ASR outperformed experttranscriber by 0.1% and 0.2% WER on Switchboard and CallHome datasets,respectively. In their study, an existing Microsoft transcription pipeline wasleveraged, in which transcription is conducted on a weekly basis. The transcrip-tion was conducted on NIST 2000 test set with two passes. In the first pass, atranscriber works from scratch to transcribe the data and, in the second pass,a second listener monitors the data to do error correction. On the other hand,IBM [4] conducted an independent set of human performance measurements toverify the aforementioned claims by Microsoft. IBM found that human perfor-mance is considerably better outperforming their state-of-the-art ASR by 0.4%and 3.5% WER on Switchboard and CallHome datasets, respectively. Unlikethe previous Microsoft setup, IBM experiment transcribers were aware of the8xperiment and were actively involved. Three independent transcribers wereused for the first pass in addition to quality control by a fourth senior tran-scriber. The final performance was chosen based on the lowest transcriber worderror rate (WER). Unlike previous studies where several systems were combinedin the favor of the machine, in this paper, the comparison is conducted versushuman with a single machine system at a time. Furthermore, in this study, bestpractices of the two aforementioned studies are followed to set the most realisticand unbiased comparison between human and machine as described in Section4.1.

3. ASR Models Description

The acoustic model for the modular system is trained using the 1,200 hoursMGB2 [30]. For language modeling, we use the 130M words crawled fromthe Aljazeera Arabic website from the period 2000 - 2011, as provided for theMGB2 challenge. LM experiments used a grapheme lexicon of 1.3M words. Thegrapheme-based lexicon has a 1:1 word-to-grapheme mapping, which means thevocabulary size is the same as the lexicon size.

Acoustic modeling : In this study, we adopt the architecture proposed by[18], which consists of combining a time delay neural network (TDNN) withlong short term memory (LSTM) layers and showed significantly better resultscompared to bidirectional LSTM (BLSTM) acoustic modeling. The TDNN-LSTM model consists of 5 hidden layers, each layer containing 1,024 hiddenunits. Neural networks are trained using lattice-free maximum mutual informa-tion (LF-MMI) [31]. Acoustic models are built using Kaldi ASR toolkit [32].

Language modeling : Two n -gram LMs are trained: a big four-gram LM(bLM ), trained using the spoken transcripts and the 130M words backgroundtext; and a smaller four-gram LM obtained by pruning bLM using pocolm . https://github.com/danpovey/pocolm .In an attempt to be consistent with the E2E system, we explored using the time-restricted self-attention layer for the acoustic modeling as described in [33]. Wealso investigated using the sub-word tokenization [11] for language modeling.However, the impact on WER was small, therefore, we decided to adopt theTDNN-LSTM architecture as this is the state-of-the-art for a modular system.Throughout the paper we refer to this modular system as HMM-DNN. Transformer is a recent sequence to sequence model that completely replacedthe recurrence in traditional recurrent networks with self-attention mechanismand sinusoidal position information. In this paper, we utilize transformer-basedarchitecture for Arabic ASR, as shown in Figure 2. On a very high level, thetransformer consists of an encoder model with M repeating encoder blocks anda decoder model with N repeating decoder blocks. The encoder model mainlymaps the input vector to a latent representation. The input to the encoder X is asequence of 83-dimensional feature frames, 80-dimensional log Mel spectrogramwith pitch features [34]. The decoder generates one prediction at a time in anauto-regressive fashion. At each time step, the input to the decoder model is thelatent representation for the encoder model and previous decoder predictions. First, the acoustic feature frames X are transformed into sub-sampled se-quence X s ∈ R d sub × d model with 2D-CNN sampling layer. The d sub is the lengthof the output sequence and d model is the number of input feature dimensions tothe Encoder. Each encoder block consists of sub-layers: a multi-head self-attention mecha-nism and a position-wise fully connected network. The output of each sub-layer10 igure 2: Illustration of E2E transformer-based ASR architecture. The input to the encoderduring the training is a sequence of feature frames. The output of the encoder is fed as inputto the decoder in addition to the masked target transcription. The output of the decoder isthe prediction of the masked transcription is followed by a layer normalization [35] with a residual connection from thesub-layer input [36]. The input to the first encoder block is the sub-sampledsequence input X s . At the self-attention sub-layer, the X s is transformed toqueries Q = X s ∗ W q , keys K = X s ∗ W k , and values V = X s ∗ W v , where W q and W k ∈ R d model × d k and W v ∈ R d model × d v are learnable weights. The d model is the output dimension from the previous attention layer, and d v , d k = d q arethe dimensions of values, keys and queries. After that, a normalized weightedsimilarity Z from self-attention is obtained with a softmax as shown in Eq 1. Self Attention ( Q , K , V ) = sof tmax ( Q ∗ K T √ d k ) ∗ V (1)11o deal with multiple attentions, multiple attention sub-layers are used in par-allel, usually referred to as multi-head attention (MHA). The MHA is obtainedby concatenating all of the self-attention heads at a particular layer.MHA ( Q , K , V ) = [ Z , Z , · · · , Z h ] W h Z h = SelfAttention ( Q h , K h , V h ) (2)where h is the number of attention heads in a layer. The output from MHA isnormalized and then fed to the Feed Forward (FF) sub-layer connected networkwhich is applied to each position separately.FF ( z [ t ]) = max (0 , z [ t ] ∗ W + b ) W + b (3)where z [ t ] is the t th position of the input Z to the FF sub-layer. The decoder architecture is very similar to the encoder. However, in additionto MHA self-attention and fully connected sub-layers, it has a third masked self-attention layer. The masked self-attention in the decoder is allowing to attendonly to earlier positions in the output sequence. The decoder prediction ˆ Y [ t ] ateach time step is conditionally dependent on the final representation producedby the encoder H e and the previous target sequence Y [1 : t − . Similar to theencoder, the decoder has residual connections and layer normalization aroundeach sub-layer. Positional encoding is added to the input embeddings to reflect the positionalcontext of each word in the sentence. Transformers use sinusoidal positionalencoding with different frequencies as shown in Equation 4.

P E ( n, i ) = sin (cid:16) n /10000 i / d model (cid:17) P E ( n, i +1) = cos (cid:16) n /10000 i / d model (cid:17) (4)where n is the position of a word in the sentence and i is the dimension.12 .7. Transformer ASR training During training, the acoustic model predicts the posterior probability of thetranscription Y given acoustic features X . The total objective function of theacoustic model L asr is a multi-task learning objective that combines the E2Edecoder loss L d = − log [ P d ( Y | X )] and the CTC loss L ctc = − log [ P ctc ( Y | X )] .This multi-objective function was proposed to improve model robustness andfaster convergence [23]. L asr = α L ctc + (1 − α ) L d (5)where P d are the probabilities predicted by transformer decoder, P ctc are theCTC probabilities, and α is a weighting factor that trades-off the two losses. During the inference mode, a language model (LM) is used to disambiguatebetween hypothesised words generated by the decoder. In particular, two typesof LM models are used in this paper: long short term memory (LSTM) andtransformer-based language model (TLM). The LM model prediction is com-bined with E2E as shown in Eq. 6. ˆ Y = argmax Y ∈ Y ∗ { λ L ctc + (1 − λ ) L d + µ L lm } (6)where L lm = P lm ( Y ) is the language model prediction, Y ∗ is a set of hypothesisof the target sequence, µ and λ are trade-off factors. The audio segmentationduring the inference in an experimental setup is usually assumed to be preparedby an expert transcriber. In Section 5.6, we study the effect of segment durationvariability on E2E transformer performance.

4. Experimental Setup

The proposed E2E transformer approach is benchmarked with state-of-the-art approaches on MSA task with (MGB2) data and on dialectal task with(MGB3, MGB5) data. For MSA evaluation, the conventional word error rate13WER) is used. However, for dialectal data evaluation, the multi-referenceword error rate (MR-WER) and averaged WER (AV_WER) are adopted fromMGB3, MGB5 challenges [19, 37]. The MR-WER was proposed to evaluatedialectal data, which does not have standardized orthography [38]. All modelsare implemented using Espnet toolkit [39]. We ran our experiments on an HPCnode equipped with 4 NVIDIA Tesla V100 GPUs with 16 GB memory, and 20cores of Xeon(R) E5-2690 CPUs.

Two professional linguist transcribers were hired independently and theirquality transcription was checked by a third senior transcriber with extensiveexperience in the linguistic annotation. In addition, another three educated na-tive Arabic speakers (not linguists) were hired to transcribe the same data. Thetranscribers were not aware of the experiment conducted and they performedthe transcription as a part of their daily transcription tasks. The same guide-lines were provided to all transcribers to ensure the consistency and the qualityof the transcription, which included guides for audio segmentation, truncatedwords, words from other languages, hesitation, etc. There were no restrictionson the number of times to listen to the speech. We found that, on average, tran-scribers needed to listen between 2-4 times for each sentence. As part of ourstudy, hours of Arabic speech data from Aljazzera news channel in the periodfrom (June - August) 2020 was collected. The data included a variety of conver-sations, interview programs and reports by journalists from the field. Around10% of the data is in Dialect Arabic, including Egyptian, Gulf, Levantine, andNorth African. This dataset is also used as a final hidden test (hereafter referredto as Hidden_Test ) as it was not seen by any model before. As part of our ongoing effort, this dataset will be publicly accessible on codalab for bench-marking as the final hidden test set for Arabic ASR challenge. .2. Model development data In this work, the Arabic Multi-Genre Broadcast (MGB2) corpus [30] wasused for model training. Around 70% of the data is considered Modern StandardArabic (MSA), with the rest in Dialectal Arabic including Egyptian (EGY),Gulf (GLF), Levantine (LEV), and North African (NOR). The dataset spanis more than 10 years recording, during 2005-2015, from 19 distinct programs,and contains around 1,200 hours. The programs include conversations (63%),interviews (19%), and reports (18%). The conversational speech is the mostchallenging because it includes overlapping speech with multiple dialects. Allprograms were aligned using the QCRI Arabic LVCSR system [40], which isgrapheme-based with one unique grapheme sequence per word. Moreover, thedataset includes a large corpus of background text that can be used to builda language model. The text corpus consists of over 130 million words crawledfrom the Aljazeera.net website. BuckWalter mapping format is used for thetranscriptions and the background text data. More details about the data canbe found in Table 1. In this study, the proposed E2E transformer ASR is benchmarked on twodialectal real-world datasets; the Egyptian MGB3 [19] and the Moroccan MGB5[37]. The MGB3 dataset comprises of 16 hours of speech obtained from 80YouTube videos, while MGB5 consists of 13 hours of speech extracted from 93YouTube videos. Both datasets are distributed across seven genres: comedy,cooking, family/kids, fashion, drama, sports, and science talks (TEDx).

The raw audio segments were first augmented with the speed perturbationapproach, which increased the original signal by a factor of three with speed Buckwalter is a one-to-one mapping allowing non-Arabic speakers to understand Arabicscripts, and it is also left-to-right, making it easy to render on most devices. able 1: Arabic datasets description used in this study. Dataset Type Hours Programs

Training 1200 2,214 370K

MGB2

Development 10 17 5002Evaluation 10 17 5365Adaptation 4.6 23 2202

MGB3

Development 4.8 24 2181Evaluation 6 30 5746Adaptation 10.2 69 31063

MGB5

Development 1.3 10 1129Evaluation 1.4 14 1055

Hidden_Test

Evaluation 3 7 1404 factors of 0.9, 1.0 and 1.1 [41]. Each augmented audio was transformed toa sequence of 83-dimensional feature frames for the E2E model, and an 80-dimensional log Mel spectrogram with pitch features [34]. In addition, theresulted mel-spectrogram features were augmented with specaugment approach[42], which warps the data in time direction by masking blocks of consecutivefrequency channels and masking blocks of utterances in time. As for the textdata for the language model development, two sources were considered: thetranscription text and the background text of 130 million words . The datawas cleaned by removing punctuations, diacritics, extra empty spaces, newlines,and single-character words. To overcome the problem of very long sequences,the text was segmented to contain a maximum of 200 words with an overlap of50 words [43]. The sub-word model [11] was used to tokenize the input text andprepare the vocabulary. .5. Default Model Hyperparameters All hyperparameters were obtained using a grid search. The parameterstuning was performed on a small subset of MGB2 data (250 h). The E2Etransformer-based ASR model was trained using Noam optimizer [25] with alearning rate of 5. The best values for multi-objective tradeoff weights: α inEquation 5, µ and λ in Equation 6 were found to be 0.3, 0.3 and 0.5, respectively.Table (2) summarizes the best set of parameters that were found for AM andLM transformer architecture . As for LSTM LM, the best results were obtainedwith 2 layers and 650 units/layer. The LSTM LM was trained with a batch-sizeof 512 and a stochastic gradient descent algorithm with a learning rate of 1. Table 2: Values of tuned hyperparameters for E2E AM transformer and LM transformerobtained from grid search.

AM Hyperparameters LM Hyperparameters

Input batch-bins: batch-size: Encoder layers, attention heads / layer layers, attention heads / layerDecoder layers, attention heads/layer layers, attention heads/layer d model (attention)

512 512

FFN

5. Results and Discussion

The developed E2E transformer is benchmarked with the state-of-the-artmodular system; [12], Aalto system [6] and the expert linguist. Table 3 summa-rizes the WER results on the MSA datasets: the MGB2 and the Hidden_Testset. It can be seen that the proposed E2E Transformer ASR outperforms thesingle HMM-DNN and Aalto systems by 20% and 5% in relative WER, re- The source code to reproduce the results is made publicly available on Espnet Githubrepository https://github.com/espnet/espnet/tree/master/egs/mgb2/asr1 able 3: WER% performance of E2E transformer, HMM-DNN and state-of-the-art Aaltoapproach and expert linguist. HMM-DNN Aalto [6]

E2E Transformer Expert linguistMGB2 - Hidden_Test spectively. On the other hand, results on the Hidden_Test illustrates that themachine error rate is worse than the expert linguist by around 4%.

In this section, we analyze in more details the type of errors and correlationbetween the expert linguist (Linguist), native speaker (Native), E2E transformerand HMM-DNN ASR systems. Figures (3,4) illustrate the inter-annotationdisagreement on the Hidden_Test data for raw and normalized text. It canbe seen that the inter-annotation disagreement between the two linguists ison average 11.45%. After analyzing these results, it has been found that theinter-annotation disagreement between the two linguists is mainly caused bythe variability in transcribing dialectal speech, which does not have orthogra-phy standards. For the comparison, we define the inter-annotation disagreementgap G () (hereafter referred to as gap) between member a i a member of groups A and group B as G ( a i , B ) = J + K (cid:80) Jj =1 (cid:80) Kk =1 abs ( disag ( a i , b j ) − disag ( b j , b k )) ∀ j (cid:54) = k , where disag () is the inter-annotation disagreement. One can observefrom Figure 3 that the gap of inter-annotation disagreement between our bestE2E transformer ASR and the expert linguist is on average 3.55%. This is thegap that the machine needs to overcome to achieve the expert linguist tran-scription performance. In addition, the inter-annotation disagreement betweenthe machine (E2E transformer, HMM-DNN) and the linguist is more than 5%lower than the native educated transcriber. We observed that the expert lin-guists tend to pay more attention than Native speakers to linguistic mistakesespecially with Alif/Ya/Ta-Marbuta, which are common mistakes in the Arabic18 igure 3: Confusion matrix of inter-annotation disagreement from raw transcription text. Therows represent hypothesis and columns represent corresponding reference. language.Next, the disagreement was reduced with common Alif/Ya/Ta-Marbuta nor-malization. The normalization removes distinctions within three sets of char-acters that are often written inconsistently in DA and sometimes in MSA: Alifforms ( A = ا , > = أ, < = إ , | = آ) , Ya forms ( y = ي , Y = ى), and Ta-Marbutaforms ( p = ة , h = ه). The disagreement of normalized transcription between the lin-guist and the Native was significantly reduced by up to 18% absolute, which shows thatalmost half of the disagreement between the linguist, and the Native is due to linguisticmistakes. On the other hand, the normalization has almost no effect on the E2E trans-former transcription and the inter annotation disagreement gap between the linguistand the E2E transformer increased from 3.6% to 4.75%. It can be noticed that even igure 4: Confusion matrix of inter-annotation disagreement from normalized transcriptiontext. after the transcription normalization, the gap between Native speaker and the linguist,4.8%, is still slightly higher than the gap between E2E transformer and the linguist.Furthermore, we take a closer look at the most frequent top ten errors in termsof substitutions, deletions, and insertions made by the linguist and E2E transformer,as shown in Tables 4 and 5. Similar errors from both the E2E transformer and thelinguist are highlighted with the same color. Inspections revealed that the top errorsmade by both human and machine are substantially similar, especially the insertionsand deletions. Looking at the substitutions, one can notice that most of the errors forboth the linguist and E2E are on dialectal words. The main difference between E2Eand the linguist is that the E2E tends to make more errors on rare Arabic words likenames. The insertion and deletion patterns are similar for both the linguist and E2E: able 4: Most common substitutions for E2E ASR system and expert linguist. The numberof times each error occurs is followed by the word in the reference and the correspondinghypothesis. E2E Transformer linguist

BW Translation BW Translation12: nh it is 11: nh it is12: AldEwY / AldEwp lawsuit 6: btAEh / btAEt possessed7: AljAbry / Aljbr Aljabry (name) 7: lk / lky your6: dyAb / AldyAb Dyab (name) 6: nHnA / nHn we5: btAEh / btAEt possessed 4: tfDl / AtfDl go ahead5: nHnA / nHn we 4: fy / fyh in4: Ally / Alty which 4: l>n / >n to4: AlwA$nTn / wA$nTn Washington 3: AlmAyh / AlmwyA water4: l>n / l}n to 3: ln / lm did/will not3: lAzAlt / zAlt still 3: bdt / bd>t started prepositions are the most frequent errors. It can be seen that the deletion rate of E2Ecompared to the linguist is somewhat higher than the substitution and insertion rates.This makes sense as the linguist is more careful in transcribing everything that is beingheard.

In this section, the results of both the HMM-DNN and the E2E transformer are ana-lyzed, pointing out the advantages and disadvantages of each. The correct transcriptionin the examples are highlighted in green and the corresponding errors are highlightedin pink. • Dialectal Arabic : in the case of dialects and overlapped speech, the E2E generatesmore accurate transcriptions. The E2E transformer is able to learn the contextmuch better with the self-attention mechanism and capture the semantics in boththe standard Arabic structure as well as different Arabic dialects.

REF : التقسيطفيشيءسلبيوفيشيءإيجابيهلأفيشيءضرورياتمثلاللبيت able 5: Most common insertion and deletions for E2E ASR system and expert linguist. Insertions DeletionsE2E Transformer linguist E2E Transformer linguist

BW Translation BW Translation BW Translation BW Translation20: >n that 16: fy in 42: yEny means 21: nEm yes18: >w or 12: mn from 39: nEm yes 17: fy in13: Alh his 12: nEm yes 16: >n that 13: >n that12: fy in 10: >n that 16: fy in 10: mA what9: lA no 9: mA what 16: bn son 9: lA no8: mA what 9: w and 13: mn from 6: Al|n now6: w or 12: lA no 6: hw him5: hw him 6: yEny means 10: Tyb alright 6: yEny means4: >nA me 5: lA no 11: hw him 5: yA (calling)

REF_BW : AltqsyT fy $y' slby wfy $y' fy $y' DrwryAt mvlA llbyt .

Translation : The installment has negative things and positive things now there arenecessities, for example for a house.

E2E : fy $y' slby fy $y' nA fy $y' DrwryAt mvlA llbyt

HMM-DNN : hy t>Syl b$y' slby b$y'

REF : تصريحلرافسنجانيفيأكتوبرألفينوتمانيةحينماكانالأسديفاوضإيرانسرا

REF_BW : tSryH lrAfsnjAny fy >ktwbr >lfyn wtmAnyp HynmA kAn Al>sd yfAwD

Translation : Rafsanjani's statement in October two thousand and eight while Assadwas secretly negotiating with Iran

E2E : tSryH rfsnjAny fy >ktwbr >lfyn wtmAnyp HynmA kAn Al>sd yfAwD ktwbr HynmA kAn Al>sd yfAwD

REF : الفيديوبعدماانتشربهيكسرعةيعنيأعطانيدافعإنيأصيرهيككلأحداث

REF_BW : Alfydyw bEd mA Ant$r bhyk srEp yEny >ETAnydAfE Syr hyk kl >HdAv

Translation : The video after it spread with such speed. I mean it gave me anincentive that I would like that all events.

E2E : fy $hr mAyw fy $hr mAyw fy bEd mA Ant$r bhyk bsrEp yEny >ETAnydAfE Eny >Syr hyk kl >HdAv

HMM-DNN : Alfydyw bEd mA Ant$r bhyk srEp yEny >ETAny dAfE Syrhyk kl >HdAv

In this sections, the effect of the size of the data on both our proposed E2E Trans-former and HMM-DNN modular systems is examined. The configuration used forboth E2E Transformer and HMM-DNN modular systems is described in Section 4.5.The size of the training data was chosen from the following points (250h, 550h, and1200h). For consistency, the development and testing data were kept the same for alltraining data sizes. The performance of both E2E and modular systems for each datasize is illustrated in Figure 5. It can be seen that the performance of the modular ASRsystem is much better at datasize lower than around 400h. However, as the datasizeincreases, the E2E performance improves with much steeper trend compared to themodular system, which indicates that with more data, the E2E model is expected toshow further improvements. In addition, the results show that the E2E performanceoutperforms the modular system after around 400h of data size, which dispel the myth igure 5: Performance of HMM-DNN and E2E ASR systems measured in WER for differenttraining datasizes. about the need of a huge amount of private data to match the performance of a modularsystem [44]. In this section, the effect of the LM on the acoustic E2E transformer performanceusing the transcription text (TR) and background (BG) text is investigated. The WER%and the real time factor (RT) of the acoustic E2E transformer with RNN LM andTransformer LM architectures are summarized in Table 6. It can be seen that as the

Table 6: Comparison (WER% and RT factor) of E2E transformer-based models with differentLM rescoring configuration on MGB2 test set. The language models were trained with thetranscription text (TR) and background (BG) text

Method Beam 20 Beam 5 Beam 2

WER RT WER RT WER RTE2E_Transformer+LSTM_LM (BG+TR) 13.4 3.65 14.6 0.9 14.7 0.44E2E_Transformer+Transformer_LM (BG+TR) 12.7 5.87 13.1 1.49 13.5 0.62E2E_Transformer+Transformer_LM (TR) 12.6 5.87 13 1.49 13.4 0.62E2E_Transformer able 7: E2E LM standardized confusion pairs E2E Transformer E2E Transformer LM Translation dy/hAy دي/هاي h*h هذه this

SegmentationEvaluation set

Hs IS Imp_ISMGB2_Test 14.7 72 19.1Hidden_Test 13.5 32.2 15.3

In this section, the performance of E2E transformer is studied on two dialectal Ara-bic ASR challenges MGB3 and MGB5 [19, 37]. For more details about the data, referto section 4.3 and Table 1. As DA is lacking standard orthographic rules as well assizable transcribed data, it is considered to be an excellent choice to highlight the chal- enges of speech recognition in the wild. The best E2E transformer obtained in Section5.5 is benchmarked with the MGB3 and the MGB5 state-of-the-art results. The E2Etransformer was fine-tuned on the MGB3 and the MGB5 adaptation sets independently.The hyper-parameters for the transformer model are similar to what was used for theMGB2 training described in Section 4.5 except that the learning rate is reduced to 0.1and no warm-up steps. The model is initialized with the E2E transformer parameterspretrained on the MGB2 from Section 5.5. For data preprocessing, the same data aug-mentation described in Section 4.4 was followed. Each sentence in the adaptation andthe development data were transcribed by four different annotators to explore the non-orthographic nature of dialectal Arabic. In our experiments, the transcripts from thefour transcribers were combined, which increased the amount of the data four times.Given that participants in the MGB3 and the MGB5 challenges have no access to thetest set, we report two sets of results: 1) Fine-tune the model using the adaptation setonly for epochs and use the development set to monitor the performance and selectthe model with the best result; 2) Fine-tune the model on adaptation set for epochsand then train the best configuration obtained on both adaptation and development setscombined for an additional epochs. The results of both approaches are illustrated inTable 9. It can be seen from Table 9 that the single E2E_Transformer outperforms Table 9: MR-WER% & AV-WER% results on two dialectal datasets MGB3 and MGB5.

MGB3 MGB5

MR-WER AV-WER MR-WER AV-WERAalto [6] 29.3 37.5 - -RDI-CU [37] - - 37.6 59.4E2E_Transformer(adapt) 29.2 36.0 34.9 57.2E2E_Transformer(adapt+dev) the state-of-the-art modular HMM-DNN systems in both of the DA challenges. Wesee about % relative reduction in AV-WER while using the adaptation data only, andbetween - % relative reduction in AV-WER when using both the adaptation and thedevelopment data, which significantly outperforms the previous state-of-the-arts in DAASR. This is a new milestone for DA speech recognition. . Conclusion In this paper, we presented the first comprehensive study comparing head to headE2E ASR, modular HMM-DNN ASR and HSR on Arabic speech. We provided a com-prehensive error analysis comparing the best ASR system performance to the expertlinguist and native speaker. It has been found that the machine ASR arguably outper-forms the performance of the native speaker, however, the WER gap to reach expertlinguist performance is still on average . % on the raw Arabic transcription text. Itwas noticeable that the machine mistakes showed high similarity with the expert linguisttranscription. Additionally, we developed the first E2E transformer for the Arabic ASRand its dialects. The proposed E2E transformer significantly outperformed prior state-of-the-art on MGB2, MGB3 and MGB5 achieving a new state-of-the-art performanceat . %, . % and . % respectively. Moreover, it has been found that, in practicalASR, the segment duration has a severe impact on E2E transformer performance. Toaddress the problem of segment duration variability, a new VAD pipeline with a max-imum duration threshold was proposed. For future work, we plan to address the gapbetween human and machine in Arabic ASR and address the low resource challenge indialectal Arabic, which still shows a high error rate. References [1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case,J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, in: International conference onmachine learning, 2016, pp. 173--182.[2] A. Graves, N. Jaitly, Towards end-to-end speech recognition with recurrentneural networks, in: International conference on machine learning, 2014, pp.1764--1772.[3] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig,Achieving human parity in conversational speech recognition, arXiv preprintarXiv:1610.05256.

4] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui,B. Ramabhadran, M. Picheny, L.-L. Lim, et al., English conversational telephonespeech recognition by humans and machines, Proc. Interspeech (2017) 132--136.[5] P. Michalowski, The lives of the sumerian language, Margins of writing, originsof cultures (2006) 159--84.[6] P. Smit, S. R. Gangireddy, S. Enarvi, S. Virpioja, M. Kurimo, Aalto system forthe 2017 Arabic multi-genre broadcast challenge, in: IEEE Automatic SpeechRecognition and Understanding Workshop (ASRU), IEEE, 2017, pp. 338--345.[7] H. Mubarak, A. Abdelali, H. Sajjad, Y. Samih, K. Darwish, Highly effective arabicdiacritization using sequence to sequence modeling, in: Proceedings of the 2019Conference of the North American Chapter of the Association for ComputationalLinguistics: Human Language Technologies, Volume 1 (Long and Short Papers),2019, pp. 2390--2395.[8] A. Das, M. Hasegawa-Johnson, Cross-lingual transfer learning during supervisedtraining in low resource scenarios, in: Sixteenth Annual Conference of the Inter-national Speech Communication Association, 2015.[9] S. Khurana, A. Ali, J. Glass, Darts: Dialectal Arabic transcription system, arXivpreprint arXiv:1909.12163.[10] A. Ahmed, Y. Hifny, K. Shaalan, S. Toral, End-to-end lexicon free Arabic speechrecognition using recurrent neural networks, Computational Linguistics, SpeechAnd Image Processing For Arabic Language 4 (2018) 231.[11] T. Kudo, Subword regularization: Improving neural network translation modelswith multiple subword candidates, arXiv preprint arXiv:1804.10959.[12] S. Khurana, A. Ali, Qcri advanced transcription system (QATS) for the Arabicmulti-dialect broadcast media recognition: MGB-2 challenge, in: IEEE SpokenLanguage Technology Workshop (SLT), IEEE, 2016, pp. 292--298.

13] D. Doukhan, E. Lechapt, M. Evrard, J. Carrive, Ina’s mirex 2018 music and speechdetection system, in: Music Information Retrieval Evaluation eXchange (MIREX2018), 2018.[14] T. Giannakopoulos, pyaudioanalysis: An open-source python library for audiosignal analysis, PloS one 10 (12) (2015) e0144610.[15] G. E. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neuralnetworks for large-vocabulary speech recognition, IEEE Transactions on audio,speech, and language processing 20 (1) (2011) 30--42.[16] A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrentneural networks, in: IEEE international conference on acoustics, speech and signalprocessing (ICASSP), IEEE, 2013, pp. 6645--6649.[17] Y. Hifny, Unified acoustic modeling using deep conditional random fields, Trans-actions on Machine Learning and Artificial Intelligence 3 (2) (2015) 65--65.[18] V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecturefor efficient modeling of long temporal contexts, in: Sixteenth Annual Conferenceof the International Speech Communication Association, 2015.[19] A. Ali, S. Vogel, S. Renals, Speech recognition challenge in the wild: Arabic MGB-3, in: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU),IEEE, 2017, pp. 316--322.[20] D. Wang, X. Wang, S. Lv, An overview of end-to-end automatic speech recogni-tion, Symmetry 11 (8) (2019) 1018.[21] W. Chan, N. Jaitly, Q. Le, O. Vinyals, Listen, attend and spell: A neural networkfor large vocabulary conversational speech recognition, in: IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016,pp. 4960--4964.[22] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-basedmodels for speech recognition, Advances in neural information processing systems28 (2015) 577--585.

23] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, T. Hayashi, Hybrid ctc/attentionarchitecture for end-to-end speech recognition, IEEE Journal of Selected Topicsin Signal Processing 11 (8) (2017) 1240--1253.[24] Y. Belinkov12, A. Ali, J. Glass, Analyzing phonetic and graphemic representationsin end-to-end automatic speech recognition, Proc. Interspeech.[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,I. Polosukhin, Attention is all you need, in: Advances in neural information pro-cessing systems, 2017, pp. 5998--6008.[26] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y.Soplin, R. Yamamoto, X. Wang, et al., A comparative study on transformer vs rnnin speech applications, in: IEEE Automatic Speech Recognition and UnderstandingWorkshop (ASRU), IEEE, 2019, pp. 449--456.[27] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjan-dra, X. Zhang, F. Zhang, et al., Transformer-based acoustic modeling for hybridspeech recognition, in: IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP), IEEE, 2020, pp. 6874--6878.[28] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram,V. Liptchinsky, R. Collobert, End-to-end ASR: from supervised to semi-supervisedlearning with modern architectures, arXiv preprint arXiv:1911.08460.[29] A. Stolcke, J. Droppo, Comparing human and machine errors in conversationalspeech transcription, arXiv preprint arXiv:1708.08615.[30] A. Ali, P. Bell, J. Glass, Y. Messaoui, H. Mubarak, S. Renals, Y. Zhang, The mgb-2 challenge: Arabic multi-dialect broadcast media recognition, in: IEEE SpokenLanguage Technology Workshop (SLT), IEEE, 2016, pp. 279--284.[31] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang,S. Khudanpur, Purely sequence-trained neural networks for ASR based on lattice-free MMI, in: Proc. Interspeech, 2016, pp. 2751--2755.

32] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Han-nemann, P. Motlicek, Y. Qian, P. Schwarz, et al., The kaldi speech recognitiontoolkit, in: IEEE Automatic Speech Recognition and Understanding Workshop(ASRU), no. CONF, IEEE Signal Processing Society, 2011.[33] D. Povey, H. Hadian, P. Ghahremani, K. Li, S. Khudanpur, A time-restricted self-attention layer for ASR, in: IEEE International Conference on Acoustics, Speechand Signal Processing (ICASSP), IEEE, 2018, pp. 5874--5878.[34] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, S. Khudanpur,A pitch extraction algorithm tuned for automatic speech recognition, in: IEEEinternational conference on acoustics, speech and signal processing (ICASSP),IEEE, 2014, pp. 2494--2498.[35] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprintarXiv:1607.06450.[36] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:Proceedings of the IEEE conference on computer vision and pattern recognition,2016, pp. 770--778.[37] A. Ali, S. Shon, Y. Samih, H. Mubarak, A. Abdelali, J. Glass, S. Renals, K. Choukri,The MGB-5 challenge: Recognition and dialect identification of dialectal Arabicspeech, in: IEEE Automatic Speech Recognition and Understanding Workshop(ASRU), IEEE, 2019, pp. 1026--1033.[38] A. Ali, W. Magdy, P. Bell, S. Renais, Multi-reference wer for evaluating asr forlanguages with no orthographic rules, in: IEEE Automatic Speech Recognition andUnderstanding Workshop (ASRU), IEEE, 2015, pp. 576--580.[39] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N.-E. Y.Soplin, J. Heymann, M. Wiesner, N. Chen, et al., Espnet: End-to-end speechprocessing toolkit, Proc. Interspeech (2018) 2207--2211.

40] A. Ali, Y. Zhang, P. Cardinal, N. Dahak, S. Vogel, J. Glass, A complete kaldirecipe for building arabic speech recognition systems, in: IEEE spoken languagetechnology workshop (SLT), IEEE, 2014, pp. 525--529.[41] T. Ko, V. Peddinti, D. Povey, S. Khudanpur, Audio augmentation for speechrecognition, in: Sixteenth Annual Conference of the International Speech Com-munication Association, 2015.[42] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le,Specaugment: A simple data augmentation method for automatic speech recogni-tion, Proc. Interspeech (2019) 2613--2617.[43] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical trans-formers for long document classification, in: IEEE Automatic Speech Recognitionand Understanding Workshop (ASRU), IEEE, 2019, pp. 838--844.[44] A. Zeyer, K. Irie, R. Schlüter, H. Ney, Improved training of end-to-end attentionmodels for speech recognition, arXiv preprint arXiv:1805.03294.40] A. Ali, Y. Zhang, P. Cardinal, N. Dahak, S. Vogel, J. Glass, A complete kaldirecipe for building arabic speech recognition systems, in: IEEE spoken languagetechnology workshop (SLT), IEEE, 2014, pp. 525--529.[41] T. Ko, V. Peddinti, D. Povey, S. Khudanpur, Audio augmentation for speechrecognition, in: Sixteenth Annual Conference of the International Speech Com-munication Association, 2015.[42] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le,Specaugment: A simple data augmentation method for automatic speech recogni-tion, Proc. Interspeech (2019) 2613--2617.[43] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical trans-formers for long document classification, in: IEEE Automatic Speech Recognitionand Understanding Workshop (ASRU), IEEE, 2019, pp. 838--844.[44] A. Zeyer, K. Irie, R. Schlüter, H. Ney, Improved training of end-to-end attentionmodels for speech recognition, arXiv preprint arXiv:1805.03294.