Learning Intonation Pattern Embeddings for Arabic Dialect Identification
Aitor Arronte Alvarez, Elsayed Sabry Abdelaal Issa
Technical University of Madrid, University of Hawaii at Manoa, University of Arizona
[email protected], [email protected]
Abstract
This article presents a full end-to-end pipeline for Arabic Dialect Identification (ADI) using intonation patterns and acoustic representations. Recent approaches to language and dialect identification use linguistically-aware deep architectures that are able to capture phonetic differences among languages and dialects. In ADI tasks specifically, different combinations of linguistic features and acoustic representations have been successful with deep learning models. The approach presented in this article uses intonation patterns and hybrid residual and bidirectional LSTM networks to learn acoustic embeddings with no additional linguistic information. Experimental results show that intonation patterns of Arabic dialects provide sufficient information to achieve state-of-the-art results on the VarDial 17 ADI dataset, outperforming single-feature systems. The pipeline presented is robust to data sparsity, in contrast to other deep learning approaches that require large quantities of data. We conjecture on the importance of sufficient information as a criterion for optimality in a deep learning ADI task and, more generally, in acoustic modeling problems. Short intonation patterns, when sufficient in an information-theoretic sense, allow deep learning architectures to learn more accurate speech representations.
Index Terms: Arabic dialect identification, acoustic representation learning, intonation patterns
1. Introduction
Dialect Identification (DID) is a special case of Language Identification (LID) that presents specific challenges related to the linguistic similarity between dialects. Even though LID can be considered a well-understood problem, closely related dialects and language varieties still pose significant challenges for their automatic recognition [1, 2]. Several workshops (WANLP) and challenges (VarDial, MGB) have contributed to improving identification results by attracting researchers to this topic of study.

Arabic has a large consonantal inventory and a small vocalic one. It is spoken across 22 countries, whose dialects differ in several phonetic characteristics and inventories both from the standard and from each other. Dialectal differences arise not only from these inventories but also from different prosodic patterns. It has been attested that intonation can identify the speaker's dialect [3], and that it is significant in identifying whether a speaker's dialectal origin is Eastern or Western Arabic [4].

Previous research on LID and DID using speech data can be divided into studies that concentrate on lexical, phonotactic, and acoustic features. Traditionally, i-vector-based approaches have been considered the state of the art. Combinations of i-vectors and deep neural networks have resulted in important recognition gains in LID tasks [5, 6]. Research in Arabic Dialect Identification (ADI) shows that using purely linguistic features such as words and characters does not improve performance over acoustic features obtained with convolutional neural networks (CNN) [7, 8]. Previous prosodic and phonotactic approaches to the study of Arabic dialects have shown that intonation and rhythm significantly improve identification over purely phonotactic-based approaches [3, 9]. More recently, end-to-end schemes for ADI using CNNs and acoustic features have shown better performance than linguistic features alone, although fusion systems tend to obtain better results overall [10]. Domain-attentive end-to-end architectures without prior target information have shown robustness in ADI tasks [11], as well as adaptation to various domains. Overall, results in ADI tasks seem to indicate a strong acoustic component in the speech signal that is able to capture the regularities and differences amongst Arabic dialects.

In this article we present a full end-to-end pipeline for ADI using intonation patterns and acoustic representations that require no linguistic knowledge. The approach extracts intonation patterns from speech signals by first obtaining a contour approximation of f0. The contour C(f0) is a reduction, or simplification, of the fundamental frequency obtained from the raw audio signal. Patterns are then mined from C(f0) using a sequential pattern mining algorithm. Intonation embeddings are learned from the intonation patterns based on their acoustic features using hybrid deep convolutional and recurrent architectures. We investigate the usefulness of short intonation patterns in the automatic identification of Arabic dialects and the effect of sample size when using this type of representation.

The main contributions of this article are: 1) we present an intonation pattern embedding scheme for ADI that is able to learn and identify Arabic dialects with a higher degree of accuracy than previous approaches on the VarDial 2017 ADI dataset; 2) the method presented learns quality representations from short speech samples, which can be useful in low-resource contexts;
and 3) the method is robust to data sparsity and noise. We make our code and data publicly available at https://github.com/aitor-mir/ADI.
2. Dialectal Speech Corpus
To test our approach, we use a dialectal Arabic dataset from the VarDial 2017 ADI challenge that is publicly available and has been used in previous research. The goal of the VarDial ADI task was to identify spoken Arabic dialects, using their acoustic features to discriminate at the utterance level between five Arabic varieties, namely Modern Standard Arabic (MSA), Egyptian (EGY), Gulf (GLF), Levantine (LAV), and North African (NOR) [2]. The data comes from a multi-dialectal speech corpus created from high-quality broadcast, debate, and discussion programs from Al Jazeera [2].

Table 1: The ADI dataset: examples (Ex.) in utterances, duration (Dur.) in hours, and words in 1000s.
Dialect               Training              Development           Test
                      Ex.     Dur.  Words   Ex.    Dur.  Words    Ex.    Dur.  Words
Egyptian (EGY)        3,093   12.4  76      298    2.0   11.0     302    2.0   11.6
Gulf (GLF)            2,744   10.0  56      264    2.0   11.9     250    2.1   12.3
Levantine (LAV)       2,851   10.3  53      330    2.0   10.3     334    2.0   10.9
MSA                   2,183   10.4  69      281    2.0   13.4     262    1.9   13.0
North African (NOR)   2,954   10.5  38      351    2.0   9.9      344    2.1   10.3
Total                 13,825  53.6  292     1,524  10.0  56.5     1,492  10.1  58.1

The best submission to the VarDial 2017 ADI task achieved 76.27% accuracy with a weighted F1 score of 76.32 [12]. The winning solution combines several kernels using multiple kernel learning, with two runs: Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR), both based on a combination of three string kernels and a kernel based on i-vectors. Table 2 summarizes the results of the two runs.

Table 2: Results on the test set of KRR and KDA.
Run   Kernel   Accuracy   F1 (macro)   F1 (weighted)

The same dataset was used for the MGB-3 challenge [13], where the highest reported accuracy was 75%. Later studies on the same dataset using deep learning architectures reported accuracies of 73% for a single-feature system and 81.36% for a fusion system [10].
3. Intonation Pattern Discovery
From the data presented in Section 2, intonation embeddings are extracted following a pipeline based on the following components: a contour approximation and simplification method, and a sequential pattern mining algorithm [14]. The main objective of this pipeline is to extract statistically relevant intonation patterns from the approximated f0 curve, as shown in Figure 1. The approximation function reduces the variability of the speech signal and allows for a more compact representation from which patterns can be mined. This approach is used in Music Information Retrieval applications for obtaining music contours [15]. Instead of looking for musically-tempered steps or specific distances between frequencies, we use a univariate version of the k-means algorithm to obtain groups of frequencies from which contours can be extracted [16].

The fundamental frequency f0 is obtained using Kaldi's implementation [17], with minimum and maximum values for f0 set at 50 and 600 Hz respectively, and a window size of 256 samples. The contour approximation C(f0) is constructed by first extracting all points from f0, and then k-means is used to group the points into clusters. Once all clusters from a single speech signal are obtained, distances are estimated between cluster points, resulting in a vector of contour points in the time domain.

Table 3: Intonation patterns and duration (hours) by dialect.
Dialect   Instances   Duration
EGY       15,294      1.175
GLF       17,010      1.254
LAV       16,271      1.252
MSA       10,984      0.789
NOR       17,369      1.362

The following steps describe this approximation more formally:
• Given a set of points P in f0, a line segment L_j is bounded by all points in P if P ⊆ K_j, where K_j ∈ K is the j-th k-means cluster in the set K of all clusters in f0.
• We obtain the distance d(·) for all line segments in f0, given d(L_{j-1}, L_j), with L_{j-1} ⊆ K_{j-1} and L_j ⊆ K_j.
• The procedure outputs the approximated contour C as a vector of points in the time domain.

Once the set C of all contours is created, we apply the BIDE algorithm [14] to obtain sequential patterns, generating dictionaries of intonation patterns for all 5 dialects. We can say that a dictionary D_l = {I_{1l}, ..., I_{kl}} contains k patterns I that are closed given a minimum support function for the l-th dialect. We set the minimum pattern length to 5, which represents the number of approximated contour points in f0. Note that the contour output represents distances not between frequencies directly, but between the groups of frequencies defined by the k-means algorithm.

When this procedure is applied to the entire training data described in Section 2, the initial set of 13,825 speech instances results in a total of 76,928 intonation patterns. The mean duration of the intonation patterns is 0.273 seconds, and the median is 0.253 seconds. Table 3 shows pattern distributions by dialect. Even though the number of training instances is much larger, the total duration of the training set of intonation patterns is only 5.83 hours, 10.88% of the total training set.
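As a rough illustration of the contour approximation and pattern discovery steps, the sketch below clusters f0 values with univariate k-means, collapses consecutive frames into line segments, takes distances between consecutive cluster centers as contour points, and then counts frequent subsequences. The number of clusters, the rounding of contour values, and the contiguous-subsequence counter are illustrative assumptions only; the paper itself uses Kaldi for f0 extraction and the BIDE algorithm for closed sequential pattern mining.

```python
import numpy as np
from sklearn.cluster import KMeans

def approximate_contour(f0, n_clusters=8):
    """Approximate a voiced f0 curve as a sequence of inter-cluster distances.

    `f0` is a 1-D array of fundamental-frequency values (Hz), e.g. produced by
    Kaldi's pitch extractor; `n_clusters` is a hypothetical setting, since the
    paper does not report the k used for the univariate k-means step.
    """
    f0 = np.asarray(f0, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(f0)
    centers = km.cluster_centers_.ravel()

    # Collapse consecutive frames that share a cluster into line segments L_j.
    segments = [km.labels_[0]]
    for lab in km.labels_[1:]:
        if lab != segments[-1]:
            segments.append(lab)

    # Contour points: distances between the centers of consecutive segments.
    return [centers[b] - centers[a] for a, b in zip(segments, segments[1:])]

def frequent_subsequences(contours, length=5, min_support=10):
    """Toy stand-in for closed sequential pattern mining (the paper uses BIDE).

    Counts contiguous length-5 subsequences over rounded contour values and
    keeps those reaching a minimum support; a real implementation would mine
    closed sequential patterns with BIDE instead.
    """
    from collections import Counter
    counts = Counter()
    for c in contours:
        c = tuple(round(x, 1) for x in c)  # coarse matching of contour steps
        for i in range(len(c) - length + 1):
            counts[c[i:i + length]] += 1
    return [p for p, n in counts.items() if n >= min_support]
```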
4. Acoustic Representation Learning
From the dictionaries of patterns obtained in the procedure described in Section 3, we extract acoustic features that are used as the input to the different convolutional architectures in the identification task. We frame the ADI task as an acoustic representation learning problem, where an architecture tries to predict the label of a given acoustic pattern based on the intonation embedding it learns. An intonation embedding is then a fixed-length vector representation of a variable-length speech pattern.

Figure 1: ADI pipeline proposed with the Res-BLSTM architecture (intonation pattern extraction: raw speech signal, f0 extraction, C(f0) approximation, pattern discovery, mel-spectrogram; neural network: input size 128x112; Conv1 3x3 with strides 2x2 and 1x1, BN1; Conv2 3x3, stride 1x1, BN2; Conv3 3x3, stride 1x1, BN3; ReLU; softmax over EGY, GLF, LAV, MSA, NOR).
More formally, we can define a sequence of frame-level acoustic features as Y = {y_1, ..., y_T}, where each y_t ∈ R^d is a d-dimensional feature vector at the frame level. An acoustic embedding is then a function f(Y) that maps a variable-length segment into a fixed-dimensional space R^d. We say that f(y_i) ≈ f(y_j) if ||f(y_i) − f(y_j)|| ≤ θ, where θ is a minimum acceptable similarity threshold for the embeddings.

Previous research in DID and in speaker recognition and verification using deep learning methods has used MFCCs and filterbanks, normally concatenated with other acoustic or higher-level linguistic features, as the acoustic representation. In DID tasks, MFCC features in combination with i-vectors [8], and MFCC and filterbank features with frames of 25 ms [10, 18], achieve state-of-the-art accuracy results. In speaker recognition tasks, the x-vector approach uses filterbanks with a frame length of 25 ms over a 3-second window [19]. In a pre-training phase, results with the approach and models used in this article indicate that log mel-spectrograms with 128 mel frequency bins and 512 samples per frame achieve better performance than MFCC (8% increase) or FBANK (6.25% increase) features. This is consistent with previous results in ADI tasks [10], which point to spectrogram features working better with larger datasets, since they contain more information.

A combination of convolutional and recurrent architectures was tested in this article. We use a convolutional recurrent neural network (CRNN) model as a baseline and propose a hybrid combination of residual [20] and bidirectional LSTM [21] networks (Res-BLSTM) with shallow residual blocks.

The CRNN architecture is composed of 4 blocks that contain a convolutional layer, a batch normalization step, an exponential linear unit layer, a maxpool layer, and a dropout layer. The first block uses convolutional filters of size 3x3 with stride 2x2, also used in the maxpool layer. The remaining blocks (2-4) use small 3x3 filters with stride 1 to capture small regularities in the data. The architecture uses a recurrent GRU network to learn the sequential properties of intonation patterns.

The hybrid model uses shallow residual blocks, as in ResNets, as a front-end to process acoustic features, and a recurrent BLSTM network to learn sequential characteristics of the speech signal. We parameterize the first convolutional layer of the first residual block with 3x3 filters and stride 2x2, and the remaining convolutional layers in all blocks with 3x3 kernels and stride 1. The recurrent layer learns, from an input sequence X = {x_1, x_2, ..., x_T}, the representation that produces as output the sequence Y = {y_1, y_2, ..., y_T}, where X is a sequence of frame-level acoustic feature vectors. The BLSTM is composed of a forward LSTM that estimates the forward hidden states h_1, ..., h_T, and a backward LSTM that obtains a backward representation of the hidden states by processing the sequence in reverse order, iterating back from t = T to 1. The concatenation of the forward and backward outputs produces the embedding of a given pattern.

Both architectures use a fully connected layer with 1024 units and ReLU activations, and a softmax layer for the classification of the data instances. Both models were implemented using the PyTorch library [22]. Figure 1 summarizes the entire pipeline and the proposed Res-BLSTM architecture.
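The following PyTorch sketch illustrates one possible reading of the Res-BLSTM described above. The channel widths, the LSTM hidden size, and the use of the final time step as the pattern embedding are assumptions not specified in the paper; the softmax is folded into the cross-entropy loss during training.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Shallow residual block: two 3x3 convolutions with batch normalization."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection on the skip path when the shape changes (standard ResNet choice).
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch else
                     nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                   nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))

class ResBLSTM(nn.Module):
    """Residual CNN front-end + BLSTM + fully connected classifier, as in Figure 1."""
    def __init__(self, n_mels=128, n_classes=5, channels=(32, 64), lstm_hidden=256):
        super().__init__()
        self.frontend = nn.Sequential(
            ResBlock(1, channels[0], stride=2),          # first conv: 3x3, stride 2x2
            ResBlock(channels[0], channels[1], stride=1))
        feat_dim = channels[1] * (n_mels // 2)           # channels x reduced mel axis
        self.blstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 1024), nn.ReLU(), nn.Linear(1024, n_classes))

    def forward(self, x):                                # x: (batch, 1, n_mels, frames)
        z = self.frontend(x)                             # (batch, C, n_mels/2, frames/2)
        z = z.permute(0, 3, 1, 2).flatten(2)             # (batch, time, C * freq)
        out, _ = self.blstm(z)
        embedding = out[:, -1, :]                        # concatenated fwd/bwd states
        return self.classifier(embedding)                # logits; softmax is in the loss

logits = ResBLSTM()(torch.randn(4, 1, 128, 112))         # e.g. four 128x112 log-mel patterns
```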
5. Experiments
To test the quality of the approach presented, we perform two experiments: one using intonation patterns obtained from the test set of the VarDial 17 and MGB-3 ADI challenges, and another using very small random samples from the same test dataset (between 0.25 and 1.3 seconds in duration). The main objective is to test whether the intonation pattern approach is able to perform well with smaller datasets, and whether the learned intonation patterns generalize to contexts with noisier data. Since we were interested in settings where data may be limited, the development set was not used for the experiments, and the models were not optimized on it. This limitation was intended to show the strength of the approach presented.
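A sketch of how such short test segments could be drawn is shown below; the use of torchaudio, the 16 kHz sample rate, and the uniform sampling of segment lengths are assumptions, since the paper only states that random test samples of 0.25 to 1.3 seconds were used.

```python
import random
import torchaudio  # assumed available; the corpus sample rate is assumed to be 16 kHz

def random_short_crop(path, min_dur=0.25, max_dur=1.3, sr=16000):
    """Draw one random 0.25-1.3 s segment from a test utterance (second experiment)."""
    wav, file_sr = torchaudio.load(path)
    if file_sr != sr:
        wav = torchaudio.functional.resample(wav, file_sr, sr)
    dur = random.uniform(min_dur, max_dur)
    n = int(dur * sr)
    start = random.randint(0, max(0, wav.shape[1] - n))
    return wav[:, start:start + n]
```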
We train both architectures with batch sizes in the set {32, 40, 80, 128} and with epochs in {10, 20, 30, 40}. Two early stopping policies of 5 and 2 epochs were implemented. Decisions on parameter selection were based on maximizing accuracy and minimizing the loss function while considering a general principle of computational efficiency: the best model should be able to predict, in the minimum amount of time possible, the largest number of instances correctly. This is intended to avoid overfitting and to generalize over the largest sample space possible. ADAM optimization [23] with a learning rate of 0.001 was employed to optimize the architectures with the cross-entropy loss:

loss(y, ŷ) = −∑_i y_i log ŷ_i    (1)

where y_i is the probability of the true class, and ŷ_i is the predicted probability.

Table 4: Results on the ADI task for the intonation patterns data and the original VarDial 17 test set.
                               Intonation Patterns dataset    Original VarDial 17 test set
Model                          Train Acc.    Test Acc.        Accuracy    F1 (weighted)
(Shon et al., 2018; single)                                   73.39
(Shon et al., 2018; fusion)                                   81.36
(Ionescu et al., 2017)                                        76.27       76.32
CRNN                           81.05         75.69
Res-BLSTM
For training, 80% of the data described in Table 3 was used, while the remaining 20% was left for validation. The best combination of parameters [batch; epochs; early stop] for CRNN was [80; 20; 5], while for Res-BLSTM it was [128; 15; 2].
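A minimal training loop consistent with these settings is sketched below; the data loader interface, the use of validation accuracy for early stopping, and the checkpointing of the best state are assumptions, while the optimizer, learning rate, loss, and the 15-epoch, patience-2 Res-BLSTM setting follow the text.

```python
import copy
import torch
import torch.nn as nn

def train_adi(model, train_loader, val_loader, epochs=15, patience=2, lr=1e-3, device="cpu"):
    """Cross-entropy training with Adam and early stopping on validation accuracy.

    train_loader/val_loader are assumed to yield (log_mel_batch, dialect_label) pairs.
    """
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()           # softmax + negative log-likelihood, Eq. (1)
    best_acc, best_state, bad_epochs = -1.0, None, 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

        # Validation accuracy on the held-out 20% of the intonation-pattern data.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                pred = model(x.to(device)).argmax(dim=1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        acc = correct / total

        if acc > best_acc:
            best_acc, best_state, bad_epochs = acc, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:        # early stopping policy of 2 epochs
                break

    model.load_state_dict(best_state)
    return model, best_acc
```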
Data augmentation has played an important role in accuracy improvements in previous deep learning approaches to ADI [10, 11]. Augmentation increases the sample size of the training set by creating perturbations of the audio signal, such as time warping, or frequency and time masking, or by directly modifying the acoustic representation [24], with the goal of improving the robustness of the models and avoiding overfitting. Data augmentation can therefore be considered a regularization technique.

Instead of applying augmentation directly to the dataset, the intonation pattern discovery approach presented here reduces the training set by segmenting longer speech audio signals into smaller units (patterns) without perturbing the signal itself. The result is an increase in the number of samples while the total duration of the data is reduced to only 10.88% of the original training set. We call this approach signal reduction by segmentation, since we extract minimal patterns from longer audio signals that are statistically relevant and richer in information content, while less valuable information is disregarded.
6. Results and Discussion
Results, shown in Table 4, underline the main finding of the proposed approach: intonation pattern embeddings provide sufficient information to achieve state-of-the-art results on the VarDial 17 ADI dataset with minimal sample length.

Experimental results are divided by dataset type and compared with state-of-the-art results on the original VarDial 17 test set, using the same metrics as previous research [10, 12]. The baseline (CRNN) and the proposed Res-BLSTM model were trained only on the dataset shown in Table 3. Both models were tested on the intonation patterns dataset, and our best model (Res-BLSTM) also on the original test set used for the VarDial 17 and MGB-3 ADI challenges, but with much shorter speech samples, as described in Section 5. This restrictive setting allows us to test whether small acoustic intonation embeddings learned by deep architectures are able to generalize to a broader class of problems where data is not only sparse but also noisy.

Both models show no sign of overfitting, as indicated by the training and test accuracy measures on the intonation patterns dataset, with a relative decrease of 4.7%. Both models achieve significant accuracy when compared with state-of-the-art results. This could be because the dataset has been reduced to its most fundamental content, and information that is irrelevant for the prediction task has been discarded. In other words, this can be seen as a reduction of the data complexity and an increase of the sample size by algorithmic means.

Surprisingly, when the Res-BLSTM model is tested on the original test set with small, randomly picked samples, the model outperforms previous single-feature approaches and comes very close to the fusion system of Shon et al. [10]. This result suggests that the model learns speech patterns that are small enough to be prevalent in Arabic dialectal speech data. The result is particularly interesting considering that the total duration of the training data used is close to that of the test set. Also, it is well known from the VarDial and MGB-3 reports [2, 13] that the training and test domains were intentionally mismatched to challenge participants, which underlines the robustness of the full pipeline presented.

The signal segmentation approach taken in this article also presents an interesting case of deep learning optimization for speech signals. The data extracted from the original speech corpus significantly reduces the training time of the Res-BLSTM architecture: to achieve the results shown in Table 4, only 15 epochs were needed, with an early stopping policy of 2. It should be noted that recurrent networks in combination with residual blocks with small filters are able to capture local-level features in the convolutional layers, while processing sequential patterns in the recurrent (BLSTM) layers.
7. Conclusion
An intonation pattern embedding pipeline for the automatic identification of Arabic dialects was presented. The proposed pipeline extracts small intonation patterns that contain sufficient information to achieve state-of-the-art results on the VarDial 17 dataset. The pipeline requires minimal information to learn high-quality acoustic embeddings, as the ADI results show, and reduces the learning time by reducing the signal to be learned, and consequently the sample space. The combination of residual blocks and BLSTM networks provides a compact model to learn more accurate acoustic representations of the speech signal. The success of the approach also stems from the rich acoustic information that Arabic dialects carry in their intonation patterns, which is sufficient to identify them automatically from short samples. This approach can have many applications in low-resource speech identification systems, or in contexts where the signal has been significantly degraded.
8. Acknowledgments
The technical support and advanced computing resources from University of Hawaii Information Technology Services-Cyberinfrastructure are gratefully acknowledged.

9. References

[1] J. Tiedemann and N. Ljubešić, "Efficient discrimination between closely related languages," in Proceedings of COLING, 2012, pp. 2619–2634.
[2] M. Zampieri, S. Malmasi, N. Ljubešić, P. Nakov, A. Ali, J. Tiedemann, Y. Scherrer, and N. Aepli, "Findings of the VarDial evaluation campaign 2017," in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2017, pp. 1–15.
[3] F. Biadsy and J. Hirschberg, "Using prosody and phonotactics in Arabic dialect identification," in Tenth Annual Conference of the International Speech Communication Association, 2009.
[4] M. Barkat, J. Ohala, and F. Pellegrino, "Prosody as a distinctive feature for the discrimination of Arabic dialects," in Sixth European Conference on Speech Communication and Technology, 1999.
[5] F. Richardson, D. Reynolds, and N. Dehak, "A unified deep neural network for speaker and language recognition," arXiv preprint arXiv:1504.00923, 2015.
[6] P. Cardinal, N. Dehak, Y. Zhang, and J. Glass, "Speaker adaptation using the i-vector technique for bottleneck features," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[7] A. Ali, N. Dehak, P. Cardinal, S. Khurana, S. H. Yella, J. Glass, P. Bell, and S. Renals, "Automatic dialect detection in Arabic broadcast speech," in Interspeech, San Francisco, CA, USA, 2016, pp. 2934–2938.
[8] M. Najafian, S. Khurana, S. Shan, A. Ali, and J. Glass, "Exploiting convolutional neural networks for phonotactic based dialect identification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5174–5178.
[9] F. Biadsy, J. Hirschberg, and N. Habash, "Spoken Arabic dialect identification using phonotactic modeling," in Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages. Association for Computational Linguistics, 2009, pp. 53–61.
[10] S. Shon, A. Ali, and J. Glass, "Convolutional neural networks and language embeddings for end-to-end dialect recognition," arXiv preprint arXiv:1803.04567, 2018.
[11] ——, "Domain attentive fusion for end-to-end dialect identification with unknown target domain," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5951–5955.
[12] R. T. Ionescu and A. Butnaru, "Learning to identify Arabic and German dialects using multiple kernels," in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2017, pp. 200–209.
[13] A. Ali, S. Vogel, and S. Renals, "Speech recognition challenge in the wild: Arabic MGB-3," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 316–322.
[14] J. Wang and J. Han, "BIDE: Efficient mining of frequent closed sequences," in Proceedings of the 20th International Conference on Data Engineering. IEEE, 2004, pp. 79–90.
[15] N. Kroher and J.-M. Díaz-Báñez, "Audio-based melody categorization: Exploring signal representations and evaluation strategies," Computer Music Journal, vol. 41, no. 4, pp. 64–82, 2018.
[16] D. Qiu and A. C. Tamhane, "A comparative study of the k-means algorithm and the normal mixture model for clustering: Univariate case," Journal of Statistical Planning and Inference, vol. 137, no. 11, pp. 3722–3740, 2007.
[17] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 2494–2498.
[18] S. Shon, W.-N. Hsu, and J. Glass, "Unsupervised representation learning of speech for dialect identification," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 105–111.
[19] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] A. Graves, S. Fernández, and J. Schmidhuber, "Bidirectional LSTM networks for improved phoneme classification and recognition," in International Conference on Artificial Neural Networks. Springer, 2005, pp. 799–804.
[22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[24] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.