End-to-End Language Identification using Multi-Head Self-Attention and 1D Convolutional Neural Networks
Krishna D N, Ankita Patil
FreshWorks Inc., HashCut Inc.
ABSTRACT
In this work, we propose a new approach for language identification using multi-head self-attention combined with raw-waveform-based 1D convolutional neural networks for Indian languages. Our approach uses an encoder, multi-head self-attention, and a statistics pooling layer. The encoder learns features directly from raw waveforms using 1D convolution kernels and an LSTM layer. The LSTM layer captures temporal information between the features extracted by the 1D convolutional layer. The multi-head self-attention layer takes the outputs of the LSTM layer and applies self-attention with M different heads. This process helps the model give more weight to the more useful features and less weight to the less relevant ones. Finally, the frame-level features are combined using a statistics pooling layer to extract an utterance-level feature vector for label prediction. We conduct all our experiments on 373 hrs of audio data covering eight different Indian languages. Our experiments show that our approach outperforms the baseline model by an absolute 3.69% improvement in F1-score and achieves the best F1-score of 95.90%. Our experiments also show that the raw-waveform models give a 1.7% improvement in performance compared to models built using handcrafted features.
Index Terms — multi-head self-attention, language identification, 1D-CNNs
1. INTRODUCTION
Recent developments in the area of deep neural networks have brought tremendous improvements to speech systems, including speech recognition [1, 2, 3], emotion recognition [20], and speaker identification [13]. Previously, the language identification field was dominated by i-vectors [5], which traditionally use Gaussian mixture models. Even today, i-vectors are considered among the best models in the low-data regime. However, recent developments in the field of deep learning show that deep neural networks are one of the dominant approaches for language identification. Previously, W. Geng et al. [4] proposed to use deep features extracted from a neural network trained for speech recognition and showed that deep neural models are capable of obtaining excellent performance over classical systems [7, 9, 10, 15, 19]. Recently, time-delay neural networks (TDNNs) have shown excellent performance for speech recognition tasks [21]. The x-vector system [13], built for speaker identification, has been used for language identification [14] and is shown to be one of the best methods for the task. The recent trend in sequence-to-sequence mapping problems [18] involves the attention mechanism [16]. The attention mechanism is one of the most widely used techniques for sequence mapping problems [17], and today's state-of-the-art speech recognition models are built using attention-based models. These models process sequential inputs by iteratively selecting relevant features using the attention technique. Due to the efficiency of the attention technique for sequence mapping problems, A. Vaswani et al. [22] proposed to use stacks of attention layers alone and showed remarkable results in machine translation. Motivated by [22], this paper proposes to use multi-head self-attention in combination with 1D convolutional neural network front-end processing for language identification. Our proposed model takes the raw waveform directly as input and extracts features that are useful for the LID task. The model consists of a sequence of residual blocks [23] of 1D convolutional layers to extract features from the raw audio. Since the convolutional layers alone cannot capture temporal information, we use an LSTM layer on top of the convolutional network to capture time-varying information from the input. The LSTM output feature sequence is then fed into a multi-head self-attention block consisting of multiple attention heads, which select important features from different parts of the input feature sequence using attention weighting. Finally, an utterance-level feature vector is generated using a statistics pooling layer, and we classify this utterance-level feature to predict the class label.

The organization of the paper is as follows. In Section 2, we explain our proposed approach in detail. In Section 3, we give a detailed analysis of the dataset collection and curation process, and in Section 4, we explain our experimental setup in detail. Finally, in Section 5, we describe our results.
2. PROPOSED METHOD
In this section, we explain our proposed approach in detail. The overall model architecture is shown in Figure 1. Our model consists of three main stages: 1) an encoder, which includes multiple 1D convolutional layers with residual connections and an LSTM layer, 2) a multi-head self-attention layer, which selects important features for language identification using attention weighting, and 3) a statistics pooling layer, which produces an utterance-level feature vector for classification. The model takes the raw audio waveform as input and applies initial 1D convolution operations along with 1D max-pooling, as shown in Figure 1. The initial convolutional layer features then go through a series of three 1D residual blocks followed by an LSTM layer. We then use multi-head self-attention to extract relevant features from different parts of the input. The statistics pooling layer generates a single utterance-level feature vector containing language-discriminative properties. This utterance-level feature vector is fed into a projection layer followed by a softmax layer to predict the class label. We explain the details of each of these blocks in the following subsections.

Fig. 1: Proposed model architecture
2.1. Encoder

The encoder of our model consists of a series of three residual blocks combined with a single LSTM layer, as shown in Figure 1. The encoder takes a raw waveform signal and applies an initial 1D convolution consisting of 64 filters of size 1x7, followed by a max-pooling operation. The max-pooling is applied with a kernel size of 1x3 and stride 2. After the initial convolution and max-pooling, we send the output through a sequence of residual blocks; the details of a single residual block are shown in Figure 1. Each residual block operates with 1D convolution kernels of size 1x3.
Residual Block-1 consists of 2*64 convolution kernels of size 1x3, Residual Block-2 consists of 2*128 convolution kernels of size 1x3, and Residual Block-3 contains 2*256 convolution kernels of size 1x3. The outputs of Residual Block-2 and Residual Block-3 go through a 1D max-pooling operation. Since LSTMs are known to capture long-range dependencies between frames in speech, we use an LSTM at the end of the final residual block to capture temporal information: the output of Residual Block-3, after the max-pooling operation, is sent to a single unidirectional LSTM layer with a hidden size of 256.

Let X_N = [x_1, x_2, ..., x_n, ..., x_N] be the raw audio sequence with N samples. Then

H_A = Encoder(X_N)    (1)

where Encoder is the mapping function consisting of the initial 1D convolutional layer and max-pooling operation, the sequence of three residual blocks (Residual Block-1, Residual Block-2, and Residual Block-3), and the LSTM layer. After this operation, we obtain a feature sequence H_A = [h_1, h_2, ..., h_T] of length T (T << N). H_A can be viewed as a feature matrix whose x-axis is the time dimension and whose y-axis is the feature dimension; the feature dimension in our case is 256, matching the hidden size of the LSTM.
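For concreteness, the sketch below shows one way the encoder described above could be implemented in PyTorch (the toolkit used in Section 4). The layer sizes follow the text (64 initial filters of width 7, 1x3 max-pooling with stride 2, residual blocks with 64/128/256 filters of width 3, and a 256-unit unidirectional LSTM), but the padding choices and the exact residual wiring are not specified in the paper and are our assumptions.

```python
import torch
import torch.nn as nn


class ResidualBlock1D(nn.Module):
    """Two 1D convolutions with a skip connection (wiring assumed)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.relu = nn.ReLU()
        # 1x1 convolution matches channel counts on the skip path when needed.
        self.shortcut = (nn.Conv1d(in_channels, out_channels, 1)
                         if in_channels != out_channels else nn.Identity())

    def forward(self, x):                      # x: (batch, channels, time)
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + self.shortcut(x))


class RawWaveformEncoder(nn.Module):
    """Initial conv + max-pool, three residual blocks, and a unidirectional LSTM."""

    def __init__(self, lstm_hidden=256):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=7, padding=3),    # 64 filters of size 1x7
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3, stride=2),          # 1x3 max-pooling, stride 2
        )
        self.block1 = ResidualBlock1D(64, 64)                # 2*64 kernels of size 1x3
        self.block2 = ResidualBlock1D(64, 128)                # 2*128 kernels of size 1x3
        self.block3 = ResidualBlock1D(128, 256)                # 2*256 kernels of size 1x3
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2)    # after blocks 2 and 3
        self.lstm = nn.LSTM(input_size=256, hidden_size=lstm_hidden, batch_first=True)

    def forward(self, waveform):               # waveform: (batch, num_samples)
        x = waveform.unsqueeze(1)              # (batch, 1, N)
        x = self.front(x)
        x = self.block1(x)
        x = self.pool(self.block2(x))
        x = self.pool(self.block3(x))
        x = x.transpose(1, 2)                  # (batch, T, 256) for the LSTM
        h, _ = self.lstm(x)                    # H_A = Encoder(X_N), eq. (1)
        return h


if __name__ == "__main__":
    enc = RawWaveformEncoder()
    dummy = torch.randn(2, 64000)              # two 4-second clips at 16 kHz
    print(enc(dummy).shape)                    # (2, T, 256) with T << 64000
```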
2.2. Multi-head self-attention

In this section, we describe the multi-head self-attention block in detail. It consists of three different linear blocks: one for the query, one for the key, and one for the value. Each linear block consists of M independent linear layers, where M is the number of heads. The multi-head attention block takes the features H_A = [h_1, h_2, ..., h_T] from the LSTM and applies linear transformations to create Q_i, K_i, and V_i using the i-th linear layers, where i = 1, ..., M and M is the total number of attention heads. The Q_i, K_i, and V_i are fed into a scaled dot-product attention layer. The scaled dot-product attention A_i for the i-th head is defined as

A_i = Softmax(Q_i K_i^T / sqrt(d_q)) V_i    (2)

where d_q is the dimension of the query vectors. We combine the attention outputs from all heads using simple concatenation and feed the result into the feed-forward layer:

A = Concat(A_1, A_2, ..., A_i, ..., A_M) W    (3)

where A_i is a d_q x T dimensional matrix. Since the Concat operation is applied along the feature dimension of all the matrices, the final attention matrix A from the multi-head attention block has dimension M*d_q x T.

The multi-head attention layer helps in finding the features that are most relevant for language identification. The scaled dot-product attention achieves this by giving more weight to the more relevant features and less weight to the less relevant ones. Due to the presence of multiple heads in the attention layer, this process selects features from different parts of the input and helps in obtaining better language classification performance.

2.3. Statistics pooling

The idea of the statistics pooling layer is similar to max-pooling. In the case of statistics pooling, we compute the mean and standard deviation of the frame-level features. The mean and standard deviation are concatenated to create the utterance-level feature vector, as described in the equation below. Let A = [a_1, a_2, ..., a_T] be the output from the multi-head attention block. Then

P = Concat(mean(A), std(A))    (4)

where a_i is a feature vector of dimension M*d_q and P is the final pooled feature vector from the statistics pooling layer. Since the dimension of the utterance-level feature vector P becomes large when M is large, we add a projection layer on top of the statistics pooling layer (Figure 1) in order to reduce the dimension of P. We take the output of this projection layer to visualize the utterance-level embeddings for different languages.
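A minimal PyTorch sketch of the multi-head self-attention and statistics pooling stages follows. It takes the encoder output H_A of shape (batch, T, 256); the values d_q = 64 and M = 8 heads are purely illustrative assumptions, since the paper does not report them, and the final projection/softmax head shown in the usage example is likewise only a sketch.

```python
import math
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention with M independent heads (eqs. 2-3)."""

    def __init__(self, feat_dim=256, num_heads=8, d_q=64):
        super().__init__()
        self.d_q = d_q
        # One query/key/value linear layer per head, as described in the text.
        self.query = nn.ModuleList([nn.Linear(feat_dim, d_q) for _ in range(num_heads)])
        self.key = nn.ModuleList([nn.Linear(feat_dim, d_q) for _ in range(num_heads)])
        self.value = nn.ModuleList([nn.Linear(feat_dim, d_q) for _ in range(num_heads)])
        self.out = nn.Linear(num_heads * d_q, num_heads * d_q)   # the W in eq. (3)

    def forward(self, h):                      # h: (batch, T, feat_dim)
        heads = []
        for q_lin, k_lin, v_lin in zip(self.query, self.key, self.value):
            q, k, v = q_lin(h), k_lin(h), v_lin(h)               # (batch, T, d_q)
            scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(self.d_q)
            a = torch.bmm(torch.softmax(scores, dim=-1), v)      # A_i, eq. (2)
            heads.append(a)
        return self.out(torch.cat(heads, dim=-1))                # A, eq. (3)


class StatisticsPooling(nn.Module):
    """Concatenate the mean and standard deviation over time (eq. 4)."""

    def forward(self, a):                      # a: (batch, T, M*d_q)
        return torch.cat([a.mean(dim=1), a.std(dim=1)], dim=-1)


if __name__ == "__main__":
    h = torch.randn(2, 200, 256)               # stand-in for the encoder output H_A
    att = MultiHeadSelfAttention()
    pool = StatisticsPooling()
    utt = pool(att(h))                         # utterance-level vector P
    print(utt.shape)                           # (2, 2 * 8 * 64) = (2, 1024)
    proj = nn.Linear(utt.shape[-1], 256)       # projection layer to reduce the dim of P
    logits = nn.Linear(256, 8)(proj(utt))      # classifier over the 8 languages
    print(logits.shape)                        # (2, 8)
```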
3. DATASET
In this section, we describe our data collection process. We collect and curate videos from YouTube using manual labeling. We ask annotators to look for videos in eight languages on YouTube and to manually verify that each video does not contain multiple Indian languages. Most of these videos contain background noise or music. Sometimes a video may contain a mix of English and other Indian languages due to code-mixing. We use an in-house speech vs. non-speech detection model to detect only the speech segments; we clip the speech segments from every video and discard the non-speech parts. After preprocessing, our total dataset contains 373.27 hrs of audio data for 8 Indian languages: Hindi, English, Kannada, Tamil, Telugu, Malayalam, Gujarati, and Marathi. These languages are officially spoken in the northern and southern regions of India. We split the dataset into training and evaluation parts, and the statistics of these splits are shown in Table 1.
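As an illustration of the clipping step, the snippet below cuts speech regions out of a downloaded audio file given (start, end) timestamps. The in-house speech/non-speech detector is not public, so `speech_segments` and the file names here are hypothetical stand-ins.

```python
import torchaudio

# Hypothetical output of a speech/non-speech detector: (start_sec, end_sec) pairs.
speech_segments = [(3.2, 11.7), (15.0, 42.5)]

waveform, sr = torchaudio.load("video_audio.wav")        # assumed 16 kHz mono audio

for idx, (start, end) in enumerate(speech_segments):
    clip = waveform[:, int(start * sr):int(end * sr)]    # keep speech, drop the rest
    torchaudio.save(f"clip_{idx:04d}.wav", clip, sr)
```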
4. EXPERIMENTS
We conduct all our experiments on an in-house dataset collected for 8 Indian languages. Our proposed model consists of an encoder, a multi-head self-attention block, and a statistics pooling layer followed by a projection layer and a softmax layer. We randomly select a 4 sec audio segment from each audio file during training. Since our data has a sampling rate of 16 kHz, we get 64,000 samples from every file during training, and we feed a 1x64000-dimensional signal into our encoder.

Table 1: Train and evaluation splits for different languages (duration in hrs)

Dataset      Train duration   Train files   Eval duration   Eval files
English      44.21            15963         11.28           4074
Kannada      35.95            12988         8.90            3216
Gujarati     30.24            10933         7.312           2642
Hindi        38.79            14004         9.72            3510
Malayalam    34.98            12636         8.94            3228
Tamil        63.15            22774         15.95           5753
Telugu       35.07            12666         8.54            3087
Marathi      16.23            5873          4.013           1449

We conduct multiple experiments to study the effectiveness of the multi-head self-attention module for language identification. We first train a standalone 1D convolutional neural network model as the first baseline model; we refer to this system as ResNet. We also train a 1D convolutional neural network in combination with a unidirectional LSTM as a second baseline, and we refer to it as ResNet-LSTM. Finally, our proposed model is built using a 1D convolutional neural network, an LSTM, and multi-head self-attention; we refer to it as ResNet-LSTM-MHA-Raw.

We also conduct experiments to study the effect of the audio duration used during training. We train three different models, ResNet-LSTM-MHA-2Sec, ResNet-LSTM-MHA-3Sec, and ResNet-LSTM-MHA-4Sec, which take 2 sec, 3 sec, and 4 sec of audio, respectively, during training. Our final experiments study the effectiveness of raw-waveform methods compared to handcrafted features. We set up an experiment to train the model using MFCC features as input instead of the raw waveform. We extract 13-dimensional MFCC features (with delta and double-delta) for every 25 ms frame using a 10 ms frameshift for this experiment. The MFCC-based model is referred to as ResNet-LSTM-MHA-MFCC, while the raw-waveform-based model is referred to as ResNet-LSTM-MHA-Raw in this paper. We use the Adam [24] optimizer to train all our models with a learning rate of 0.001 for up to 25 epochs, with a batch size of 64 during training. We train all our models using the PyTorch toolkit (https://pytorch.org/).
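The input pipeline described above can be sketched as follows: a random 4-second (64,000-sample) crop for the raw-waveform models, and 13-dimensional MFCCs with deltas and double-deltas (25 ms windows, 10 ms shift) for the ResNet-LSTM-MHA-MFCC variant. The torchaudio-based extraction and the padding of short files are our own illustration, not the authors' exact code.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000
CROP = 4 * SAMPLE_RATE                       # 4 seconds -> 64,000 samples


def random_crop(waveform: torch.Tensor) -> torch.Tensor:
    """Randomly select a 4-second window from a (1, num_samples) waveform."""
    num_samples = waveform.shape[-1]
    if num_samples <= CROP:                  # pad short files (assumption)
        return torch.nn.functional.pad(waveform, (0, CROP - num_samples))
    start = torch.randint(0, num_samples - CROP, (1,)).item()
    return waveform[:, start:start + CROP]


# 13 MFCCs per 25 ms frame with a 10 ms shift (400 / 160 samples at 16 kHz).
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "win_length": 400, "hop_length": 160, "n_mels": 23},
)


def mfcc_with_deltas(waveform: torch.Tensor) -> torch.Tensor:
    """Return a 39-dimensional MFCC + delta + double-delta feature sequence."""
    feats = mfcc(waveform)                                   # (1, 13, frames)
    d1 = torchaudio.functional.compute_deltas(feats)
    d2 = torchaudio.functional.compute_deltas(d1)
    return torch.cat([feats, d1, d2], dim=1)                 # (1, 39, frames)


if __name__ == "__main__":
    wav, _ = torchaudio.load("clip_0000.wav")                # any 16 kHz mono clip
    x = random_crop(wav)
    print(x.shape, mfcc_with_deltas(x).shape)
    # Training uses Adam, lr = 0.001, batch size 64, up to 25 epochs, e.g.:
    # torch.optim.Adam(model.parameters(), lr=1e-3)
```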
5. RESULTS
In this section, we describe the evaluation of the different models and their performance. We train two baseline models, ResNet and ResNet-LSTM. The first baseline model, ResNet, consists of a sequence of 3 residual blocks made up of 1D convolution kernels; it can be thought of as a ResNet [23] with the average pooling layer replaced by a statistics pooling layer. This model takes 4 sec of raw audio data and predicts the language label. The ResNet model has an F1-score of 88.67% on the test dataset. The second baseline model, ResNet-LSTM, uses the same setting as baseline-1 but adds an extra LSTM layer on top of the CNN in order to capture long-range temporal information. The performance of this model is 92.21%, as shown in Table 2. We compare our baseline models with our proposed model, ResNet-LSTM-MHA-RAW, which contains a multi-head attention layer and operates on the raw waveform as input. Table 2 shows that our model obtains a 3.69% absolute improvement in F1-score compared to the second baseline model. We also create a model that takes MFCC features as input instead of raw audio, referred to as ResNet-LSTM-MHA-MFCC. We show that raw-waveform-based models obtain a 1.7% improvement over handcrafted-feature-based models.
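For reference, a minimal sketch of how the reported F1-scores could be computed from model predictions is shown below. The paper does not state whether the score is macro- or weighted-averaged over the eight languages, so macro averaging is assumed here, and the labels are placeholders.

```python
from sklearn.metrics import f1_score

# y_true / y_pred: integer language labels (0-7) for every evaluation utterance.
y_true = [0, 1, 2, 2, 3]      # placeholder labels for illustration
y_pred = [0, 1, 2, 3, 3]

print(f1_score(y_true, y_pred, average="macro"))
```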
Table 2: Comparison of different architectures for language identification. Bold indicates the best performance

System                          F1-Score
ResNet (baseline-1)             89.67%
ResNet-LSTM (baseline-2)        92.21%
ResNet-LSTM-MHA-MFCC (ours)     94.22%
ResNet-LSTM-MHA-RAW (ours)      95.90%

In order to study the effect of the input length during training, we conduct an experiment to train the model using 2 sec, 3 sec, and 4 sec audio data, and we refer to these models as ResNet-LSTM-MHA-2Sec, ResNet-LSTM-MHA-3Sec, and ResNet-LSTM-MHA-4Sec, respectively. The results of these experiments are shown in Table 3. They show that longer audio data tends to improve the F1-score on the test data due to the longer context of the signals.
Table 3: Comparison of models trained with different segment durations. Bold indicates the best performance

System                    F1-Score
ResNet-LSTM-MHA-2Sec
ResNet-LSTM-MHA-3Sec
ResNet-LSTM-MHA-4Sec

Finally, we visualize the utterance-level embeddings extracted from the projection layer for all the languages. We extract embeddings for 6,500 randomly selected test utterances for t-SNE visualization and reduce the dimension of each embedding to 2 using the t-SNE technique. The t-SNE plot of the 2-D embeddings is shown in Figure 2. It can be clearly seen that the proposed model learns very good language-discriminative features at the segment level.
Fig. 2: t-SNE plot of utterance-level embeddings
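A sketch of the visualization step: project the utterance-level embeddings taken from the projection layer down to two dimensions with t-SNE and color the points by language. The embedding dimension, perplexity, and other settings are not reported in the paper, so the values below are assumptions and the arrays are stand-ins for real data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (6500, D) array from the projection layer; labels: language index per row.
embeddings = np.random.randn(6500, 256)         # stand-in for real embeddings
labels = np.random.randint(0, 8, size=6500)     # stand-in for real language labels

points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

languages = ["Hindi", "English", "Kannada", "Tamil",
             "Telugu", "Malayalam", "Gujarati", "Marathi"]
for idx, name in enumerate(languages):
    mask = labels == idx
    plt.scatter(points[mask, 0], points[mask, 1], s=2, label=name)
plt.legend(markerscale=4)
plt.title("t-SNE of utterance-level embeddings")
plt.savefig("tsne_embeddings.png", dpi=150)
```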
6. CONCLUSION
In this work, we propose a new architecture for language identification using multi-head self-attention and 1D convolutional neural networks. We propose to use the raw waveform directly as input instead of handcrafted features and to learn language-discriminative features using 1D convolution operations. Our model uses multi-head self-attention to learn and select the most important features for the language identification task. We finally use a statistics pooling approach to extract an utterance-level language representation from the frame-level features. We collect and curate 373 hrs of audio data for 8 Indian languages: Hindi, English, Kannada, Tamil, Telugu, Malayalam, Gujarati, and Marathi. Our experiments show that multi-head self-attention in combination with a raw-waveform-based 1D convolutional neural network model obtains the best performance on our evaluation dataset. We extract the utterance-level embeddings for our evaluation data and visualize the clustering effect using t-SNE. The visualization clearly shows that the model learns very good language-discriminative features.
7. REFERENCES

[1] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition," in ICASSP, 2016.

[2] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.

[3] R. Collobert, C. Puhrsch, and G. Synnaeve, "Wav2Letter: an end-to-end ConvNet-based speech recognition system," CoRR, vol. abs/1609.03193, 2016.

[4] W. Geng, J. Li, S. Zhang, X. Cai, and B. Xu, "Multilingual tandem bottleneck feature for language identification," in Sixteenth Annual Conference of the International Speech Communication Association.

[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[6] Y. Song, B. Jiang, Y. Bao, S. Wei, and L.-R. Dai, "I-vector representation based on bottleneck features for language identification," Electronics Letters, vol. 49, no. 24, pp. 1569–1570, 2013.

[7] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, "Automatic language identification using deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 5337–5341.

[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.

[9] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, "Automatic language identification using long short-term memory recurrent neural networks," in INTERSPEECH, 2014, pp. 2155–2159.

[10] S. Ganapathy, K. J. Han, S. Thomas, M. K. Omar, M. Van Segbroeck, and S. S. Narayanan, "Robust language identification using convolutional neural network features," in ISCA INTERSPEECH.

[11] INTERSPEECH, 2015.

[12] D. Garcia-Romero and A. McCree, "Stacked long-term TDNN for spoken language recognition," in Proc. Interspeech 2016, pp. 3226–3230.

[13] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[14] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, "Spoken language recognition using x-vectors," in Odyssey: The Speaker and Language Recognition Workshop, 2018.

[15] C. Bartz, T. Herold, H. Yang, and C. Meinel, "Language identification using deep convolutional recurrent neural networks," CoRR, vol. abs/1708.04811, 2017.

[16] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[17] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: first results," arXiv preprint arXiv:1412.1602, 2014.

[18] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[19] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. of SLT, 2018.

[20] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, "Emotion identification from raw speech signals using DNNs," in Proc. Interspeech.

[21] Proceedings of INTERSPEECH, 2015.

[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Computer Vision and Pattern Recognition (CVPR), 2016.

[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[25] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. Interspeech, 2018.