Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals
Shauna Revay and Matthew Teschke
Novetta
May 14, 2019
Abstract

Recently, voice assistants have become a staple in the flagship products of many big technology companies such as Google, Apple, Amazon, and Microsoft. One challenge for voice assistant products is that the language that a speaker is using needs to be preset. To improve the user experience on this and similar tasks, such as automated speech detection or speech-to-text transcription, automatic language detection is a necessary first step.

The technique described in this paper, language identification for audio spectrograms (LIFAS), uses spectrograms of raw audio signals as input to a convolutional neural network (CNN) for language identification. One benefit of this process is that it requires minimal pre-processing: only the raw audio signals are input to the neural network, with the spectrograms generated as each batch is passed to the network during training. Another benefit is that the technique can utilize short audio segments (approximately 4 seconds) for effective classification, which is necessary for voice assistants that need to identify the language as soon as a speaker begins to talk.

LIFAS binary language classification had an accuracy of 97%, and multi-class classification with six languages had an accuracy of 89%.
Introduction

Finding a dataset of audio clips in various languages sufficiently large for training a network was an initial challenge for this task. Many datasets of this type are not open sourced [7]. VoxForge [15], an open-source corpus that consists of user-submitted audio clips in various languages, is the source of the data used in this paper.

Previous work in this area used deep networks as feature extractors, but did not use the networks themselves to classify the languages [4, 10]. LIFAS removes any feature extraction performed outside of the network. The network is fed a raw audio signal, and the spectrogram of the data is passed to the neural network during training. The last layer of the network outputs a vector of probabilities with one prediction per language. Thus, the whole process from raw audio signal to prediction of language is performed automatically by the neural network.

In [2], a CNN was combined with a long short-term memory (LSTM) network to classify language using spectrograms generated from audio. The network presented in [2] classified 4 languages using 10-second audio clips for training [12], while LIFAS achieves similar performance for 6 languages using 4-second audio clips. This demonstrates the robustness of the architecture and its improvement upon earlier techniques.

CNNs have been shown to give state-of-the-art results for image classification and a variety of other tasks. As neural networks using backpropagation were constructed to be deeper, with more layers, they ran into the problem of vanishing gradients [14]. A network updates its weights based on the partial derivatives of the error function from the previous layers. Often these derivatives become very small, so the weight updates become insignificant, which can lead to a degradation in performance.

One way to mitigate this problem is the use of residual neural networks (ResNets) [5]. ResNets utilize skip connections, which connect two non-adjacent layers. ResNets have shown state-of-the-art performance on image recognition tasks, which makes them a natural choice of network architecture for this task [6].
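To make the skip-connection idea concrete, the following is a minimal sketch of a residual block in PyTorch; it is illustrative only and simpler than the bottleneck blocks actually used in ResNet50.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: adding the block input gives gradients a short
        # path to earlier layers, mitigating the vanishing gradient problem.
        return self.relu(out + x)
```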
Spectrograms and Data

A spectrogram is an image representation of the frequencies present in a signal over time. The frequency spectrum of a signal can be generated from a time series signal using a Fourier transform. In practice, the Fast Fourier Transform (FFT) is applied to a section of the time series data to calculate the magnitude of the frequency spectrum for a fixed moment in time; this corresponds to one line in the spectrogram. The time series data is then windowed, usually in overlapping chunks, and the FFT data is strung together to form the spectrogram image, which allows us to see how the frequencies change over time.

Since we were generating spectrograms from audio data, the data was converted to the mel scale, generating "melspectrograms". These images will be referred to simply as "spectrograms" for the duration of this paper. The conversion from f hertz to m mels that we use is given by

    m = 2595 log10(1 + f / 700).

Audio data was collected from VoxForge [15]. Each audio signal was sampled at a rate of 16 kHz and cut down to be 60,000 samples long. In this context, a sample refers to a single data point in the audio clip, so this equates to 3.75 seconds of audio. The audio files were saved as WAV files and loaded into Python using the librosa library with a sample rate of 16 kHz. Each audio file of 60,000 samples was saved separately and is referred to as a clip. The training set consisted of 5,000 clips per language, and the validation set consisted of 2,000 clips per language.

Audio clips were gathered in English, Spanish, French, German, Russian, and Italian. Speakers had various accents and were of different genders. The same speaker may appear in more than one clip, but there was no cross-contamination between the training and validation sets.

Spectrograms were generated using parameters similar to the process discussed in [1], which used a frequency spectrum of 20 Hz to 8,000 Hz and 40 frequency bins. Each FFT was computed on a window of 1,024 samples. No other pre-processing was done on the audio files. Spectrograms were generated on the fly on a per-batch basis while the network was running (i.e., spectrograms were not saved to disk).
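The pipeline above can be sketched with librosa using the stated parameters (16 kHz sample rate, 60,000-sample clips, 1,024-sample FFT windows, 40 mel bins, 20 Hz to 8,000 Hz); the hop length is an assumption, as the paper does not state it.

```python
import librosa
import numpy as np

SR = 16_000        # sample rate used for all clips
CLIP_LEN = 60_000  # 60,000 samples ~= 3.75 seconds

def clip_to_melspectrogram(wav_path):
    # Load the raw WAV file at 16 kHz and fix the length to 60,000 samples.
    y, _ = librosa.load(wav_path, sr=SR)
    y = librosa.util.fix_length(y, size=CLIP_LEN)

    # Windowed FFTs strung together over time, mapped onto 40 mel bins
    # spanning 20 Hz to 8,000 Hz. hop_length=512 is an assumed value.
    S = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=1024, hop_length=512,
        n_mels=40, fmin=20, fmax=8000)

    # Convert power to decibels so the image has a usable dynamic range.
    return librosa.power_to_db(S, ref=np.max)
```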
Network and Training

We utilized the fast.ai [3] deep learning library built on PyTorch [9]. The network used was a pretrained ResNet50. The spectrograms were generated on a per-batch basis, with a batch size of 64 images. Each image was 432 × 288 pixels in size.

During training, the 1-cycle policy described in [11] was used. In this process, the learning rate is gradually increased and then decreased in a linear fashion over one cycle [13]. The learning rate finder within the fast.ai library was first used to determine the maximum learning rate for the 1-cycle training of the network. The learning rate increases until it hits this maximum, and then gradually decreases again. The length of the cycle was set to 8 epochs, meaning that 8 epochs are evaluated over the course of the cycle.

Binary classification was performed on two languages using clips of 60,000 samples. English and Russian were chosen for training and validation. To test the impact of the number of samples on classification while keeping the sample rate constant, binary classification was also performed on clips of 100,000 samples.
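A sketch of the training procedure above using the fastai v1 API follows; the `data` object (an ImageDataBunch built from per-batch spectrograms) and the max_lr value are assumptions, since the exact data pipeline code and learning rate are not reproduced here.

```python
from fastai.vision import *  # fastai v1: provides cnn_learner, models, accuracy

# `data` is assumed to be an ImageDataBunch whose items are the 432x288
# spectrogram images, generated per batch with a batch size of 64.
learn = cnn_learner(data, models.resnet50, metrics=accuracy)

# Learning rate finder: used to choose the maximum learning rate
# for the 1-cycle policy.
learn.lr_find()
learn.recorder.plot()

# One cycle of 8 epochs: the learning rate ramps up to max_lr, then decays.
# max_lr is a placeholder here; in practice it is read off the finder's plot.
learn.fit_one_cycle(8, max_lr=1e-3)
```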
For each language (English, Spanish, German, French, Russian, and Italian), 5,000 clips were placed in the training set. Each clip was 60,000 samples in length. 2,000 clips per language were placed in the validation set, and no speakers or clips appeared in both the training and validation sets.
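One way such a speaker-disjoint split could be produced is sketched below with scikit-learn; the clip paths and per-clip speaker labels are assumptions about how the VoxForge metadata might be organized, not code from the paper.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_speaker(clip_paths, speaker_ids, val_fraction=0.3, seed=0):
    """Split clips so that no speaker appears in both training and validation."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=val_fraction,
                                 random_state=seed)
    train_idx, val_idx = next(splitter.split(clip_paths, groups=speaker_ids))
    train_clips = [clip_paths[i] for i in train_idx]
    val_clips = [clip_paths[i] for i in val_idx]
    return train_clips, val_clips
```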
Results

Accuracy was calculated for both binary classification and multiclass classification as

    Accuracy = Number of Correct Predictions / Total Number of Predictions.
LIFAS binary classification accuracy for Russian and English clips of length 60,000 samples was 94%. In comparison, LIFAS binary classification accuracy on the clips of 100,000 samples was 97%. The accuracy totals given in the experiments above are calculated on the total number of clips in the validation set. The accuracy can also be broken out per language; there was essentially no difference between the accuracy on English clips and the accuracy on Russian clips.

To confirm that the network performance was not dependent on English and Russian language data, binary classification was tested on other languages with little to no impact on validation accuracy.

LIFAS accuracy for the multi-class network with six languages was 89%. These results were based on clips of 60,000 samples, since a sufficient number of longer clips was unavailable. Results from the 100,000-sample clips in the binary classification model suggest performance could be improved in the multi-class setting with longer clips.

The confusion matrix for the multi-class classification is shown in Figure 2.

Figure 2: The confusion matrix for the multiclass language identification problem.
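For reference, a confusion matrix and overall accuracy like those reported above can be produced from a trained fastai v1 learner as sketched below; `learn` refers to the (assumed) learner from the training sketch earlier.

```python
from fastai.vision import *  # fastai v1: provides ClassificationInterpretation

# Gather validation-set predictions and tabulate them per language.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(6, 6), dpi=100)

# Accuracy = number of correct predictions / total number of predictions.
cm = interp.confusion_matrix()
print("Validation accuracy:", cm.trace() / cm.sum())
```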
Discussion and Limitations
Notably, the highest rate of false negative classifications came when Spanish clips were classified as Russian, and when Russian clips were classified as Spanish. Additionally, almost no other language is misclassified as Russian or Spanish. One hypothesis for this observation is that Russian is the only Slavic language in the training set. Therefore, the network may be performing some thresholding at one layer that separates Russian from the other languages, and by chance Spanish clips lie near the threshold.

One limitation in our findings is that all of the data came from the same dataset. Since audio formats can have a wide variety of parameters such as bit rate, sampling rate, and bits per sample, we would expect clips from other datasets collected in different formats to potentially confuse the network. There is potential for this drawback to be overcome, assuming appropriate pre-processing steps were taken for the audio signals so that the spectrograms did not contain artifacts from the dataset itself. This is a problem that should be explored as more data becomes available.
Conclusion

This work shows the viability of using deep network architectures commonly used for image classification to identify languages from images generated from audio data. Robust performance can be accomplished using relatively short files with minimal pre-processing. We believe that this model can be extended to classify more languages so long as sufficient, representative training and validation data is available. A next step in testing the robustness of this model would be to include test data from additional (e.g., non-VoxForge) datasets.

Additionally, we would want the network to perform well in environments with varying levels of noise. The VoxForge data consists entirely of user-submitted audio clips, so the noise profiles of the clips vary, but more regimented tests should be done to see how robust the network is to different measured levels of noise. Simulated additive white Gaussian noise could be added to the training data to simulate low-quality audio, but this still might not fully mimic the effect of background noise such as car horns, clanging pots, or multiple speakers in a real-life environment.

Another way to potentially increase the robustness of the model would be to implement SpecAugment [8], a method that distorts spectrogram images in order to reduce overfitting and increase network performance by feeding in deliberately corrupted images. This may help to add scalability and robustness to the network, assuming the spectral distortions generated by SpecAugment accurately represent distortions observed in real-world audio signals.
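As a sketch of the simulated-noise experiment suggested above, additive white Gaussian noise can be mixed into a clip at a target signal-to-noise ratio; the function name and default SNR value are illustrative, not from the paper.

```python
import numpy as np

def add_white_gaussian_noise(signal, snr_db=20.0, rng=None):
    """Add white Gaussian noise to a 1-D audio signal at a target SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise) => P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```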
References

[1] Audio Classification using FastAI and On-the-Fly Frequency Transforms. url: https://towardsdatascience.com/audio-classification-using-fastai-and-on-the-fly-frequency-transforms-4dbe1b540f89.

[2] Christian Bartz et al. "Language Identification Using Deep Convolutional Recurrent Neural Networks". In: CoRR abs/1708.04811 (2017). url: http://arxiv.org/abs/1708.04811.

[3] fast.ai.

[4] Sriram Ganapathy et al. "Robust Language Identification Using Convolutional Neural Network Features". In: Interspeech 2014 (2014).

[5] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: CoRR abs/1512.03385 (2015). url: http://arxiv.org/abs/1512.03385.

[6] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016.

[7] Mozilla. Common Voice by Mozilla. url: voice.mozilla.org/en.

[8] Daniel S. Park et al. "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition". In: arXiv (2019). url: http://arxiv.org/abs/1904.08779.

[9] Adam Paszke et al. "Automatic differentiation in PyTorch". In: NIPS-W. 2017.

[10] Fred Richardson, Douglas A. Reynolds, and Najim Dehak. "A Unified Deep Neural Network for Speaker and Language Recognition". In: CoRR abs/1504.00923 (2015). url: http://arxiv.org/abs/1504.00923.

[11] Leslie N. Smith. "A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay". In: CoRR abs/1803.09820 (2018). url: http://arxiv.org/abs/1803.09820.

[12] Spoken language identification with deep convolutional networks. url: https://yerevann.github.io/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/.

[13] The 1cycle Policy. url: https://sgugger.github.io/the-1cycle-policy.html.

[14] The Vanishing Gradient Problem. url: https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484.

[15] VoxForge. url: voxforge.org.