Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals
Shauna Revay and Matthew Teschke
Novetta
May 14, 2019
Abstract

Recently, voice assistants have become a staple in the flagship products of many big technology companies such as Google, Apple, Amazon, and Microsoft. One challenge for voice assistant products is that the language that a speaker is using needs to be preset. To improve the user experience on this and similar tasks, such as automated speech detection or speech-to-text transcription, automatic language detection is a necessary first step.

The technique described in this paper, language identification for audio spectrograms (LIFAS), uses spectrograms of raw audio signals as input to a convolutional neural network (CNN) for language identification. One benefit of this process is that it requires minimal pre-processing: only the raw audio signals are input to the neural network, with the spectrograms generated as each batch is passed to the network during training. Another benefit is that the technique can utilize short audio segments (approximately 4 seconds) for effective classification, which is necessary for voice assistants that need to identify the language as soon as a speaker begins to talk.

LIFAS binary language classification had an accuracy of 97%, and multi-class classification with six languages had an accuracy of 89%.
Introduction

Finding a dataset of audio clips in various languages sufficiently large for training a network was an initial challenge for this task. Many datasets of this type are not open sourced [7]. VoxForge [15], an open-source corpus that consists of user-submitted audio clips in various languages, is the source of the data used in this paper.

Previous work in this area used deep networks as feature extractors, but did not use the networks themselves to classify the languages [4, 10]. LIFAS removes any feature extraction performed outside of the network. The network is fed a raw audio signal, and the spectrogram of the data is passed to the neural network during training. The last layer of the network outputs a vector of probabilities with one prediction per language. Thus, the whole process from raw audio signal to prediction of language is performed automatically by the neural network.

In [2], a CNN was combined with a long short-term memory (LSTM) network to classify language using spectrograms generated from audio. The network presented in [2] classified 4 languages using 10-second audio clips for training [12], while LIFAS achieves similar performance for 6 languages using 4-second audio clips. This demonstrates the robustness of the architecture and its improvement upon earlier techniques.

CNNs have been shown to give state-of-the-art results for image classification and a variety of other tasks. As neural networks using backpropagation were constructed to be deeper, with more layers, they ran into the problem of vanishing gradients [14]. A network updates its weights based on the partial derivatives of the error function from the previous layers. Often these derivatives become very small, so the weight updates become insignificant, which can lead to a degradation in performance.

One way to mitigate this problem is the use of residual neural networks (ResNets) [5]. ResNets utilize skip connections, which connect two non-adjacent layers. ResNets have shown state-of-the-art performance on image recognition tasks, which makes them a natural choice of network architecture for this task [6].
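To make the skip-connection idea concrete, the following is a minimal sketch of a residual block in PyTorch; it is illustrative only and simpler than the bottleneck blocks actually used in ResNet50.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: adding the block input gives gradients a short
        # path to earlier layers, mitigating the vanishing gradient problem.
        return self.relu(out + x)
```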
Spectrograms and Data

A spectrogram is an image representation of the frequencies present in a signal over time. The frequency spectrum of a signal can be generated from a time series signal using a Fourier transform. In practice, the Fast Fourier Transform (FFT) is applied to a section of the time series data to calculate the magnitude of the frequency spectrum for a fixed moment in time; this corresponds to one line in the spectrogram. The time series data is then windowed, usually in overlapping chunks, and the FFT data is strung together to form the spectrogram image, which allows us to see how the frequencies change over time.

Since we were generating spectrograms from audio data, the data was converted to the mel scale, generating "melspectrograms". These images will be referred to simply as "spectrograms" for the duration of this paper. The conversion from f hertz to m mels that we use is given by

    m = 2595 log10(1 + f / 700).

Audio data was collected from VoxForge [15]. Each audio signal was sampled at a rate of 16 kHz and cut down to be 60,000 samples long. In this context, a sample refers to a single data point in the audio clip, so this equates to 3.75 seconds of audio. The audio files were saved as WAV files and loaded into Python using the librosa library with a sample rate of 16 kHz. Each audio file of 60,000 samples was saved separately and is referred to as a clip. The training set consisted of 5,000 clips per language, and the validation set consisted of 2,000 clips per language.

Audio clips were gathered in English, Spanish, French, German, Russian, and Italian. Speakers had various accents and were of different genders. The same speaker may appear in more than one clip, but there was no cross-contamination between the training and validation sets.

Spectrograms were generated using parameters similar to the process discussed in [1], which used a frequency spectrum of 20 Hz to 8,000 Hz and 40 frequency bins. Each FFT was computed on a window of 1,024 samples. No other pre-processing was done on the audio files. Spectrograms were generated on the fly on a per-batch basis while the network was running (i.e., spectrograms were not saved to disk).
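The pipeline above can be sketched with librosa using the stated parameters (16 kHz sample rate, 60,000-sample clips, 1,024-sample FFT windows, 40 mel bins, 20 Hz to 8,000 Hz); the hop length is an assumption, as the paper does not state it.

```python
import librosa
import numpy as np

SR = 16_000        # sample rate used for all clips
CLIP_LEN = 60_000  # 60,000 samples ~= 3.75 seconds

def clip_to_melspectrogram(wav_path):
    # Load the raw WAV file at 16 kHz and fix the length to 60,000 samples.
    y, _ = librosa.load(wav_path, sr=SR)
    y = librosa.util.fix_length(y, size=CLIP_LEN)

    # Windowed FFTs strung together over time, mapped onto 40 mel bins
    # spanning 20 Hz to 8,000 Hz. hop_length=512 is an assumed value.
    S = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=1024, hop_length=512,
        n_mels=40, fmin=20, fmax=8000)

    # Convert power to decibels so the image has a usable dynamic range.
    return librosa.power_to_db(S, ref=np.max)
```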
Network and Training

We utilized the fast.ai [3] deep learning library built on PyTorch [9]. The network used was a pretrained ResNet50. The spectrograms were generated on a per-batch basis, with a batch size of 64 images. Each image was 432 × 288 pixels in size.

During training, the 1-cycle policy described in [11] was used. In this process, the learning rate is gradually increased and then decreased in a linear fashion over one cycle [13]. The learning rate finder within the fast.ai library was first used to determine the maximum learning rate for the 1-cycle training of the network. The learning rate increases until it hits this maximum, and then gradually decreases again. The length of the cycle was set to 8 epochs, meaning that 8 epochs are evaluated over the course of the cycle.

Binary classification was performed on two languages using clips of 60,000 samples. English and Russian were chosen for training and validation. To test the impact of the number of samples on classification while keeping the sample rate constant, binary classification was also performed on clips of 100,000 samples.
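A sketch of the training procedure above using the fastai v1 API follows; the `data` object (an ImageDataBunch built from per-batch spectrograms) and the max_lr value are assumptions, since the exact data pipeline code and learning rate are not reproduced here.

```python
from fastai.vision import *  # fastai v1: provides cnn_learner, models, accuracy

# `data` is assumed to be an ImageDataBunch whose items are the 432x288
# spectrogram images, generated per batch with a batch size of 64.
learn = cnn_learner(data, models.resnet50, metrics=accuracy)

# Learning rate finder: used to choose the maximum learning rate
# for the 1-cycle policy.
learn.lr_find()
learn.recorder.plot()

# One cycle of 8 epochs: the learning rate ramps up to max_lr, then decays.
# max_lr is a placeholder here; in practice it is read off the finder's plot.
learn.fit_one_cycle(8, max_lr=1e-3)
```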
For each language (English, Spanish, German, French, Russian, and Italian), 5,000 clips were placed in the training set. Each clip was 60,000 samples in length. 2,000 clips per language were placed in the validation set, and no speakers or clips appeared in both the training and validation sets.
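One way such a speaker-disjoint split could be produced is sketched below with scikit-learn; the clip paths and per-clip speaker labels are assumptions about how the VoxForge metadata might be organized, not code from the paper.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_speaker(clip_paths, speaker_ids, val_fraction=0.3, seed=0):
    """Split clips so that no speaker appears in both training and validation."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=val_fraction,
                                 random_state=seed)
    train_idx, val_idx = next(splitter.split(clip_paths, groups=speaker_ids))
    train_clips = [clip_paths[i] for i in train_idx]
    val_clips = [clip_paths[i] for i in val_idx]
    return train_clips, val_clips
```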
Results

Accuracy was calculated for both binary classification and multiclass classification as

    Accuracy = Number of Correct Predictions / Total Number of Predictions.
LIFAS binary classification accuracy for Russian and English clips of length 60,000 samples was 94%. In comparison, LIFAS binary classification accuracy on the clips of 100,000 samples was 97%. The accuracy totals given in the experiments above are calculated on the total number of clips in the validation set. The accuracy can also be broken out per language; there was essentially no difference between the accuracy on English clips and the accuracy on Russian clips.

To confirm that the network performance was not dependent on English and Russian language data, binary classification was tested on other languages with little to no impact on validation accuracy.

LIFAS accuracy for the multi-class network with six languages was 89%. These results were based on clips of 60,000 samples, since a sufficient number of longer clips was unavailable. Results from the 100,000-sample clips in the binary classification model suggest performance could be improved in the multi-class setting with longer clips.

The confusion matrix for the multi-class classification is shown in Figure 2.

Figure 2: The confusion matrix for the multiclass language identification problem.
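For reference, a confusion matrix and overall accuracy like those reported above can be produced from a trained fastai v1 learner as sketched below; `learn` refers to the (assumed) learner from the training sketch earlier.

```python
from fastai.vision import *  # fastai v1: provides ClassificationInterpretation

# Gather validation-set predictions and tabulate them per language.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(6, 6), dpi=100)

# Accuracy = number of correct predictions / total number of predictions.
cm = interp.confusion_matrix()
print("Validation accuracy:", cm.trace() / cm.sum())
```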
Discussion and Limitations
Notably, the highest rate of false negative classifications came when Spanish clips were classified as Russian, and when Russian clips were classified as Spanish. Additionally, almost no other language is misclassified as Russian or Spanish. One hypothesis for this observation is that Russian is the only Slavic language in the training set. Therefore, the network may be performing some thresholding at one layer that separates Russian from the other languages, and by chance Spanish clips lie near the threshold.

One limitation in our findings is that all of the data came from the same dataset. Since audio formats can have a wide variety of parameters such as bit rate, sampling rate, and bits per sample, we would expect clips from other datasets collected in different formats to potentially confuse the network. There is potential for this drawback to be overcome, assuming appropriate pre-processing steps were taken for the audio signals so that the spectrograms did not contain artifacts from the dataset itself. This is a problem that should be explored as more data becomes available.
Conclusion

This work shows the viability of using deep network architectures commonly used for image classification to identify languages from images generated from audio data. Robust performance can be accomplished using relatively short files with minimal pre-processing. We believe that this model can be extended to classify more languages so long as sufficient, representative training and validation data is available. A next step in testing the robustness of this model would be to include test data from additional (e.g., non-VoxForge) datasets.

Additionally, we would want the network to perform well in environments with varying levels of noise. The VoxForge data consists entirely of user-submitted audio clips, so the noise profiles of the clips vary, but more regimented tests should be done to see how robust the network is to different measured levels of noise. Simulated additive white Gaussian noise could be added to the training data to simulate low-quality audio, but this still might not fully mimic the effect of background noise such as car horns, clanging pots, or multiple speakers in a real-life environment.

Another way to potentially increase the robustness of the model would be to implement SpecAugment [8], a method that distorts spectrogram images in order to reduce overfitting and increase network performance by feeding in deliberately corrupted images. This may help to add scalability and robustness to the network, assuming the spectral distortions generated by SpecAugment accurately represent distortions observed in real-world audio signals.
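As a sketch of the simulated-noise experiment suggested above, additive white Gaussian noise can be mixed into a clip at a target signal-to-noise ratio; the function name and default SNR value are illustrative, not from the paper.

```python
import numpy as np

def add_white_gaussian_noise(signal, snr_db=20.0, rng=None):
    """Add white Gaussian noise to a 1-D audio signal at a target SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise) => P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```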
References

[1] Audio Classification using FastAI and On-the-Fly Frequency Transforms. url: https://towardsdatascience.com/audio-classification-using-fastai-and-on-the-fly-frequency-transforms-4dbe1b540f89.

[2] Christian Bartz et al. "Language Identification Using Deep Convolutional Recurrent Neural Networks". In: CoRR abs/1708.04811 (2017). url: http://arxiv.org/abs/1708.04811.

[3] fast.ai.

[4] Sriram Ganapathy et al. "Robust Language Identification Using Convolutional Neural Network Features". In: Interspeech 2014 (2014).

[5] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: CoRR abs/1512.03385 (2015). url: http://arxiv.org/abs/1512.03385.

[6] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016.

[7] Mozilla. Common Voice by Mozilla. url: voice.mozilla.org/en.

[8] Daniel S. Park et al. "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition". In: arXiv (2019). url: http://arxiv.org/abs/1904.08779.

[9] Adam Paszke et al. "Automatic differentiation in PyTorch". In: NIPS-W. 2017.

[10] Fred Richardson, Douglas A. Reynolds, and Najim Dehak. "A Unified Deep Neural Network for Speaker and Language Recognition". In: CoRR abs/1504.00923 (2015). url: http://arxiv.org/abs/1504.00923.

[11] Leslie N. Smith. "A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay". In: CoRR abs/1803.09820 (2018). url: http://arxiv.org/abs/1803.09820.

[12] Spoken language identification with deep convolutional networks. url: https://yerevann.github.io/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/.

[13] The 1cycle Policy. url: https://sgugger.github.io/the-1cycle-policy.html.

[14] The Vanishing Gradient Problem. url: https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484.

[15] VoxForge. url: voxforge.org.