On Scaling Contrastive Representations for Low-Resource Speech Recognition
Lasse Borgholt, Tycho Max Sylvester Tax, Jakob Drachmann Havtorn, Lars Maaløe, Christian Igel
Department of Computer Science, University of Copenhagen, Denmark; Corti, Copenhagen; [email protected]
ABSTRACT
Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
Index Terms — automatic speech recognition, unsupervised learning, semi-supervised learning, self-supervised learning, representation learning
1. INTRODUCTION
Unsupervised learning for automatic speech recognition (ASR) has recently gained significant attention [1, 2, 3, 4, 5, 6, 7, 8, 9]. While the majority of work has focused on learning representations encoding the input for downstream tasks [1, 2, 4, 5, 6, 8, 9], the most promising results have been achieved with the wav2vec 2.0 framework (Fig. 1), where a pre-trained model is fine-tuned for speech recognition. However, these models are computationally expensive due to the large number of memory-intensive transformer layers. This contradicts the promise of easily applying these representations for new ASR models on low-resource languages [3]. In contrast to wav2vec 2.0, its predecessor (Fig. 2) does not require fine-tuning, as learned representations are used directly as input for an ASR model [1]. In addition, the pre-trained model has an order of magnitude fewer parameters than the large configuration of wav2vec 2.0.
Fig. 1. The wav2vec 2.0 framework [3]. The model is trained to identify the correct quantized target corresponding to the masked latent representations. The two proposed configurations have 95 and 317 million parameters, respectively.
Fig. 2. The wav2vec framework [1] extended with a backward context network (shaded area). The two context networks are independent, but are trained jointly with a shared encoder. The original model has 33 million parameters, while our extended model has only 18 million.

Because the frameworks are very similar, it seems obvious that representations extracted from wav2vec 2.0 would also be suitable input for training an ASR model. Training on extracted representations offers a lightweight alternative to the computationally expensive fine-tuning procedure described in [3]. We study how representations from the two versions of the open-source wav2vec framework compare when used as input for low-resource end-to-end speech recognition. We also propose a bidirectional extension to the original wav2vec framework that, similar to wav2vec 2.0, can use the entire latent sequence to learn contextualized representations. Our contributions are as follows:

1. We find that ASR models trained on the wav2vec 2.0 representations often end up in poor local minima. Decorrelating the feature space dimensions with PCA alleviates the training issues.
2. We provide an overview of ASR models trained on contrastive representations from publicly available models. Despite using a strong ASR model, performance is heavily degraded compared to fine-tuning for wav2vec 2.0. When given only 10 minutes of training data, the original wav2vec model outperforms wav2vec 2.0.
3. We propose a bidirectional extension to the original wav2vec framework. Bidirectionality consistently improves the performance of ASR models trained on the representations compared to representations from unidirectional baseline models.
2. CONTRASTIVE LEARNING FOR SPEECH

2.1. wav2vec
In the wav2vec framework, a PCM signal $\mathbf{x} \in \mathbb{R}^{T}$ of sequence length $T$ is mapped to a sequence of latent representations $\mathbf{z} = \mathrm{ENCODE}(\mathbf{x}) \in \mathbb{R}^{U \times D}$, where $U$ is the downsampled sequence length and $D$ is the dimensionality of the latent representation. This latent representation depends only locally on $\mathbf{x}$, with $\mathrm{ENCODE}(\cdot)$ parameterized by a convolutional neural network. The latent sequence $\mathbf{z}$ is fed to a context network to produce the representations used as input features for the downstream task, $\mathbf{c} = \mathrm{CONTEXT}(\mathbf{z}) \in \mathbb{R}^{U \times D}$. In wav2vec, the context network is also convolutional, but recurrent neural networks are an obvious alternative. The model is trained with a contrastive loss function inspired by contrastive predictive coding [10] that maximizes the similarity between a contextualized representation $\mathbf{c}_u$ and the $k$'th future latent representation $\mathbf{z}_{u+k}$ through a learned step-specific affine transformation $H_k \in \mathbb{R}^{D \times D}$:

$$\mathrm{SIM}_k(\mathbf{z}_i, \mathbf{c}_j) = \log\big(\sigma(\mathbf{z}_i^{\top} H_k \mathbf{c}_j)\big) \quad (1)$$

$$\mathcal{L}_k(\mathbf{z}, \mathbf{c}) = -\sum_{i=1}^{U-k}\Big(\mathrm{SIM}_k(\mathbf{z}_{i+k}, \mathbf{c}_i) + \sum_{d \in \mathcal{D}} \mathrm{SIM}_k(-\mathbf{z}_d, \mathbf{c}_i)\Big) \quad (2)$$

where $\sigma(\cdot)$ denotes the standard logistic function and $\mathcal{D}$ is a set of randomly sampled integers $d \sim \mathcal{U}\{1, U\}$ for indexing distractor samples $\mathbf{z}_d$. The total loss is defined as the sum over all $K$ temporal offsets, $\mathcal{L} = \sum_{k=1}^{K} \mathcal{L}_k$. For further details on the wav2vec architecture and training procedure, see [1].

2.2. wav2vec 2.0

Similar to the first version, wav2vec 2.0 also employs an encoder and a context network, but uses a coarser downsampling in the encoder and a transformer-based context network [11]. In addition, a quantization network is used to learn a latent target sequence $\mathbf{q} = \mathrm{QUANTIZE}(\mathbf{z}) \in \mathbb{R}^{U \times D}$. Before feeding $\mathbf{z}$ to the context network, approximately half of the $U$ time steps are masked (i.e., replaced) by a learned $D$-dimensional vector. Given a context representation $\mathbf{c}_u$ corresponding to a masked latent vector $\mathbf{z}_u$, the model is trained to distinguish the quantized target $\mathbf{q}_u$ from distractors $\mathbf{q}_d$ sampled uniformly from the other masked time steps. The set of distractor indices $\mathcal{D}$ includes the target index $u$:

$$\mathrm{SIM}(\mathbf{q}_i, \mathbf{c}_j) = \frac{\mathbf{q}_i^{\top} \mathbf{c}_j}{\|\mathbf{q}_i\|\,\|\mathbf{c}_j\|} \quad (3)$$

$$\mathcal{L}_u(\mathbf{q}, \mathbf{c}_u) = -\log \frac{e^{\mathrm{SIM}(\mathbf{q}_u, \mathbf{c}_u)/\kappa}}{\sum_{d \in \mathcal{D}} e^{\mathrm{SIM}(\mathbf{q}_d, \mathbf{c}_u)/\kappa}} \quad (4)$$

where $\kappa$ is a constant temperature. The total loss is obtained by summing over all masked time steps. The model is also trained with an entropy-based diversity loss that encourages equal use of the quantized representations. The masking procedure allows for a context network consisting of multiple transformer layers that incorporate information from the full sequence instead of only time steps prior to $u$. Thus, the masking feature is key in order to be able to use an architecture well suited for fine-tuning.
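To make the objective in Eqs. (3)-(4) concrete, the sketch below computes the wav2vec 2.0 loss for a single masked time step in PyTorch. It is a minimal illustration under our own assumptions; the function name, tensor shapes, and the temperature value are placeholders, not the fairseq implementation.

```python
# Minimal sketch of the wav2vec 2.0 contrastive objective, Eqs. (3)-(4):
# cosine similarity between the context vector at a masked step and its
# quantized target vs. distractors, followed by a softmax cross-entropy.
import torch
import torch.nn.functional as F

def wav2vec2_contrastive_loss(c_u, q_u, q_distractors, kappa=0.1):
    """c_u: (D,) context vector at a masked time step.
    q_u: (D,) quantized target for that step.
    q_distractors: (N, D) quantized vectors from other masked steps.
    kappa: temperature (an assumed placeholder value)."""
    # The candidate set includes the true target, as stated in the paper.
    candidates = torch.cat([q_u.unsqueeze(0), q_distractors], dim=0)      # (N+1, D)
    # Eq. (3): cosine similarity between each candidate and the context vector.
    sims = F.cosine_similarity(candidates, c_u.unsqueeze(0), dim=-1)      # (N+1,)
    # Eq. (4): negative log-softmax of the true target (index 0).
    return -torch.log_softmax(sims / kappa, dim=0)[0]

# Example with D = 768 features and 100 distractors (placeholder data).
loss = wav2vec2_contrastive_loss(torch.randn(768), torch.randn(768), torch.randn(100, 768))
```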
3. BIDIRECTIONAL EXTENSION
The context network of the original wav2vec only uses information prior to the offset latent vector $\mathbf{z}_{i+k}$. This avoids collapsing to a trivial solution and allows for online processing of streaming data. In contrast, wav2vec 2.0 requires the complete sequence at once as input to the transformer-based context network. If we consider this setting where online processing is not required, the original wav2vec model can be extended with an additional context network that operates backward from time step $U$ to $1$. To train the backward network, the loss in Eq. (2) is adapted by replacing $\mathbf{z}_{i+k}$ with $\mathbf{z}_{i-k}$. The total loss is obtained by summing the loss for the two context networks, as illustrated in Fig. 2. The context networks are independent, but trained jointly with the same encoder. The representations used for downstream tasks are the concatenation of the output from the two context networks.
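The following PyTorch sketch illustrates the bidirectional extension of Fig. 2: two independent unidirectional context networks over a shared encoder output, whose outputs are concatenated for downstream use. The module layout is our own reconstruction; the four 512-unit LSTM layers match the setup described in Section 4, but the concrete names and details are assumptions.

```python
# Sketch of the bidirectional extension: two independent LSTM context
# networks over a shared encoder output, one running forward and one
# backward in time. During pre-training, the forward network is trained
# with Eq. (2) and the backward network with the same loss using z_{i-k}
# targets; the total loss is their sum.
import torch
import torch.nn as nn

class BidirectionalContext(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Two unidirectional context networks; only the input direction differs.
        self.fwd = nn.LSTM(dim, dim, num_layers=4, batch_first=True)
        self.bwd = nn.LSTM(dim, dim, num_layers=4, batch_first=True)

    def forward(self, z):
        # z: (batch, U, dim) latent sequence from the shared encoder.
        c_fwd, _ = self.fwd(z)
        c_bwd, _ = self.bwd(torch.flip(z, dims=[1]))  # process steps U..1
        c_bwd = torch.flip(c_bwd, dims=[1])           # re-align with the forward time axis
        # Downstream representations are the concatenation of both directions.
        return torch.cat([c_fwd, c_bwd], dim=-1)      # (batch, U, 2*dim)
```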
4. EXPERIMENTS

4.1. Data and pre-trained models
The original wav2vec model was trained on the 960 hours of the LibriSpeech dataset [12]. We trained our bidirectional extension and baseline models on the same data. We used three pre-trained models from wav2vec 2.0: BASE, LARGE and VOX. The LARGE model is a deeper and wider version of the BASE model. Both are trained on the 960 hours of LibriSpeech. The VOX model is identical to LARGE, but trained on 60,000 hours of speech from the LibriLight dataset [13], which is an in-domain extension of LibriSpeech providing large quantities of unlabeled data and a standardization of smaller subsets from the original 960-hour training data. We trained ASR models on the 10 minute, 1 hour and 10 hour subsets of LibriLight for all representation models.

| Representations       | 10 min clean/other | 1 h clean/other | 10 h clean/other | PCA | D    | Params | GPU days |
|-----------------------|--------------------|-----------------|------------------|-----|------|--------|----------|
| Log mel-spectrogram   | 99.6 / 99.7        | 66.5 / 82.0     | 33.8 / 57.5      | No  | 80   | -      | -        |
| wav2vec [1]           | 71.7 / 82.5        | 43.1 / 61.9     | 24.0 / 45.8      | No  | 512  | 33M    | ?        |
| wav2vec 2.0 BASE [3]  |                    |                 |                  | Yes | 768  | 95M    | 102.4    |
| wav2vec 2.0 LARGE [3] |                    |                 |                  |     |      |        |          |
| wav2vec 2.0 VOX [3]   |                    |                 |                  |     |      |        |          |
| Our work:             |                    |                 |                  |     |      |        |          |
| LSTM-UD-512           | 69.9 / 81.9        | 41.5 / 60.9     | 23.8 / 45.2      | No  | 512  | 9.6M   | 3.8      |
| LSTM-UD-2x512         | 69.2 / 81.2        | 41.0 / 61.0     | 23.2 / 44.9      | No  | 1024 | 18M    | 9.5      |
| LSTM-BD-2x512         |                    |                 |                  |     |      |        |          |

Table 1. Word error rates on the clean and other test sets of LibriSpeech for ASR models trained with representations extracted from wav2vec, wav2vec 2.0 and our proposed models. Results are without an external language model. GPU days denotes training time multiplied by the number of GPUs; the "?" marks that the wav2vec model is trained on 16 GPUs, but training time is not stated. Pre-trained wav2vec and wav2vec 2.0 models are available at https://github.com/pytorch/fairseq. The proposed model is made available upon publication.
In addition to bidirectionality, we propose to use few filters for the first layer in the encoder network and then incrementally increase the number of filters as the temporal resolution is lowered by striding. This significantly lowers the memory footprint of the encoder by avoiding large representations while the temporal resolution is high. Thus, our encoder uses six 1D-convolutions with the number of filters set to (64, 128, 192, 256, 512, 512), kernel sizes (10, 8, 4, 4, 4, 1), and strides (5, 4, 2, 2, 2, 1). With a constant filter size of 512, memory consumption would be 4.6 times higher. Each convolutional layer is followed by a group normalization layer with 32 groups [14] and a ReLU non-linearity clipped at the value of 5. Instead of using convolutions as in wav2vec, we used four LSTM layers [15] with 512 units each for the context network. We sampled 120 seconds of audio for each batch and trained the model for 8 epochs on LibriSpeech. We used Adam [16] with a fixed learning rate for the first half of training, after which it was decayed. We use K = 12 offsets and sample 10 distractors (i.e., |D| = 10).
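As a rough reconstruction of the encoder just described (six 1D-convolutions with increasing filter counts, group normalization with 32 groups, and a ReLU clipped at 5), the following PyTorch sketch shows one possible implementation. It follows the stated hyperparameters but is not the authors' released code.

```python
# Encoder sketch: six 1D convolutions with an increasing number of filters,
# each followed by group normalization (32 groups) and a ReLU clipped at 5.
import torch.nn as nn

def build_encoder(in_channels=1):
    filters = (64, 128, 192, 256, 512, 512)
    kernels = (10, 8, 4, 4, 4, 1)
    strides = (5, 4, 2, 2, 2, 1)
    layers = []
    for out_channels, k, s in zip(filters, kernels, strides):
        layers += [
            nn.Conv1d(in_channels, out_channels, kernel_size=k, stride=s),
            nn.GroupNorm(32, out_channels),          # 32 groups, as in [14]
            nn.Hardtanh(min_val=0.0, max_val=5.0),   # ReLU clipped at 5
        ]
        in_channels = out_channels
    return nn.Sequential(*layers)
```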
For the ASR model, we used the architecture from [17] trained with a connectionist temporal classification loss [18], which has shown state-of-the-art results on the small Wall Street Journal dataset [19]. The original model uses three layers of 2D-convolutions followed by 10 bidirectional LSTM layers with skip-connections and 320 units each. We replaced the 2D-convolutions with 1D-convolutions, as there is no structure along the feature dimension of the learned representations. All 1D-convolutions used kernel size 3, had (640, 480, 320) units and strides (2, 1, 1). To account for the lower temporal resolution of the wav2vec 2.0 representations, strides were reduced to (1, 1, 1). We used the same optimizer and learning rate schedule as for the CPC models, and batches were created by sampling up to 320 seconds of audio. The models were trained for 25k update steps on the 1 hour and 10 hour subsets, but only for 10k update steps on the 10 minute subset. Total training time for 25k updates was on the order of hours on a single GPU. Results reported for the 10 minute models are averages over the 6 separate subsets of LibriLight.
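For reference, a minimal example of the connectionist temporal classification loss [18] used to train the ASR model is shown below; the shapes and vocabulary size are placeholders rather than the paper's exact configuration.

```python
# Minimal illustration of training with a CTC loss in PyTorch.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, B, V = 200, 4, 32                                   # frames, batch size, vocabulary (incl. blank)
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(dim=-1)  # network outputs
targets = torch.randint(1, V, (B, 50))                 # label sequences (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 50, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```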
5. RESULTS

5.1. Training with wav2vec 2.0 representations
We found that ASR models trained on representations extracted from the wav2vec 2.0 models had a tendency to get stuck in poor local minima. After confirming that the values of the learned features followed a reasonable distribution, and that tuning the learning rate did not solve the issue, we performed a principal component analysis (PCA) of the representations. We found that the wav2vec 2.0 representations generally exhibited a low linear dimensionality, that is, only few principal components are needed to explain the variance in the representations; see Fig. 3. Furthermore, the linear dimensionality of the representations decreased with model complexity and the amount of training data.
[Fig. 3 panel legends — number of features accounting for each explained-variance band:]
Log-mel spectrogram: 0%–90%: 6 / 80 (7.5%); 90%–99%: 43 / 80 (53.8%)
wav2vec: 0%–90%: 117 / 512 (22.9%); 90%–99%: 326 / 512 (63.7%)
wav2vec 2.0 BASE: 0%–90%: 89 / 768 (11.6%); 90%–99%: 369 / 768 (48.0%)
wav2vec 2.0 LARGE: 0%–90%: 39 / 1024 (3.8%); 90%–99%: 260 / 1024 (25.4%)
wav2vec 2.0 VOX: 0%–90%: 29 / 1024 (2.8%); 90%–99%: 106 / 1024 (10.4%)
LSTM-BD-2x512: 0%–90%: 328 / 1024 (32.0%); 90%–99%: 832 / 1024 (81.2%)
Fig. 3. Explained variance ratio as a function of the number of features after decorrelating the feature space with PCA. The PCA transformation was computed on the 10 hour subset of LibriLight.

Indeed, the two large models were also the ones that consistently failed, while the BASE model did converge on both the 1 hour and 10 hour subsets.
Feature decorrelation has previously proven useful in speech classification tasks [20]. Training ASR models on the decorrelated feature space, without reducing the number of features, solved the initial training issues. To ensure that the mean normalization commonly performed prior to the PCA transformation was not responsible for resolving the issue, we performed an ablation experiment where we only used mean normalization on the raw features, but this did not alleviate the training issues. Representations from our models and wav2vec did not benefit from decorrelation.
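A minimal sketch of this decorrelation step is shown below, here using scikit-learn's PCA for illustration; the authors do not state which implementation they used, and the data is a placeholder.

```python
# Fit a full PCA (no dimensionality reduction) on stacked representation
# frames and transform the ASR training inputs; also compute the explained
# variance ratio, as in Fig. 3.
import numpy as np
from sklearn.decomposition import PCA

feats = np.random.randn(100_000, 768)                  # stacked wav2vec 2.0 frames (placeholder)
pca = PCA(n_components=feats.shape[1])                 # keep all features, only decorrelate
pca.fit(feats)

decorrelated = pca.transform(feats)                    # input features for the ASR model
cumvar = np.cumsum(pca.explained_variance_ratio_)      # cumulative explained variance
n_90 = int(np.searchsorted(cumvar, 0.90) + 1)          # components needed for 90% variance
```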
Surprisingly, the BASE representations consistently outperformed representations from the two larger models, indicating that the quality of the learned representations does not scale with model complexity for wav2vec 2.0. For the 1 hour and 10 hour subsets, representations from the VOX model led to better performance compared to the LARGE model, showing the benefit of the large increase in training data. This tendency was blurred by the poor performance of both representations on the 10 minute subset. Compared to the fine-tuning results without a language model in Appendix C of [3], performance is severely degraded despite a strong ASR model. Focusing on the best performing wav2vec 2.0 representations from BASE, we observe a significant word error rate reduction for the 10 hour subset compared to wav2vec. As the amount of training data is reduced, so is the difference. For the 10 minute subset, the picture is reversed, as the wav2vec representations performed better.
Our baseline model (LSTM-UD-512), as expected, yielded representations on par with the wav2vec model. The bidirectional extension (LSTM-BD-2x512) consistently improved the word error rate for all subsets. To ensure the improvement is not just a result of increased model complexity, we trained another model that also used two separate context networks, but both operating in the same direction (LSTM-UD-2x512). Although this model was slightly better than the baseline model, the bidirectional model was still superior across all subsets. Furthermore, the bidirectional model gave the best result on the clean test set when trained on the 1 hour subset, and for both test sets when trained on the 10 minute subset.
6. CONCLUSIONS
We compared contrastive representations for ASR in the setting of limited training resources. We showed that representations from the wav2vec 2.0 framework live in a low-dimensional subspace. Using PCA to decorrelate the features alleviated training issues for the speech recognizer. However, ASR models trained on the fixed wav2vec 2.0 representations still performed significantly worse than the fine-tuned versions from the original wav2vec 2.0 work. Representations from the first version of wav2vec, learned at a much lower computational cost, performed better than wav2vec 2.0 on the 10 minute subset, but were inferior on the 10 hour subset. We extended the original wav2vec framework with a context network operating backwards along the temporal dimension and confirmed that bidirectionality can improve speech representations used for ASR.

7. REFERENCES

[1] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli, "wav2vec: Unsupervised pre-training for speech recognition," in INTERSPEECH. ISCA, 2019, pp. 3465–3469.
[2] Alexei Baevski, Steffen Schneider, and Michael Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," arXiv preprint:1910.05453, 2019.
[3] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[4] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass, "An unsupervised autoregressive model for speech representation learning," in INTERSPEECH. ISCA, 2019, pp. 146–150.
[5] Yu-An Chung and James Glass, "Generative pre-training for speech with autoregressive predictive coding," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 3497–3501.
[6] Yu-An Chung, Hao Tang, and James Glass, "Vector-quantized autoregressive predictive coding," in INTERSPEECH. ISCA, 2020, pp. 3760–3764.
[7] Weiran Wang, Qingming Tang, and Karen Livescu, "Unsupervised pre-training of bidirectional speech encoders via masked reconstruction," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6889–6893.
[8] Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, and James Glass, "A convolutional deep Markov model for unsupervised speech representation learning," arXiv preprint:2006.02547, 2020.
[9] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," in INTERSPEECH. ISCA, 2019, pp. 161–165.
[10] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint:1807.03748, 2018.
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[12] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[13] Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al., "Libri-Light: A benchmark for ASR with limited or no supervision," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7669–7673.
[14] Yuxin Wu and Kaiming He, "Group normalization," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[15] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[17] Lasse Borgholt, Jakob D. Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, and Christian Igel, "Do end-to-end speech recognition models care about context?," in INTERSPEECH. ISCA, 2020, pp. 4352–4356.
[18] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in International Conference on Machine Learning (ICML). ACM, 2006, pp. 369–376.
[19] Douglas B. Paul and Janet M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.
[20] W. Q. Zheng, J. S. Yu, and Y. X. Zou, "An experimental study of speech emotion recognition based on deep convolutional neural networks," in