General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework
Yucheng Zhao, Dacheng Yin, Chong Luo, Zhiyuan Zhao, Chuanxin Tang, Wenjun Zeng, Zheng-Jun Zha
University of Science and Technology of China, Microsoft Research Asia
{lnc, ydc}@mail.ustc.edu.cn, {cluo, zhiyzh, chutan, wezeng}@microsoft.com, [email protected]

Abstract
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning. In the design of MGF, the speech hierarchy is taken into consideration. Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales. For phoneme-scale learning, we borrow the idea from the masked language model but tailor it for the continuous speech signal by replacing the classification loss with a contrastive loss. We corroborate our design by evaluating the MGF representation on various downstream tasks, including phoneme classification, speaker classification, speech recognition, and emotion classification. Experiments verify that training at different time scales needs different training targets and loss functions, which in general complement each other and lead to better performance.
Introduction

Unsupervised pre-training, or representation learning, has drawn wide interest in both academia and industry. The BERT model [Devlin et al., 2019] has become a universal feature extractor for solving a wide range of natural language processing (NLP) tasks. Recently, it has been reported that image embeddings learned in an unsupervised manner achieve performance comparable to their supervised counterparts on the image classification task [He et al., 2020; Chen et al., 2020]. Most contemporary unsupervised pre-training methods adopt the self-supervised learning approach; we use these two terms interchangeably in this paper to refer to methods that do not need human annotation.

In the speech domain, pre-training is not a new concept. The speaker recognition task depends heavily on a supervised pre-training step to obtain a good feature embedding. Recently, self-supervised learning has also been used to pre-train dedicated models for automatic speech recognition (ASR) [Schneider et al., 2019; Baevski et al., 2020a; Baevski et al., 2020b; Ling et al., 2020]. In this work, however, we do not focus on such task-oriented pre-training.
[Figure 1 shows the waveform of the utterance "The meeting is now adjourned", annotated with frames (10 ms), phonemes such as /jh/ (50 ms) and /m/ (80 ms), and the whole sentence (1446 ms).]
Figure 1: Speech hierarchy. The waveform is sampled at 16 kHz. Sample points within a 10 ms segment form a frame, which is the basic operating unit in many speech algorithms. Phonemic information can be extracted from several frames, as we illustrate with red boxes. The whole sentence lasts for more than one second.
Instead, we aim to pre-train a general-purpose feature extractor which embeds a speech signal into a feature representation that can be used for a variety of downstream speech tasks, in a way similar to how pre-trained language and image representations are used in their respective domains.

The main difficulty in learning a general-purpose speech representation is that speech carries a complex hierarchical structure (samples, phonemes, and sentences) which contains relevant information at different time scales [Pascual et al., 2019]. In this work, we propose a Multi-Granularity Framework, named MGF, to train the model at multiple time scales. A key innovation in MGF is to adopt different learning approaches at different time scales. In particular, we use generative approaches to capture fine-grained information at small time scales on the order of a few milliseconds, and we adopt discriminative approaches to distill semantic information at large time scales corresponding to a phoneme and a sentence. In order to realize phoneme-level contrastive learning, we extend the token-oriented masked language model (MLM) [Devlin et al., 2019] to a continuous masked language model (cMLM) that accommodates continuous speech signals without token boundaries. MGF is implemented with a deep bidirectional Transformer [Vaswani et al., 2017; Devlin et al., 2019].

We evaluate the MGF representation on multiple downstream tasks and benchmark datasets, which is the second main contribution of our work. The performance of MGF is first evaluated on phoneme classification and speaker classification tasks, following other general-purpose speech representation learning work [van den Oord et al., 2018; Liu et al., 2020b]. We find that the features learned by MGF are very powerful on these two orthogonal tasks. On the LibriSpeech dataset, the MGF representation achieves a phoneme classification accuracy of 73.4% under linear evaluation, surpassing existing unsupervised pre-training methods by a large margin. On the speaker classification task, the MGF representation is the first to achieve an accuracy of 100%.

We further evaluate MGF on three other downstream tasks. First, in view of the saturated performance in speaker classification, we propose a new and harder task named one-shot speaker classification, where only one utterance per speaker is provided in the fine-tuning stage. On this task, MGF is evaluated against the well-known x-vector and d-vector and is shown to achieve better performance. Second, we compare MGF with a task-specific pre-training model, wav2vec, on the ASR task. Third, we test the MGF representation on the IEMOCAP emotion classification task. Surprisingly, simply appending a fully-connected layer after MGF achieves the top performance among all existing audio-based approaches.

Related Work

There are two camps of self-supervised learning approaches, namely discriminative and generative approaches. We first review these two approaches for speech pre-training, and then discuss other related work that motivates MGF.
Discriminative approaches acquire the supervision signal from the contrastive distance between a selected positive sample and several negative samples. By carefully designing the training target and the data sampling procedure, samples can be automatically labelled.

Contrastive predictive coding (CPC) [van den Oord et al., 2018] is a contrastive learning method based on predicting the future in the latent space. The representations of temporally nearby segments are treated as positive samples while those of temporally distant segments are treated as negative samples. However, one can easily find a counter-example in speech processing. For example, a word may appear twice in an utterance with the same meaning. When the first appearance is the anchor, the second appearance should not be treated as a negative sample no matter how far away it is. Previous work [Chung et al., 2019] also notices that the choice of negative samples in CPC has a huge effect on its performance on the phoneme classification task.

While CPC itself is a general-purpose speech pre-training method, it can be leveraged in some task-specific pre-training models, such as wav2vec [Schneider et al., 2019], vq-wav2vec [Baevski et al., 2020a], and wav2vec 2.0 [Baevski et al., 2020b]. Vq-wav2vec proposes a quantization algorithm so that wav2vec (which adopts CPC) can be combined with the BERT model [Devlin et al., 2019] to achieve better performance. Wav2vec 2.0 improves vq-wav2vec by training the entire model end-to-end. It also uses a very large unlabelled dataset for pre-training. These task-specific pre-training models are very powerful on their target task, but perform poorly on other speech tasks.
Generative approaches learn to reconstruct the signal in the input space or features in some latent space. Training is supervised by a reconstruction loss. Autoregressive predictive coding (APC) [Chung et al., 2019] uses an autoregressive model to encode the history and predict the future. A follow-up work [Chung and Glass, 2020] adds an auxiliary objective which encourages the model to additionally remember the past. DeCoAR [Ling et al., 2020] borrows the bidirectional learning idea from ELMo [Peters et al., 2018] so that it can learn deep contextualized acoustic representations for semi-supervised speech recognition.

Inspired by the MLM proposed in BERT [Devlin et al., 2019], recent works [Liu et al., 2020b; Liu et al., 2020a] have explored using BERT-style objectives in speech pre-training. In Mockingjay [Liu et al., 2020b], part of the input frames are masked to zeros and the pre-trained encoder is required to predict each masked frame from its neighborhood. TERA [Liu et al., 2020a] extends Mockingjay by introducing channel alteration and magnitude alteration.
PASE [Pascual et al., 2019] uses multiple regressors and discriminators to learn a problem-agnostic speech encoder. A follow-up work, PASE+ [Ravanelli et al., 2020], improves PASE for robust speech recognition in noisy and reverberant environments by introducing data augmentation, more regression tasks, and a collection of architecture modifications.

Our work and PASE both consider combinations of generative and discriminative objectives. However, PASE does not consider the speech hierarchy. In our work, different objectives are used to handle signals at different time scales.
Our work is also inspired by some self-supervised learning methods in other domains. BERT [Devlin et al., 2019] is a milestone work for pre-training in NLP. The core of BERT is the MLM, where some input tokens are randomly masked out and the training objective is to predict the vocabulary ID of each masked word based only on its context. SimCLR [Chen et al., 2020] proposes a simple contrastive learning framework for visual representation learning. It adopts a contrastive loss between augmented views of an image without relying on specialized architecture designs or a memory bank mechanism. BERT and SimCLR inspired our phoneme-scale and sentence-scale contrastive learning, respectively.
The main idea behind MGF is to extract the information attached to each level of the speech hierarchy through multi-granularity objectives. At the finest granularity, we adopt generative approaches that reconstruct the original waveform and selected hand-crafted features to extract sample-scale and frame-scale information, respectively. At a coarser granularity, we design a novel continuous masked language model (cMLM) which masks several consecutive frames with the typical phoneme length. The model is trained to estimate the feature embedding of one of the masked frames based on the context information. We do not expect the model to recover the exact features, but we hope that the estimated feature for the masked frame is close to the ground-truth feature and far away from the features of other frames in different phonemes. At the sentence level, learning encourages segments within the same sentence to have close representations and segments across different sentences to have representations that are far apart.

Figure 2: A sketch of the multi-granularity framework for self-supervised speech representation learning. Different heads are not weight-sharing. Recon.: Reconstructed. Aug.: Augmentation. Wave.: Waveform.
Learning of sample-scale and frame-scale information adopts a reconstruction loss, while learning of phoneme-scale and sentence-scale information adopts a contrastive loss.
Sample-Scale Loss
As Fig. 2 shows, a decoder is appended to the base module to reconstruct the original signal from the feature embedding. Let $x$ denote the original signal and $\bar{x}$ denote the reconstructed signal. The sample-scale loss is implemented by the scale-invariant signal-to-distortion ratio (SI-SDR) [Roux et al., 2019] loss, which is formulated as:

$$\mathcal{L}_{\text{sample}} = -10 \log_{10} \left( \frac{\|\alpha x\|^2}{\|\alpha x - \bar{x}\|^2} \right), \quad \text{where } \alpha = \frac{\bar{x}^{T} x}{\|x\|^2}. \tag{1}$$

The SI-SDR loss is widely used in speech separation [Zhao et al., 2020] and we empirically find it works better than the L1 loss.
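For concreteness, a minimal PyTorch sketch of Eq. (1) is given below. The function name, batching convention, and epsilon guard are our own choices rather than the released MGF implementation; in practice the mean is often removed from both signals before computing SI-SDR, which this sketch omits.

```python
import torch

def si_sdr_loss(x: torch.Tensor, x_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR between reference x and reconstruction x_hat, both (batch, time)."""
    # Optimal scaling of the reference: alpha = (x_hat^T x) / ||x||^2, as in Eq. (1).
    alpha = (x_hat * x).sum(dim=-1, keepdim=True) / (x.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = alpha * x                    # scaled reference
    noise = target - x_hat                # residual distortion
    si_sdr = 10 * torch.log10(target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_sdr.mean()                 # loss = negative SI-SDR, averaged over the batch
```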
Frame-Scale Loss

Several heads, each composed of two convolutional (conv) layers, are appended to the base model to generate selected hand-crafted features frame by frame. These features, including the log power spectrum (LPS), mel-frequency cepstral coefficients (MFCC), and feature-space maximum likelihood linear regression (fMLLR) features, have been proven effective in speech-related tasks. Following PASE+, we compute LPS and MFCC with context windows of both 25 ms and 400 ms. The L2 loss is used as the optimization target and is formulated as:

$$\mathcal{L}_{\text{frame}} = \mathbb{E}_{u \in \mathcal{U}} \sum_{h \in \mathcal{H}} \omega_h \, \| u_h - \hat{u}_h \|^2, \tag{2}$$

where $\mathcal{U}$ is the set of unmasked frames, $\mathcal{H}$ is the set of hand-crafted feature indicators, $\omega_h$ is the weight of feature $h$ ($h \in \mathcal{H}$), and $u_h$, $\hat{u}_h$ are the ground truth and the estimate of feature $h$ for frame $u$, respectively.
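A minimal sketch of Eq. (2), assuming each head's output and target have already been gathered for the unmasked frames; the dictionary-based interface and the feature names are illustrative, not the actual MGF code.

```python
import torch

def frame_loss(predictions: dict, targets: dict, weights: dict) -> torch.Tensor:
    """Weighted L2 loss over hand-crafted feature heads (Eq. 2).

    predictions/targets map a feature name (e.g. 'lps', 'mfcc', 'fmllr') to a
    (batch, frames, dims) tensor; only unmasked frames should be passed in.
    """
    loss = torch.zeros(())
    for name, pred in predictions.items():
        # Mean squared error between the estimated and ground-truth features of this head.
        loss = loss + weights[name] * (pred - targets[name]).pow(2).mean()
    return loss
```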
Phoneme-Scale Loss

A speech segment that is tens to hundreds of milliseconds long contains distinguishable phonemes. We believe that learning distinguishable high-level semantic information is more important than waveform or feature reconstruction at this time scale.

Phoneme-scale information is learned by our proposed continuous MLM and a discriminative learning objective. In the vanilla MLM, discrete tokens from a finite dictionary are masked in the language input and then predicted at the output based on their context. However, speech is a continuous signal without token boundaries, so it is not possible to precisely mask pre-defined phonemes. To address this challenge, we randomly mask fixed-length speech segments and use the InfoNCE loss [van den Oord et al., 2018] to evaluate the quality of the estimated features of a masked frame. In our implementation, each masked segment has a length of 140 ms, and the total length of the masked segments does not exceed 20% of the input speech crop. Each masked segment is replaced by non-speech noise, which we empirically find to be a better choice than a segment of zeros or a random speech segment.

The InfoNCE loss directly operates on real-valued feature vectors and is formulated as:

$$\mathcal{L}_{\text{phoneme}} = -\mathbb{E}_{v \in \mathcal{V}} \log \frac{\exp(v^{T}\hat{v}/\tau)}{\sum_{k=1}^{K} \exp(v^{T} v_k/\tau) + \exp(v^{T}\hat{v}/\tau)}, \tag{3}$$

where $\mathcal{V}$ is the set of masked frames, $v$ is the anchor sample, $\hat{v}$ is the positive sample, and $v_k$ ($k = 1, \dots, K$) are negative samples. $\tau$ is a temperature parameter.

Note that each sample is the feature representation of a single frame whose duration is 10 ms. The anchor sample is drawn from the MGF representation of the masked crop, while the positive sample and negative samples are drawn from the feature representation of the original unmasked crop. In other words, the anchor sample is the estimated feature, while the positive and negative samples are ground-truth features at the same or different locations, respectively.
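The loss in Eq. (3) reduces to a cross-entropy over one positive and K negative similarities. Below is a hedged PyTorch sketch; the tensor layout, the default temperature, and the function name are our assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def phoneme_infonce(anchor: torch.Tensor, positive: torch.Tensor,
                    negatives: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE loss of Eq. (3) for a batch of masked frames.

    anchor:    (batch, dim)     feature estimated at a masked frame
    positive:  (batch, dim)     ground-truth feature at the same frame (unmasked crop)
    negatives: (batch, K, dim)  ground-truth features at other frames
    """
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / tau    # (batch, 1)
    neg_logits = torch.einsum('bd,bkd->bk', anchor, negatives) / tau   # (batch, K)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)                # (batch, 1 + K)
    # The positive sample sits at index 0, so InfoNCE becomes a standard cross-entropy.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```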
Sentence-Scale Loss

The sentence-scale loss focuses on capturing semantic information relevant to a long time scale. We adopt the SimCLR framework [Chen et al., 2020], which was proposed for visual representation learning, to implement this idea, with some small modifications. First, since cropping is useful only in computing the sentence-scale loss, we crop two segments of a sentence before applying other data augmentations; one segment is then treated as the original crop for all the other objectives. Second, we use data augmentations specific to speech signals, namely temporal masking and additive noise.

The sentence-scale loss is defined as:

$$\mathcal{L}_{\text{sentence}} = -\log \frac{\exp(z_i^{T} z_j/\tau)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq j]} \exp(z_i^{T} z_k/\tau)}, \tag{4}$$

where $z_i$ is the anchor sample, $z_j$ is the positive sample, $z_k$ ($k \neq j$) are negative samples, $N$ is the batch size, and $\tau$ is a temperature parameter. Each sample is obtained by averaging the MGF representation of all frames within a 2s-long speech crop and passing the averaged feature through a head of two conv layers. The positive sample corresponds to a speech crop from the same sentence as the anchor crop, while negative samples correspond to crops from other sentences. Following SimCLR, we draw negatives from the mini-batch so that there are N − 1 negative samples for each anchor.
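A sketch of Eq. (4), under the assumption that the two crops of each sentence in a mini-batch are arranged row-aligned in two tensors; following SimCLR we also L2-normalize the projected features (cosine similarity), which Eq. (4) does not state explicitly, and the default temperature is a placeholder.

```python
import torch
import torch.nn.functional as F

def sentence_loss(z_anchor: torch.Tensor, z_positive: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Sentence-scale contrastive loss of Eq. (4).

    z_anchor, z_positive: (N, dim) projected, mean-pooled crop features; row i of both
    tensors comes from the same sentence, all other rows of z_positive act as negatives.
    """
    z_anchor = F.normalize(z_anchor, dim=-1)     # cosine similarity, as in SimCLR
    z_positive = F.normalize(z_positive, dim=-1)
    sim = z_anchor @ z_positive.t() / tau        # (N, N) similarity matrix
    pos = sim.diag()                             # z_i^T z_j / tau for the positive pairs
    # Exclude the positive pair from the denominator, as the indicator in Eq. (4) does.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_only = sim.masked_fill(eye, float("-inf"))
    return (torch.logsumexp(neg_only, dim=1) - pos).mean()
```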
Multi-Granularity Objectives

We train the base encoder by combining the multi-granularity objectives. The total loss function is defined as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{sample}} + \lambda_2 \mathcal{L}_{\text{frame}} + \lambda_3 \mathcal{L}_{\text{phoneme}} + \lambda_4 \mathcal{L}_{\text{sentence}}, \tag{5}$$

where $\lambda_i$, $i = 1, 2, 3, 4$, are the weights of the individual losses. We tune each weight independently over a small set of candidate values.

MGF is implemented in PyTorch. The base encoder is composed of three conv layers and six Transformer blocks. The first conv layer has a kernel size of 320 and implements 512 filters with stride 160 and padding 80. The second conv layer has a kernel size of 1 and is followed by ReLU activation. These two conv layers serve as a stem which transforms the time-domain waveform into a compact feature vector, so that the Transformer does not need to handle very long, low-level input. The third conv layer aligns the dimension with the subsequent Transformer. In addition to the base encoder, the decoder for sample-level reconstruction uses four Transformer blocks. All Transformers share the same configuration, with hidden size $d_{\text{model}} = 768$, feed-forward size $d_{\text{ff}} = 3072$, and $h = 12$ attention heads.

In most of our experiments, we use the train-clean-100 subset of the LibriSpeech corpus [Panayotov et al., 2015], which contains 100 hours of speech data, as the pre-training dataset. We use the dev-clean subset as the validation dataset for model selection. In the pre-training stage, we use the raw signal only and ignore any human labels such as speaker IDs or transcriptions. Some of the MGF objectives rely on data augmentation with additive noise; for this purpose, we use the DNS challenge dataset [Reddy et al., 2020], which contains 70k clips of non-speech noise.

We evaluate the MGF representation on a wide range of downstream tasks. Except for the speech recognition task, linear evaluation is adopted, where the pre-trained model is appended with only one learnable layer. Classification accuracy is used as the evaluation metric. A majority of general-purpose speech pre-training works are evaluated on phoneme and speaker classification tasks. We use these two tasks for ablation studies in addition to the system comparison with methods in the same category. We also evaluate MGF on other downstream tasks to show its generalization capability.
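To make the stem description above concrete, the following sketch instantiates the three conv layers with the sizes given in the text; the module name, the absence of normalization layers, and other details are our guesses rather than the released implementation.

```python
import torch
import torch.nn as nn

class MGFStem(nn.Module):
    """Sketch of the convolutional stem in front of the Transformer encoder."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # 16 kHz waveform -> one 512-dim vector per 10 ms frame (stride of 160 samples).
        self.conv1 = nn.Conv1d(1, 512, kernel_size=320, stride=160, padding=80)
        self.conv2 = nn.Conv1d(512, 512, kernel_size=1)      # point-wise conv, followed by ReLU
        self.proj = nn.Conv1d(512, d_model, kernel_size=1)   # align dimension with the Transformer

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, samples) raw audio
        h = self.conv1(wave.unsqueeze(1))
        h = torch.relu(self.conv2(h))
        h = self.proj(h)
        return h.transpose(1, 2)  # (batch, frames, d_model) for the Transformer blocks
```

With a 16 kHz input, the stride of 160 samples makes each output vector correspond to one 10 ms frame, matching the frame duration assumed by the frame-scale and phoneme-scale losses.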
Phoneme classification: We follow the setup in CPC [van den Oord et al., 2018], using 41 phoneme classes and the train-clean-100 subset of LibriSpeech for both training and testing. For a fair comparison, we use the aligned phone labels and train/test split provided by CPC.
Speaker classification: Following CPC, we first use the LibriSpeech train-clean-100 subset containing 251 speakers for this task. Using the same train/test split provided by CPC, the proposed MGF achieves a classification accuracy of 100%. In view of this saturated performance, we propose a new setting called one-shot speaker classification. We use the train-clean-360 subset, which contains 921 speakers. For each speaker, we put only one utterance into the training set and sample 20% of the remaining utterances into the test set. This creates 3.2 hours of training data and 72 hours of testing data.
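A minimal sketch of how such a one-shot split can be constructed from a list of (speaker, utterance) pairs; the function, its arguments, and the fixed random seed are our own illustration, not the exact split used in the paper.

```python
import random
from collections import defaultdict

def one_shot_split(utterances, test_fraction=0.2, seed=0):
    """Build a one-shot speaker classification split.

    utterances: list of (speaker_id, utterance_path) pairs, e.g. from train-clean-360.
    Returns (train, test): one utterance per speaker for training, and roughly
    `test_fraction` of each speaker's remaining utterances for testing.
    """
    by_speaker = defaultdict(list)
    for speaker, utt in utterances:
        by_speaker[speaker].append(utt)

    rng = random.Random(seed)
    train, test = [], []
    for speaker, utts in by_speaker.items():
        rng.shuffle(utts)
        train.append((speaker, utts[0]))              # exactly one training utterance
        rest = utts[1:]
        n_test = int(round(test_fraction * len(rest)))
        test.extend((speaker, u) for u in rest[:n_test])
    return train, test
```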
Emotion classification: We use the interactive emotional dyadic motion capture (IEMOCAP) dataset [Busso et al., 2008] for this task. The corpus consists of five sessions with two speakers in each session. Following the usage in [Wu et al., 2019], we evaluate the MGF representation on four emotions, namely Neutral, Angry, Happy, and Sad. The IEMOCAP dataset contains scripted data and improvised data, and we only use the latter. We report results of five-fold cross-validation, using four sessions for training and the remaining session for validation and testing.
Speech Recognition: We use the Wall Street Journal (WSJ) dataset [Garofolo et al., 1993] for the ASR task. The corpus comprises about 81 hours of transcribed audio data. We train on si284, validate on nov93dev, and test on nov92. We use the lexicon-free acoustic model [Likhomanenko et al., 2019] and a 4-gram KenLM [Heafield et al., 2013] language model, both implemented with wav2letter++ [Pratap et al., 2019]. Word error rate (WER) and letter error rate (LER) are used as evaluation metrics. We use the training recipe provided by wav2letter++ and only modify the input embedding.

For pre-training and the classification downstream tasks, we use the Adam optimizer with warm-up to update the model. The learning rate, number of warm-up steps, and batch size are set separately for pre-training, phoneme classification, speaker classification, speaker verification, and emotion classification training. We also decay the learning rate exponentially with an exponent of 0.3. We use 4 V100 GPUs for both pre-training and downstream task fine-tuning. The total number of training epochs for pre-training is set to 300; more epochs yield slight improvement. For all downstream task fine-tuning, we set the total number of training epochs to 100, except for the data-efficiency experiments.

We first present an ablation study of MGF to show the effectiveness of the multi-granularity objectives and cMLM. We report accuracies on the phoneme classification and one-shot speaker classification tasks.
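As a rough illustration of the optimization setup, the sketch below combines Adam with a linear warm-up followed by a decay; the base learning rate, warm-up length, and the exact form of the decay are placeholders, since the description above does not specify them fully.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-4, warmup_steps=10000, decay_exponent=0.3):
    """Hedged sketch: Adam with linear warm-up, then a power-law decay.

    base_lr and warmup_steps are placeholder values, and the post-warm-up decay is
    only one possible reading of decaying the learning rate with an exponent of 0.3.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                   # linear warm-up
        return (warmup_steps / (step + 1)) ** decay_exponent   # decay after warm-up

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```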
Multi-Granularity Framework
We study whether all four objectives at different time scales contribute to the final accuracy of MGF, and assess their respective importance on the two target problems. We train MGF four times, discarding one of the four loss objectives at a time. Results are presented in Table 1. The first row presents the results of the full model and the other rows present the results obtained when one objective is discarded.

| Objectives | Phoneme Acc | One-Shot Speaker Acc |
|---|---|---|
| MGF (Full) | 73.4 | 82.7 |
| - Sample | 73.3 (-0.1) | 82.2 (-0.5) |
| - Frame | 69.2 (-4.2) | 79.9 (-2.8) |
| - Phoneme | 66.1 (-7.3) | 81.3 (-1.4) |
| - Sentence | 73.4 (-0.0) | 81.8 (-0.9) |

Table 1: Classification accuracies on phoneme classification and one-shot speaker classification using the MGF representation. The first row gives results of the full model; the following rows report results when each objective is discarded. Numbers in parentheses show the accuracy drop.

The first finding is that every objective matters. Discarding any objective leads to a notable accuracy drop in at least one task. Second, while some objectives have a general impact on both tasks, others turn out to be more task-oriented. For example, the sample-scale loss and the frame-scale loss are generally helpful. This is consistent with our design, as these two losses learn low-level, problem-agnostic information. The frame-scale loss is especially important as it injects human prior knowledge into the model. The other two losses, however, are more task-dependent. The phoneme-scale loss has a remarkable impact on the phoneme classification task (+27.4% in relative error) and the sentence-scale loss only contributes to the one-shot speaker classification task. It is worth noting that the phoneme-scale loss also contributes a lot to the one-shot speaker classification task. This is caused by the sampling strategy used in cMLM, which will be described next.

| Approach | Phoneme Acc | One-Shot Speaker Acc |
|---|---|---|
| Discriminative | 73.4 | 82.7 |
| Generative | 66.1 | 81.3 |

Table 2: Comparison of discriminative and generative approaches for phoneme-scale training.

cMLM
MGF uses cMLM and adopts the InfoNCE loss for the phoneme-scale target. Previous work [Liu et al., 2020b] has investigated a similar masking approach but used a generative objective with a reconstruction loss at a similar time scale. We believe that our proposed discriminative approach with the InfoNCE loss is more suitable for this time scale. To validate this, we implemented a generative approach which calculates the L1 loss between the predicted frame and the ground-truth frame in the masked segment. The two rows in Table 2 show the results of using the discriminative (InfoNCE) loss and the generative (L1) loss, respectively. The model trained with the discriminative approach achieves a 7.3% accuracy gain on phoneme classification and a 1.4% accuracy gain on one-shot speaker classification, compared with the model trained with the generative approach. Coincidentally, the speaker classification accuracy achieved by the generative approach (81.3%) is the same as that of MGF without the phoneme-scale objective. In other words, using a generative approach for phoneme-scale learning does not help the speaker classification task at all.

It is worth noting that the sampling strategy in cMLM has a notable impact on the performance of the MGF representation. There are two options for choosing negative samples in cMLM: from a different sentence or from the same sentence as the positive sample. Experimental results show that sampling from a different sentence leads to better performance (1.2% and 0.9% accuracy gains on the two tasks, respectively). The reason is that a richer vocabulary helps the model learn more discriminative features. In addition, the different-sentence sampling strategy allows the model to learn features that are able to discriminate speakers.

| Method | Phoneme Acc | Speaker Acc |
|---|---|---|
| MFCC | 39.7 | 17.6 |
| CPC | 65.5 | 97.4 |
| Mockingjay | 64.3 | 96.1 |
| TERA-base (3xT) | 65.1 | 99.2 |
| TERA-medium (6xT) | 65.9 | - |
| MGF (6xT) | 73.4 | 100.0 |

Table 3: Comparison of self-supervised speech representation learning methods on LibriSpeech phoneme and speaker classification tasks under linear evaluation. nxT denotes n Transformer blocks.
We compare MGF with CPC [van den Oord et al., 2018], Mockingjay [Liu et al., 2020b], and TERA [Liu et al., 2020a] on the phoneme classification and speaker classification tasks. All systems use the same setup as specified in CPC. Results are shown in Table 3.

CPC [van den Oord et al., 2018] is a discriminative approach. We have pointed out earlier that it is not appropriate in CPC to distinguish positive and negative samples only based on their distance to the anchor. In contrast, MGF uses a masked model: whether the input crop is masked or unmasked, features at the same location always form a positive pair. In addition, MGF is a multi-granularity framework and uses a bidirectional model. These advances explain why MGF outperforms CPC by a large margin on both tasks. Note that the evaluated MGF model has 12G FLOPs, which is a bit heavier than CPC's 9.7G FLOPs.

Mockingjay [Liu et al., 2020b] and TERA [Liu et al., 2020a] are generative approaches which try to predict acoustic frames from their manipulated versions. Mockingjay only uses temporal alteration, and TERA extends it by adding channel alteration and magnitude alteration. In the previous section, we tried a similar generative loss and found that a discriminative loss works better. Table 3 shows that MGF obtains 9.1%/8.3% accuracy gains on phoneme classification and 3.9%/0.8% accuracy gains on speaker classification over Mockingjay and TERA, respectively.
One-Shot Speaker Classification
We additionally evaluate the MGF representation against two well-known feature embeddings, namely d-vector [Wan et al., 2018] and x-vector [Snyder et al., 2018], on the one-shot speaker classification task. d-vector is learned via a generalized end-to-end loss, which is similar to the triplet loss. x-vector is learned via a speaker recognition task using a time-delay DNN; in particular, x-vector is recognized as the state-of-the-art embedding for speaker classification tasks. d-vector and x-vector are both pre-trained on the Switchboard [Godfrey et al., 1992] and NIST SRE [Doddington et al., 2000] datasets, and x-vector is implemented officially via Kaldi's V2 recipe [Povey et al., 2011].

We use the same linear evaluation protocol for all methods to ensure a fair comparison. d-vector achieves 77.8% accuracy and x-vector achieves 79.6% accuracy. As a comparison, MGF achieves 82.7% accuracy, which reduces the relative error by 15.2% compared with the best counterpart.

| Method | nov93dev LER | nov93dev WER | nov92 LER | nov92 WER |
|---|---|---|---|---|
| Baseline | 3.50 | 8.57 | 2.09 | 5.42 |
| wav2vec-large | 2.91 | 7.24 | 1.64 | 4.48 |
| MGF-960 | 3.07 | 7.58 | 1.78 | 4.87 |

Table 4: WSJ speech recognition results.

| Method | Modality | Emotion Acc |
|---|---|---|
| M3ER | AVT | 82.7 |
| CNN LSTM | A | 68.8 |
| CNN GRU-SeqCap | A | 72.7 |
| MGF-Scratch | A | 64.1 |
| MGF-Fixed | A | 71.2 |
| MGF-Finetune | A | 73.1 |

Table 5: IEMOCAP emotion classification accuracies of different methods. A, V, T are short for audio, video, and text, respectively.
Speech Recognition
We evaluate the performance of an ASR system built on top of the MGF representation. The baseline method uses 80 log-mel filterbank coefficients with a 25 ms sliding window and 10 ms stride. We also compare MGF with a well-known ASR-oriented self-supervised method, wav2vec [Schneider et al., 2019], using their released wav2vec-large checkpoint (https://github.com/pytorch/fairseq/tree/master/examples/wav2vec). As wav2vec is pre-trained on the entire 960 hours of the LibriSpeech training set, we train the model MGF-960 with the same amount of training data to ensure a fair comparison. MGF-960 has the same architecture and model size as our base model. As shown in Table 4, MGF-960 achieves a significantly lower WER than the baseline and a WER comparable to wav2vec-large. This means that, without bells and whistles, our general-purpose speech representation can benefit the speech recognition task.

Emotion Classification
We present experimental results of emotion classification on the IEMOCAP dataset. We use this task to evaluate the adaptation capability of the MGF representation. Since this task has never been evaluated by previous self-supervised representation learning methods, we compare MGF with state-of-the-art supervised methods instead. M3ER [Mittal et al., 2020] uses text, audio, and video to predict the speaker's emotion. CNN GRU-SeqCap [Wu et al., 2019], CNN LSTM [Satt et al., 2017], and our MGF only use audio. We consider three different settings for MGF: MGF-Scratch has no pre-training stage, while MGF-Fixed and MGF-Finetune are pre-trained. The base encoder of MGF is not trainable in "Fixed" and trainable in "Finetune". As shown in Table 5, MGF-Scratch does not work well, but MGF-Fixed and MGF-Finetune both achieve high accuracy, showing that pre-training helps a lot. MGF-Finetune even sets a new state of the art among audio-only methods.

Figure 3: Comparison of representations by phoneme classification accuracy across different amounts of labeled data.
Last but not least, we show how pre-training helps in low-resource scenarios where human labels are scarce. We use the LibriSpeech phoneme classification task and reduce the labeled data usage from 100% to 0.1%. The performance of MGF in the different settings is plotted in Figure 3. To compare fairly with the model without pre-training, we open up the entire model for fine-tuning. We find that pre-trained MGF outperforms its train-from-scratch counterpart by a large margin when the amount of labeled data is less than one hour. In an extreme low-resource scenario where only six minutes of labeled data are available, the pre-trained model still achieves a reasonably good phoneme accuracy of 72.3%, while the train-from-scratch model only achieves 34.9%.
We have proposed a multi-granularity framework for self-supervised speech representation learning. By taking the speech hierarchy into consideration, MGF achieves top performance among existing speech pre-training methods on a collection of speech tasks. Comprehensive ablation studies have been carried out to demonstrate the effectiveness of our design in MGF. In the future, we plan to expand this multi-granularity self-supervised framework to the image domain, which may benefit tasks that demand multi-scale features.

References

[Baevski et al., 2020a] Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. In ICLR, 2020.
[Baevski et al., 2020b] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS, 2020.
[Busso et al., 2008] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Evaluation, 2008.
[Chen et al., 2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[Chung and Glass, 2020] Yu-An Chung and James R. Glass. Improved speech representations with multi-target autoregressive predictive coding. In ACL, 2020.
[Chung et al., 2019] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James R. Glass. An unsupervised autoregressive model for speech representation learning. In INTERSPEECH, 2019.
[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019.
[Doddington et al., 2000] George R. Doddington, Mark A. Przybocki, Alvin F. Martin, and Douglas A. Reynolds. The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective. Speech Commun., 2000.
[Garofolo et al., 1993] John Garofolo, David Graff, Doug Paul, and David Pallett. CSR-I (WSJ0) complete LDC93S6A. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
[Godfrey et al., 1992] John J. Godfrey, Edward Holliman, and Jane McDaniel. SWITCHBOARD: telephone speech corpus for research and development. In ICASSP, 1992.
[He et al., 2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[Heafield et al., 2013] Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. Scalable modified Kneser-Ney language model estimation. In ACL (2), 2013.
[Likhomanenko et al., 2019] Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. Who needs words? Lexicon-free speech recognition. In INTERSPEECH, 2019.
[Ling et al., 2020] Shaoshi Ling, Yuzong Liu, Julian Salazar, and Katrin Kirchhoff. Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP, 2020.
[Liu et al., 2020a] Andy T. Liu, Shang-wen Li, and Hung-yi Lee. TERA: self-supervised learning of transformer encoder representation for speech. CoRR, abs/2007.06028, 2020.
[Liu et al., 2020b] Andy T. Liu, Shu-Wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP, 2020.
[Mittal et al., 2020] Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues. In AAAI, 2020.
[Panayotov et al., 2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In ICASSP, 2015.
[Pascual et al., 2019] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. In INTERSPEECH, 2019.
[Peters et al., 2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
[Povey et al., 2011] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In ASRU, 2011.
[Pratap et al., 2019] Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. Wav2letter++: A fast open-source speech recognition system. In ICASSP, 2019.
[Ravanelli et al., 2020] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio. Multi-task self-supervised learning for robust speech recognition. In ICASSP, 2020.
[Reddy et al., 2020] Chandan K. A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, and Johannes Gehrke. The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. CoRR, abs/2005.13981, 2020.
[Roux et al., 2019] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey. SDR - half-baked or well done? In ICASSP, 2019.
[Satt et al., 2017] Aharon Satt, Shai Rozenberg, and Ron Hoory. Efficient emotion recognition from speech using deep learning on spectrograms. In INTERSPEECH, 2017.
[Schneider et al., 2019] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. In INTERSPEECH, 2019.
[Snyder et al., 2018] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In ICASSP, 2018.
[van den Oord et al., 2018] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[Wan et al., 2018] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez-Moreno. Generalized end-to-end loss for speaker verification. In ICASSP, 2018.
[Wu et al., 2019] Xixin Wu, Songxiang Liu, Yuewen Cao, Xu Li, Jianwei Yu, Dongyang Dai, Xi Ma, Shoukang Hu, Zhiyong Wu, Xunying Liu, and Helen Meng. Speech emotion recognition using capsule networks. In ICASSP, 2019.
[Zhao et al., 2020] Yucheng Zhao, Chong Luo, Zheng-Jun Zha, and Wenjun Zeng. Multi-scale group transformer for long sequence modeling in speech separation. In