General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework
Yucheng Zhao, Dacheng Yin, Chong Luo, Zhiyuan Zhao, Chuanxin Tang, Wenjun Zeng, Zheng-Jun Zha
University of Science and Technology of China, Microsoft Research Asia
{lnc, ydc}@mail.ustc.edu.cn, {cluo, zhiyzh, chutan, wezeng}@microsoft.com, [email protected]

Abstract
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning. In the design of MGF, the speech hierarchy is taken into consideration. Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales. For phoneme-scale learning, we borrow the idea from the masked language model but tailor it for the continuous speech signal by replacing the classification loss with a contrastive loss. We corroborate our design by evaluating the MGF representation on various downstream tasks, including phoneme classification, speaker classification, speech recognition, and emotion classification. Experiments verify that training at different time scales needs different training targets and loss functions, which in general complement each other and lead to better performance.
Introduction

Unsupervised pre-training, or representation learning, has drawn wide interest in both academia and industry. The BERT model [Devlin et al., 2019] has become a universal feature extractor for solving a wide range of natural language processing (NLP) tasks. Recently, it has been reported that image embeddings learned in an unsupervised manner achieve performance comparable to their supervised counterparts on the image classification task [He et al., 2020; Chen et al., 2020]. Most contemporary unsupervised pre-training methods adopt the self-supervised learning approach; we use these two terms interchangeably in this paper to refer to methods that do not need human annotation.

In the speech domain, pre-training is not a new concept. The speaker recognition task depends heavily on a supervised pre-training step to obtain a good feature embedding. Recently, self-supervised learning has also been used to pre-train dedicated models for automatic speech recognition (ASR) [Schneider et al., 2019; Baevski et al., 2020a; Baevski et al., 2020b; Ling et al., 2020]. In this work, however, we do not focus on such task-oriented pre-training.
[Figure 1 shows the waveform of the utterance "The meeting is now adjourned", annotated with frames (10 ms), phonemes such as /jh/ (50 ms) and /m/ (80 ms), and the whole sentence (1446 ms).]
Figure 1: Speech hierarchy. The waveform is sampled at 16 kHz. Sample points within a 10 ms segment form a frame, which is the basic operating unit in many speech algorithms. Phonemic information can be extracted from several frames, as we illustrate with red boxes. The whole sentence lasts for more than one second.
Instead, we aim to pre-train a general-purpose feature extractor which embeds a speech signal into a feature representation that can be used for a variety of downstream speech tasks, in a way similar to how pre-trained language and image representations are used in their respective domains.

The main difficulty in learning a general-purpose speech representation is that speech carries a complex hierarchical structure (samples, phonemes, and sentences) which contains relevant information at different time scales [Pascual et al., 2019]. In this work, we propose a Multi-Granularity Framework, named MGF, to train the model at multiple time scales. A key innovation in MGF is to adopt different learning approaches at different time scales. In particular, we use generative approaches to capture fine-grained information at small time scales on the order of a few milliseconds, and we adopt discriminative approaches to distill semantic information at large time scales corresponding to a phoneme and a sentence. In order to realize phoneme-level contrastive learning, we extend the token-oriented masked language model (MLM) [Devlin et al., 2019] to a continuous masked language model (cMLM) that accommodates continuous speech signals without token boundaries. MGF is implemented with a deep bidirectional Transformer [Vaswani et al., 2017; Devlin et al., 2019].

We evaluate the MGF representation on multiple downstream tasks and benchmark datasets, which is the second main contribution of our work. The performance of MGF is first evaluated on phoneme classification and speaker classification tasks, following other general-purpose speech representation learning work [van den Oord et al., 2018; Liu et al., 2020b]. We find that the features learned by MGF are very powerful on these two orthogonal tasks. On the LibriSpeech dataset, the MGF representation achieves a phoneme classification accuracy of 73.4% under linear evaluation, surpassing existing unsupervised pre-training methods by a large margin. On the speaker classification task, the MGF representation is the first to achieve an accuracy of 100%.

We further evaluate MGF on three other downstream tasks. First, in view of the saturated performance in speaker classification, we propose a new and harder task named one-shot speaker classification, where only one utterance per speaker is provided in the fine-tuning stage. On this task, MGF is evaluated against the well-known x-vector and d-vector and is shown to achieve better performance. Second, we compare MGF with a task-specific pre-training model, wav2vec, on the ASR task. Third, we test the MGF representation on the IEMOCAP emotion classification task. Surprisingly, simply appending a fully-connected layer after MGF achieves the top performance among all existing audio-based approaches.

Related Work

There are two camps of self-supervised learning approaches, namely discriminative and generative approaches. We first review these two approaches for speech pre-training, and then discuss other related work that motivates MGF.
Discriminative approaches acquire the supervision signal from the contrastive distance between a selected positive sample and several negative samples. By carefully designing the training target and the data sampling procedure, samples can be automatically labelled.

Contrastive predictive coding (CPC) [van den Oord et al., 2018] is a contrastive learning method based on predicting the future in the latent space. The representations of temporally nearby segments are treated as positive samples while those of temporally distant segments are treated as negative samples. However, one can easily find a counter-example in speech processing. For example, a word may appear twice in an utterance with the same meaning. When the first appearance is the anchor, the second appearance should not be treated as a negative sample no matter how far away it is. Previous work [Chung et al., 2019] also notices that the choice of negative samples in CPC has a huge effect on its performance on the phoneme classification task.

While CPC itself is a general-purpose speech pre-training method, it can be leveraged in some task-specific pre-training models, such as wav2vec [Schneider et al., 2019], vq-wav2vec [Baevski et al., 2020a], and wav2vec 2.0 [Baevski et al., 2020b]. Vq-wav2vec proposes a quantization algorithm so that wav2vec (which adopts CPC) can be combined with the BERT model [Devlin et al., 2019] to achieve better performance. Wav2vec 2.0 improves vq-wav2vec by training the entire model end-to-end. It also uses a very large unlabelled dataset for pre-training. These task-specific pre-training models are very powerful on their target task, but perform poorly on other speech tasks.
Generative approaches learn to reconstruct the signal in the input space or features in some latent space. Training is supervised by a reconstruction loss. Autoregressive predictive coding (APC) [Chung et al., 2019] uses an autoregressive model to encode the history and predict the future. A follow-up work [Chung and Glass, 2020] adds an auxiliary objective which encourages the model to additionally remember the past. DeCoAR [Ling et al., 2020] borrows the bidirectional learning idea from ELMo [Peters et al., 2018] so that it can learn deep contextualized acoustic representations for semi-supervised speech recognition.

Inspired by the MLM proposed in BERT [Devlin et al., 2019], recent works [Liu et al., 2020b; Liu et al., 2020a] have explored using BERT-style objectives in speech pre-training. In Mockingjay [Liu et al., 2020b], part of the input frames are masked to zeros and the pre-trained encoder is required to predict each masked frame from its neighborhood. TERA [Liu et al., 2020a] extends Mockingjay by introducing channel alteration and magnitude alteration.
PASE [Pascual et al., 2019] uses multiple regressors and discriminators to learn a problem-agnostic speech encoder. A follow-up work, PASE+ [Ravanelli et al., 2020], improves PASE for robust speech recognition in noisy and reverberant environments by introducing data augmentation, more regression tasks, and a collection of architecture modifications.

Our work and PASE both consider combinations of generative and discriminative objectives. However, PASE does not consider the speech hierarchy. In our work, different objectives are used to handle signals at different time scales.
Our work is also inspired by some self-supervised learning methods in other domains. BERT [Devlin et al., 2019] is a milestone work for pre-training in NLP. The core of BERT is the MLM, where some input tokens are randomly masked out and the training objective is to predict the vocabulary ID of each masked word based only on its context. SimCLR [Chen et al., 2020] proposes a simple contrastive learning framework for visual representation learning. It adopts a contrastive loss between augmented views of an image without relying on specialized architecture designs or a memory bank mechanism. BERT and SimCLR inspired our phoneme-scale and sentence-scale contrastive learning, respectively.
The main idea behind MGF is to extract the information attached to each level of the speech hierarchy through multi-granularity objectives. At the finest granularity, we adopt generative approaches that reconstruct the original waveform and selected hand-crafted features to extract sample-scale and frame-scale information, respectively. At a coarser granularity, we design a novel continuous masked language model (cMLM) which masks several consecutive frames with the typical phoneme length. The model is trained to estimate the feature embedding of one of the masked frames based on the context information. We do not expect the model to recover the exact features, but we hope that the estimated feature for the masked frame is close to the ground-truth feature and far away from the features of other frames in different phonemes. At the sentence level, learning encourages segments within the same sentence to have close representations and segments across different sentences to have representations that are far apart.

Figure 2: A sketch of the multi-granularity framework for self-supervised speech representation learning. Different heads are not weight-sharing. Recon.: Reconstructed. Aug.: Augmentation. Wave.: Waveform.
Learning of sample-scale and frame-scale information adopts a reconstruction loss, while learning of phoneme-scale and sentence-scale information adopts a contrastive loss.
Sample-Scale Loss
As Fig. 2 shows, a decoder is appended to the base module to reconstruct the original signal from the feature embedding. Let $x$ denote the original signal and $\bar{x}$ denote the reconstructed signal. The sample-scale loss is implemented by the scale-invariant signal-to-distortion ratio (SI-SDR) [Roux et al., 2019] loss, which is formulated as:

$$\mathcal{L}_{\text{sample}} = -10 \log_{10} \left( \frac{\|\alpha x\|^2}{\|\alpha x - \bar{x}\|^2} \right), \quad \text{where } \alpha = \frac{\bar{x}^{T} x}{\|x\|^2}. \tag{1}$$

The SI-SDR loss is widely used in speech separation [Zhao et al., 2020] and we empirically find it works better than the L1 loss.
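For concreteness, a minimal PyTorch sketch of Eq. (1) is given below. The function name, batching convention, and epsilon guard are our own choices rather than the released MGF implementation; in practice the mean is often removed from both signals before computing SI-SDR, which this sketch omits.

```python
import torch

def si_sdr_loss(x: torch.Tensor, x_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR between reference x and reconstruction x_hat, both (batch, time)."""
    # Optimal scaling of the reference: alpha = (x_hat^T x) / ||x||^2, as in Eq. (1).
    alpha = (x_hat * x).sum(dim=-1, keepdim=True) / (x.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = alpha * x                    # scaled reference
    noise = target - x_hat                # residual distortion
    si_sdr = 10 * torch.log10(target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_sdr.mean()                 # loss = negative SI-SDR, averaged over the batch
```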
Frame-Scale Loss

Several heads, each composed of two convolutional (conv) layers, are appended to the base model to generate selected hand-crafted features frame by frame. These features, including the log power spectrum (LPS), mel-frequency cepstral coefficients (MFCC), and feature-space maximum likelihood linear regression (fMLLR) features, have been proven effective in speech-related tasks. Following PASE+, we compute LPS and MFCC with context windows of both 25 ms and 400 ms. The L2 loss is used as the optimization target and is formulated as:

$$\mathcal{L}_{\text{frame}} = \mathbb{E}_{u \in \mathcal{U}} \sum_{h \in \mathcal{H}} \omega_h \, \| u_h - \hat{u}_h \|^2, \tag{2}$$

where $\mathcal{U}$ is the set of unmasked frames, $\mathcal{H}$ is the set of hand-crafted feature indicators, $\omega_h$ is the weight of feature $h$ ($h \in \mathcal{H}$), and $u_h$, $\hat{u}_h$ are the ground truth and the estimate of feature $h$ for frame $u$, respectively.
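A minimal sketch of Eq. (2), assuming each head's output and target have already been gathered for the unmasked frames; the dictionary-based interface and the feature names are illustrative, not the actual MGF code.

```python
import torch

def frame_loss(predictions: dict, targets: dict, weights: dict) -> torch.Tensor:
    """Weighted L2 loss over hand-crafted feature heads (Eq. 2).

    predictions/targets map a feature name (e.g. 'lps', 'mfcc', 'fmllr') to a
    (batch, frames, dims) tensor; only unmasked frames should be passed in.
    """
    loss = torch.zeros(())
    for name, pred in predictions.items():
        # Mean squared error between the estimated and ground-truth features of this head.
        loss = loss + weights[name] * (pred - targets[name]).pow(2).mean()
    return loss
```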
Phoneme-Scale Loss

A speech segment that is tens to hundreds of milliseconds long contains distinguishable phonemes. We believe that learning distinguishable high-level semantic information is more important than waveform or feature reconstruction at this time scale.

Phoneme-scale information is learned by our proposed continuous MLM and a discriminative learning objective. In the vanilla MLM, discrete tokens from a finite dictionary are masked in the language input and then predicted at the output based on their context. However, speech is a continuous signal without token boundaries, so it is not possible to precisely mask pre-defined phonemes. To address this challenge, we randomly mask fixed-length speech segments and use the InfoNCE loss [van den Oord et al., 2018] to evaluate the quality of the estimated features of a masked frame. In our implementation, each masked segment has a length of 140 ms, and the total length of the masked segments does not exceed 20% of the input speech crop. Each masked segment is replaced by non-speech noise, which we empirically find to be a better choice than a segment of zeros or a random speech segment.

The InfoNCE loss directly operates on real-valued feature vectors and is formulated as:

$$\mathcal{L}_{\text{phoneme}} = -\mathbb{E}_{v \in \mathcal{V}} \log \frac{\exp(v^{T}\hat{v}/\tau)}{\sum_{k=1}^{K} \exp(v^{T} v_k/\tau) + \exp(v^{T}\hat{v}/\tau)}, \tag{3}$$

where $\mathcal{V}$ is the set of masked frames, $v$ is the anchor sample, $\hat{v}$ is the positive sample, and $v_k$ ($k = 1, \dots, K$) are negative samples. $\tau$ is a temperature parameter.

Note that each sample is the feature representation of a single frame whose duration is 10 ms. The anchor sample is drawn from the MGF representation of the masked crop, while the positive sample and negative samples are drawn from the feature representation of the original unmasked crop. In other words, the anchor sample is the estimated feature, while the positive and negative samples are ground-truth features at the same or different locations, respectively.
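The loss in Eq. (3) reduces to a cross-entropy over one positive and K negative similarities. Below is a hedged PyTorch sketch; the tensor layout, the default temperature, and the function name are our assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def phoneme_infonce(anchor: torch.Tensor, positive: torch.Tensor,
                    negatives: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE loss of Eq. (3) for a batch of masked frames.

    anchor:    (batch, dim)     feature estimated at a masked frame
    positive:  (batch, dim)     ground-truth feature at the same frame (unmasked crop)
    negatives: (batch, K, dim)  ground-truth features at other frames
    """
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / tau    # (batch, 1)
    neg_logits = torch.einsum('bd,bkd->bk', anchor, negatives) / tau   # (batch, K)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)                # (batch, 1 + K)
    # The positive sample sits at index 0, so InfoNCE becomes a standard cross-entropy.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```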
Sentence-Scale Loss

The sentence-scale loss focuses on capturing semantic information relevant to a long time scale. We adopt the SimCLR framework [Chen et al., 2020], which was proposed for visual representation learning, to implement this idea, with some small modifications. First, since cropping is useful only in computing the sentence-scale loss, we crop two segments of a sentence before applying other data augmentations; one segment is then treated as the original crop for all the other objectives. Second, we use data augmentations specific to speech signals, namely temporal masking and additive noise.

The sentence-scale loss is defined as:

$$\mathcal{L}_{\text{sentence}} = -\log \frac{\exp(z_i^{T} z_j/\tau)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq j]} \exp(z_i^{T} z_k/\tau)}, \tag{4}$$

where $z_i$ is the anchor sample, $z_j$ is the positive sample, $z_k$ ($k \neq j$) are negative samples, $N$ is the batch size, and $\tau$ is a temperature parameter. Each sample is obtained by averaging the MGF representation of all frames within a 2s-long speech crop and passing the averaged feature through a head of two conv layers. The positive sample corresponds to a speech crop from the same sentence as the anchor crop, while negative samples correspond to crops from other sentences. Following SimCLR, we draw negatives from the mini-batch so that there are N − 1 negative samples for each anchor.
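A sketch of Eq. (4), under the assumption that the two crops of each sentence in a mini-batch are arranged row-aligned in two tensors; following SimCLR we also L2-normalize the projected features (cosine similarity), which Eq. (4) does not state explicitly, and the default temperature is a placeholder.

```python
import torch
import torch.nn.functional as F

def sentence_loss(z_anchor: torch.Tensor, z_positive: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Sentence-scale contrastive loss of Eq. (4).

    z_anchor, z_positive: (N, dim) projected, mean-pooled crop features; row i of both
    tensors comes from the same sentence, all other rows of z_positive act as negatives.
    """
    z_anchor = F.normalize(z_anchor, dim=-1)     # cosine similarity, as in SimCLR
    z_positive = F.normalize(z_positive, dim=-1)
    sim = z_anchor @ z_positive.t() / tau        # (N, N) similarity matrix
    pos = sim.diag()                             # z_i^T z_j / tau for the positive pairs
    # Exclude the positive pair from the denominator, as the indicator in Eq. (4) does.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_only = sim.masked_fill(eye, float("-inf"))
    return (torch.logsumexp(neg_only, dim=1) - pos).mean()
```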
Multi-Granularity Objectives

We train the base encoder by combining the multi-granularity objectives. The total loss function is defined as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{sample}} + \lambda_2 \mathcal{L}_{\text{frame}} + \lambda_3 \mathcal{L}_{\text{phoneme}} + \lambda_4 \mathcal{L}_{\text{sentence}}, \tag{5}$$

where $\lambda_i$, $i = 1, 2, 3, 4$, are the weights of the individual losses. We tune each weight independently over a small set of candidate values.

MGF is implemented in PyTorch. The base encoder is composed of three conv layers and six Transformer blocks. The first conv layer has a kernel size of 320 and implements 512 filters with stride 160 and padding 80. The second conv layer has a kernel size of 1 and is followed by ReLU activation. These two conv layers serve as a stem which transforms the time-domain waveform into a compact feature vector, so that the Transformer does not need to handle very long, low-level input. The third conv layer aligns the dimension with the subsequent Transformer. In addition to the base encoder, the decoder for sample-level reconstruction uses four Transformer blocks. All Transformers share the same configuration, with hidden size $d_{\text{model}} = 768$, feed-forward size $d_{\text{ff}} = 3072$, and $h = 12$ attention heads.

In most of our experiments, we use the train-clean-100 subset of the LibriSpeech corpus [Panayotov et al., 2015], which contains 100 hours of speech data, as the pre-training dataset. We use the dev-clean subset as the validation dataset for model selection. In the pre-training stage, we use the raw signal only and ignore any human labels such as speaker IDs or transcriptions. Some of the MGF objectives rely on data augmentation with additive noise; for this purpose, we use the DNS challenge dataset [Reddy et al., 2020], which contains 70k clips of non-speech noise.

We evaluate the MGF representation on a wide range of downstream tasks. Except for the speech recognition task, linear evaluation is adopted, where the pre-trained model is appended with only one learnable layer. Classification accuracy is used as the evaluation metric. A majority of general-purpose speech pre-training works are evaluated on phoneme and speaker classification tasks. We use these two tasks for ablation studies in addition to the system comparison with methods in the same category. We also evaluate MGF on other downstream tasks to show its generalization capability.
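To make the stem description above concrete, the following sketch instantiates the three conv layers with the sizes given in the text; the module name, the absence of normalization layers, and other details are our guesses rather than the released implementation.

```python
import torch
import torch.nn as nn

class MGFStem(nn.Module):
    """Sketch of the convolutional stem in front of the Transformer encoder."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # 16 kHz waveform -> one 512-dim vector per 10 ms frame (stride of 160 samples).
        self.conv1 = nn.Conv1d(1, 512, kernel_size=320, stride=160, padding=80)
        self.conv2 = nn.Conv1d(512, 512, kernel_size=1)      # point-wise conv, followed by ReLU
        self.proj = nn.Conv1d(512, d_model, kernel_size=1)   # align dimension with the Transformer

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, samples) raw audio
        h = self.conv1(wave.unsqueeze(1))
        h = torch.relu(self.conv2(h))
        h = self.proj(h)
        return h.transpose(1, 2)  # (batch, frames, d_model) for the Transformer blocks
```

With a 16 kHz input, the stride of 160 samples makes each output vector correspond to one 10 ms frame, matching the frame duration assumed by the frame-scale and phoneme-scale losses.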
Phoneme classification: We follow the setup in CPC [van den Oord et al., 2018], using 41 phoneme classes and the train-clean-100 subset of LibriSpeech for both training and testing. For a fair comparison, we use the aligned phone labels and train/test split provided by CPC.
Speaker classification: Following CPC, we first use the LibriSpeech train-clean-100 subset containing 251 speakers for this task. Using the same train/test split provided by CPC, the proposed MGF achieves a classification accuracy of 100%. In view of this saturated performance, we propose a new setting called one-shot speaker classification. We use the train-clean-360 subset, which contains 921 speakers. For each speaker, we put only one utterance into the training set and sample 20% of the remaining utterances into the test set. This creates 3.2 hours of training data and 72 hours of testing data.
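A minimal sketch of how such a one-shot split can be constructed from a list of (speaker, utterance) pairs; the function, its arguments, and the fixed random seed are our own illustration, not the exact split used in the paper.

```python
import random
from collections import defaultdict

def one_shot_split(utterances, test_fraction=0.2, seed=0):
    """Build a one-shot speaker classification split.

    utterances: list of (speaker_id, utterance_path) pairs, e.g. from train-clean-360.
    Returns (train, test): one utterance per speaker for training, and roughly
    `test_fraction` of each speaker's remaining utterances for testing.
    """
    by_speaker = defaultdict(list)
    for speaker, utt in utterances:
        by_speaker[speaker].append(utt)

    rng = random.Random(seed)
    train, test = [], []
    for speaker, utts in by_speaker.items():
        rng.shuffle(utts)
        train.append((speaker, utts[0]))              # exactly one training utterance
        rest = utts[1:]
        n_test = int(round(test_fraction * len(rest)))
        test.extend((speaker, u) for u in rest[:n_test])
    return train, test
```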
Emotion classification: We use the interactive emotional dyadic motion capture (IEMOCAP) dataset [Busso et al., 2008] for this task. The corpus consists of five sessions with two speakers in each session. Following the usage in [Wu et al., 2019], we evaluate the MGF representation on four emotions, namely Neutral, Angry, Happy, and Sad. The IEMOCAP dataset contains scripted data and improvised data, and we only use the latter. We report results of five-fold cross-validation, using four sessions for training and the remaining session for validation and testing.
Speech Recognition: We use the Wall Street Journal (WSJ) dataset [Garofolo et al., 1993] for the ASR task. The corpus comprises about 81 hours of transcribed audio data. We train on si284, validate on nov93dev, and test on nov92. We use the lexicon-free acoustic model [Likhomanenko et al., 2019] and a 4-gram KenLM [Heafield et al., 2013] language model, both implemented with wav2letter++ [Pratap et al., 2019]. Word error rate (WER) and letter error rate (LER) are used as evaluation metrics. We use the training recipe provided by wav2letter++ and only modify the input embedding.

For pre-training and the classification downstream tasks, we use the Adam optimizer with warm-up to update the model. The learning rate, number of warm-up steps, and batch size are set separately for pre-training, phoneme classification, speaker classification, speaker verification, and emotion classification training. We also decay the learning rate exponentially with an exponent of 0.3. We use 4 V100 GPUs for both pre-training and downstream task fine-tuning. The total number of training epochs for pre-training is set to 300; more epochs yield slight improvement. For all downstream task fine-tuning, we set the total number of training epochs to 100, except for the data-efficiency experiments.

We first present an ablation study of MGF to show the effectiveness of the multi-granularity objectives and cMLM. We report accuracies on the phoneme classification and one-shot speaker classification tasks.
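As a rough illustration of the optimization setup, the sketch below combines Adam with a linear warm-up followed by a decay; the base learning rate, warm-up length, and the exact form of the decay are placeholders, since the description above does not specify them fully.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-4, warmup_steps=10000, decay_exponent=0.3):
    """Hedged sketch: Adam with linear warm-up, then a power-law decay.

    base_lr and warmup_steps are placeholder values, and the post-warm-up decay is
    only one possible reading of decaying the learning rate with an exponent of 0.3.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                   # linear warm-up
        return (warmup_steps / (step + 1)) ** decay_exponent   # decay after warm-up

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```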
Multi-Granularity Framework
We study whether all four objectives at different time scales contribute to the final accuracy of MGF, and assess their respective importance on the two target problems. We train MGF four times, discarding one of the four loss objectives at a time. Results are presented in Table 1. The first row presents the results of the full model and the other rows present the results obtained when one objective is discarded.

| Objectives | Phoneme Acc | One-Shot Speaker Acc |
|---|---|---|
| MGF (Full) | 73.4 | 82.7 |
| - Sample | 73.3 (-0.1) | 82.2 (-0.5) |
| - Frame | 69.2 (-4.2) | 79.9 (-2.8) |
| - Phoneme | 66.1 (-7.3) | 81.3 (-1.4) |
| - Sentence | 73.4 (-0.0) | 81.8 (-0.9) |

Table 1: Classification accuracies on phoneme classification and one-shot speaker classification using the MGF representation. The first row gives results of the full model; the following rows report results when each objective is discarded. Numbers in parentheses show the accuracy drop.

The first finding is that every objective matters. Discarding any objective leads to a notable accuracy drop in at least one task. Second, while some objectives have a general impact on both tasks, others turn out to be more task-oriented. For example, the sample-scale loss and the frame-scale loss are generally helpful. This is consistent with our design, as these two losses learn low-level, problem-agnostic information. The frame-scale loss is especially important as it injects human prior knowledge into the model. The other two losses, however, are more task-dependent. The phoneme-scale loss has a remarkable impact on the phoneme classification task (+27.4% in relative error) and the sentence-scale loss only contributes to the one-shot speaker classification task. It is worth noting that the phoneme-scale loss also contributes a lot to the one-shot speaker classification task. This is caused by the sampling strategy used in cMLM, which will be described next.

| Approach | Phoneme Acc | One-Shot Speaker Acc |
|---|---|---|
| Discriminative | 73.4 | 82.7 |
| Generative | 66.1 | 81.3 |

Table 2: Comparison of discriminative and generative approaches for phoneme-scale training.

cMLM
MGF uses cMLM and adopts the InfoNCE loss for the phoneme-scale target. Previous work [Liu et al., 2020b] has investigated a similar masking approach but used a generative objective with a reconstruction loss at a similar time scale. We believe that our proposed discriminative approach with the InfoNCE loss is more suitable for this time scale. To validate this, we implemented a generative approach which calculates the L1 loss between the predicted frame and the ground-truth frame in the masked segment. The two rows in Table 2 show the results of using the discriminative (InfoNCE) loss and the generative (L1) loss, respectively. The model trained with the discriminative approach achieves a 7.3% accuracy gain on phoneme classification and a 1.4% accuracy gain on one-shot speaker classification, compared with the model trained with the generative approach. Coincidentally, the speaker classification accuracy achieved by the generative approach (81.3%) is the same as that of MGF without the phoneme-scale objective. In other words, using a generative approach for phoneme-scale learning does not help the speaker classification task at all.

It is worth noting that the sampling strategy in cMLM has a notable impact on the performance of the MGF representation. There are two options for choosing negative samples in cMLM: from a different sentence or from the same sentence as the positive sample. Experimental results show that sampling from a different sentence leads to better performance (1.2% and 0.9% accuracy gains on the two tasks, respectively). The reason is that a richer vocabulary helps the model learn more discriminative features. In addition, the different-sentence sampling strategy allows the model to learn features that are able to discriminate speakers.

| Method | Phoneme Acc | Speaker Acc |
|---|---|---|
| MFCC | 39.7 | 17.6 |
| CPC | 65.5 | 97.4 |
| Mockingjay | 64.3 | 96.1 |
| TERA-base (3xT) | 65.1 | 99.2 |
| TERA-medium (6xT) | 65.9 | - |
| MGF (6xT) | 73.4 | 100.0 |

Table 3: Comparison of self-supervised speech representation learning methods on LibriSpeech phoneme and speaker classification tasks under linear evaluation. nxT denotes n Transformer blocks.
We compare MGF with CPC [van den Oord et al., 2018], Mockingjay [Liu et al., 2020b], and TERA [Liu et al., 2020a] on the phoneme classification and speaker classification tasks. All systems use the same setup as specified in CPC. Results are shown in Table 3.

CPC [van den Oord et al., 2018] is a discriminative approach. We have pointed out earlier that it is not appropriate in CPC to distinguish positive and negative samples only based on their distance to the anchor. In contrast, MGF uses a masked model: whether the input crop is masked or unmasked, features at the same location always form a positive pair. In addition, MGF is a multi-granularity framework and uses a bidirectional model. These advances explain why MGF outperforms CPC by a large margin on both tasks. Note that the evaluated MGF model has 12G FLOPs, which is a bit heavier than CPC's 9.7G FLOPs.

Mockingjay [Liu et al., 2020b] and TERA [Liu et al., 2020a] are generative approaches which try to predict acoustic frames from their manipulated versions. Mockingjay only uses temporal alteration, and TERA extends it by adding channel alteration and magnitude alteration. In the previous section, we tried a similar generative loss and found that a discriminative loss works better. Table 3 shows that MGF obtains 9.1%/8.3% accuracy gains on phoneme classification and 3.9%/0.8% accuracy gains on speaker classification over Mockingjay and TERA, respectively.
One-Shot Speaker Classification
We additionally evaluate the MGF representation against two well-known feature embeddings, namely d-vector [Wan et al., 2018] and x-vector [Snyder et al., 2018], on the one-shot speaker classification task. d-vector is learned via a generalized end-to-end loss, which is similar to the triplet loss. x-vector is learned via a speaker recognition task using a time-delay DNN; in particular, x-vector is recognized as the state-of-the-art embedding for speaker classification tasks. d-vector and x-vector are both pre-trained on the Switchboard [Godfrey et al., 1992] and NIST SRE [Doddington et al., 2000] datasets, and x-vector is implemented officially via Kaldi's V2 recipe [Povey et al., 2011].

We use the same linear evaluation protocol for all methods to ensure a fair comparison. d-vector achieves 77.8% accuracy and x-vector achieves 79.6% accuracy. As a comparison, MGF achieves 82.7% accuracy, which reduces the relative error by 15.2% compared with the best counterpart.

| Method | nov93dev LER | nov93dev WER | nov92 LER | nov92 WER |
|---|---|---|---|---|
| Baseline | 3.50 | 8.57 | 2.09 | 5.42 |
| wav2vec-large | 2.91 | 7.24 | 1.64 | 4.48 |
| MGF-960 | 3.07 | 7.58 | 1.78 | 4.87 |

Table 4: WSJ speech recognition results.

| Method | Modality | Emotion Acc |
|---|---|---|
| M3ER | AVT | 82.7 |
| CNN LSTM | A | 68.8 |
| CNN GRU-SeqCap | A | 72.7 |
| MGF-Scratch | A | 64.1 |
| MGF-Fixed | A | 71.2 |
| MGF-Finetune | A | 73.1 |

Table 5: IEMOCAP emotion classification accuracies of different methods. A, V, T are short for audio, video, and text, respectively.
Speech Recognition
We evaluate the performance of an ASR system built on top of the MGF representation. The baseline method uses 80 log-mel filterbank coefficients with a 25 ms sliding window and 10 ms stride. We also compare MGF with a well-known ASR-oriented self-supervised method, wav2vec [Schneider et al., 2019], using their released wav2vec-large checkpoint (https://github.com/pytorch/fairseq/tree/master/examples/wav2vec). As wav2vec is pre-trained on the entire 960 hours of the LibriSpeech training set, we train the model MGF-960 with the same amount of training data to ensure a fair comparison. MGF-960 has the same architecture and model size as our base model. As shown in Table 4, MGF-960 achieves a significantly lower WER than the baseline and a WER comparable to wav2vec-large. This means that, without bells and whistles, our general-purpose speech representation can benefit the speech recognition task.

Emotion Classification
We present experimental results of emotion classification on the IEMOCAP dataset. We use this task to evaluate the adaptation capability of the MGF representation. Since this task has never been evaluated by previous self-supervised representation learning methods, we compare MGF with state-of-the-art supervised methods instead. M3ER [Mittal et al., 2020] uses text, audio, and video to predict the speaker's emotion. CNN GRU-SeqCap [Wu et al., 2019], CNN LSTM [Satt et al., 2017], and our MGF only use audio. We consider three different settings for MGF: MGF-Scratch has no pre-training stage, while MGF-Fixed and MGF-Finetune are pre-trained. The base encoder of MGF is not trainable in "Fixed" and trainable in "Finetune". As shown in Table 5, MGF-Scratch does not work well, but MGF-Fixed and MGF-Finetune both achieve high accuracy, showing that pre-training helps a lot. MGF-Finetune even sets a new state of the art among audio-only methods.

Figure 3: Comparison of representations by phoneme classification accuracy across different amounts of labeled data.
Last but not least, we show how pre-training helps in low-resource scenarios where human labels are scarce. We use the LibriSpeech phoneme classification task and reduce the labeled data usage from 100% to 0.1%. The performance of MGF in the different settings is plotted in Figure 3. To compare fairly with the model without pre-training, we open up the entire model for fine-tuning. We find that pre-trained MGF outperforms its train-from-scratch counterpart by a large margin when the amount of labeled data is less than one hour. In an extreme low-resource scenario where only six minutes of labeled data are available, the pre-trained model still achieves a reasonably good phoneme accuracy of 72.3%, while the train-from-scratch model only achieves 34.9%.
We have proposed a multi-granularity framework for self-supervised speech representation learning. By taking the speech hierarchy into consideration, MGF achieves top performance among existing speech pre-training methods on a collection of speech tasks. Comprehensive ablation studies have been carried out to demonstrate the effectiveness of our design in MGF. In the future, we plan to expand this multi-granularity self-supervised framework to the image domain, which may benefit tasks that demand multi-scale features.

References

[Baevski et al., 2020a] Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. In ICLR, 2020.
[Baevski et al., 2020b] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS, 2020.
[Busso et al., 2008] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Evaluation, 2008.
[Chen et al., 2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[Chung and Glass, 2020] Yu-An Chung and James R. Glass. Improved speech representations with multi-target autoregressive predictive coding. In ACL, 2020.
[Chung et al., 2019] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James R. Glass. An unsupervised autoregressive model for speech representation learning. In INTERSPEECH, 2019.
[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019.
[Doddington et al., 2000] George R. Doddington, Mark A. Przybocki, Alvin F. Martin, and Douglas A. Reynolds. The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective. Speech Commun., 2000.
[Garofolo et al., 1993] John Garofolo, David Graff, Doug Paul, and David Pallett. CSR-I (WSJ0) complete LDC93S6A. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
[Godfrey et al., 1992] John J. Godfrey, Edward Holliman, and Jane McDaniel. SWITCHBOARD: telephone speech corpus for research and development. In ICASSP, 1992.
[He et al., 2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[Heafield et al., 2013] Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. Scalable modified Kneser-Ney language model estimation. In ACL (2), 2013.
[Likhomanenko et al., 2019] Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. Who needs words? Lexicon-free speech recognition. In INTERSPEECH, 2019.
[Ling et al., 2020] Shaoshi Ling, Yuzong Liu, Julian Salazar, and Katrin Kirchhoff. Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP, 2020.
[Liu et al., 2020a] Andy T. Liu, Shang-wen Li, and Hung-yi Lee. TERA: self-supervised learning of transformer encoder representation for speech. CoRR, abs/2007.06028, 2020.
[Liu et al., 2020b] Andy T. Liu, Shu-Wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP, 2020.
[Mittal et al., 2020] Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues. In AAAI, 2020.
[Panayotov et al., 2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In ICASSP, 2015.
[Pascual et al., 2019] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. In INTERSPEECH, 2019.
[Peters et al., 2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
[Povey et al., 2011] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In ASRU, 2011.
[Pratap et al., 2019] Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. Wav2letter++: A fast open-source speech recognition system. In ICASSP, 2019.
[Ravanelli et al., 2020] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio. Multi-task self-supervised learning for robust speech recognition. In ICASSP, 2020.
[Reddy et al., 2020] Chandan K. A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, and Johannes Gehrke. The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. CoRR, abs/2005.13981, 2020.
[Roux et al., 2019] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey. SDR - half-baked or well done? In ICASSP, 2019.
[Satt et al., 2017] Aharon Satt, Shai Rozenberg, and Ron Hoory. Efficient emotion recognition from speech using deep learning on spectrograms. In INTERSPEECH, 2017.
[Schneider et al., 2019] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. In INTERSPEECH, 2019.
[Snyder et al., 2018] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In ICASSP, 2018.
[van den Oord et al., 2018] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[Wan et al., 2018] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez-Moreno. Generalized end-to-end loss for speaker verification. In ICASSP, 2018.
[Wu et al., 2019] Xixin Wu, Songxiang Liu, Yuewen Cao, Xu Li, Jianwei Yu, Dongyang Dai, Xi Ma, Shoukang Hu, Zhiyong Wu, Xunying Liu, and Helen Meng. Speech emotion recognition using capsule networks. In ICASSP, 2019.
[Zhao et al., 2020] Yucheng Zhao, Chong Luo, Zheng-Jun Zha, and Wenjun Zeng. Multi-scale group transformer for long sequence modeling in speech separation. In