Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision
Abhinav Shukla, Stavros Petridis, Maja Pantic
Abstract
The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
1. Introduction
Self-supervised learning of representations from large unlabeled datasets is a popular contemporary trend in machine learning. After being widely adopted in areas like natural language processing and computer vision, self-supervision is now rapidly developing as a noteworthy topic in audio and speech processing. Self-supervision aims to capture the most informative properties from the underlying structure of unlabeled data in order to learn generalized representations. This is extremely promising in problem settings involving a large amount of unlabeled data but limited labeled data. In the context of audio and speech processing, this is relevant to low resource languages, emotion recognition, cross-cultural speech recognition and other problems with small datasets. Even though there has been recent research interest in self-supervised learning for speech data, most works focus on the audio modality alone. Audiovisual speech data offers interesting possibilities for cross-modal self-supervision, which remains relatively under-explored. In this work, we present a method for self-supervised representation learning of audio features that leverages both the audio and visual modalities. We demonstrate how generating a talking lip video from a single frame and the corresponding audio can be used as a pretext task for visual self-supervision to train a raw audio encoder. We combine this with audio-only self-supervision based on predicting informative audio attributes, similar to (Pascual et al., 2019). This results in an audio encoder trained by joint audiovisual self-supervision. We evaluate the method on spoken word classification and achieve competitive results compared with existing self-supervised methods. Our method also yields significantly better performance when learning with limited data (10% of the training set) for the downstream tasks. Importantly, our method also outperforms fully supervised training (directly training the encoder on the downstream task). Our observations motivate the utility of self-supervised pretraining for audio related tasks. We demonstrate that cross-modal supervision in audiovisual speech can learn better representations than unimodal audio-only or visual-only self-supervision.

(Footnote: Abhinav Shukla's work was supported by a PhD scholarship by Samsung Electronics, UK. Affiliations: Imperial College London, UK; Samsung AI Centre, Cambridge, UK; Facebook London, UK. Correspondence to: Abhinav Shukla <[email protected]>. Published at the Workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning, Vienna, Austria. Copyright 2020 by the author(s).)
Self-supervised learning has been very influential in recent advances in natural language processing (BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), etc.) and computer vision (CPC (Oord et al., 2018), MoCo (He et al., 2020), PIRL (Misra & van der Maaten, 2019), etc.). It is also beginning to mature as a relevant topic in audio and speech processing. CPC (Contrastive Predictive Coding) (Oord et al., 2018) was a seminal work in self-supervised learning which also demonstrated the applicability of contrastive self-supervised learning to audio. Wav2vec (Schneider et al., 2019) refines the idea from CPC specifically for speech. CPC based self-supervision has also been shown to generalize well to multiple languages (Rivière et al., 2020). APC (Autoregressive Predictive Coding) (Chung et al., 2019) is a similar approach that predicts the next token of a speech segment from the history. Another very relevant recent work is PASE (Problem Agnostic Speech Encoder) (Pascual et al., 2019), which aims to learn multi-task speech representations from raw audio by predicting a number of handcrafted features such as MFCCs, prosody and the waveform. Teacher-student models have also been explored for audio self-supervision, where the trained model from a previous epoch acts as the teacher model for the next epoch (Kumar & Ithapu, 2020). All of the works discussed so far are unimodal audio-only self-supervised methods. There are also a few works that utilize both audio and visual information. There are multiple ways to capture this cross-modal interaction, including audiovisual synchronization (Owens et al., 2018), cross-modal transition modeling (Pham et al., 2019), cross-modal pseudolabel based clustering (Alwassel et al., 2019), contrastive learning (Tian et al., 2019; Patrick et al., 2020), and audiovisual instance discrimination (Morgado et al., 2020). However, most of these works present cross-modal self-supervision in the context of generic audiovisual data, with application to tasks like video action recognition and acoustic scene classification. There is limited work that explores self-supervision specifically in the context of audiovisual speech. We have explored this concept in recent related work (Shukla et al., 2020c;b;a). This work extends the idea from our prior work: we learn speech representations directly from raw audio instead of from mel features, and we adopt a different and more refined approach for audio-only self-supervision (described in Section 2.3).
2. Method
2.1. Audio encoder

We use a 1D Resnet18 (He et al., 2016) encoder as the backbone for all of our proposed methods (detailed architecture in the appendix). The encoder f_a (see Fig. 2 and 3) takes as input a 16 kHz raw audio waveform and converts it into a 512-D audio feature vector for every timestep. The output sample rate is 25 audio feature vectors per second, which matches that of the 25 FPS video in the LRW dataset. This gives us a one-to-one mapping between the two modalities, which helps in cross-modal learning and allows us to avoid oversampling or undersampling either modality. Other contemporary self-supervised methods (Alwassel et al., 2019; Patrick et al., 2020) use a 2D Resnet18 audio encoder operating on mel features (similar to image based CNNs). However, we want our audio encoder to operate directly on the raw audio waveform and perform end-to-end self-supervised representation learning without starting from an intermediate feature like MFCCs or log mel spectrograms, which is why we chose a 1D Resnet18.

2.2. Visual self-supervision

For visual self-supervision, we generate a talking lip video from a still image and the corresponding audio (see Fig. 1 and Fig. 2). The model comprises three components: (i) the audio encoder f_a (1D Resnet18), (ii) the identity encoder f_id, and (iii) the frame decoder f_d. The model operates on 1 second long segments from an audiovisual speech dataset. The audio encoder f_a (Fig. 2 bottom-left) converts the 1 second audio sample x into a 512 dimensional embedding with 25 timesteps (z_aud). The identity encoder f_id (Fig. 2 top-left) is a 6 layer CNN that converts the mouth region of the first video frame x_im (a 64x64 image) into a 64 dimensional identity embedding (z_id). This embedding is replicated 25 times to match the timesteps of the audio embedding. The latent representation z is the concatenation of z_aud and z_id (as shown in Fig. 2). This then goes through the frame decoder f_d (see Fig. 2 top-right), which is a CNN that uses strided transposed convolutions to generate the video frames of the lip movements. The skip connections between the identity encoder and frame decoder help preserve subject identity in the generated frames. An L1 reconstruction loss between frames from the generated video (f_d(z)) and those from the real video (y_video) is used to train the network. We use the L1 loss as opposed to the L2 loss to obtain relatively sharper reconstructions. Our model aims to predict lip movements given only the audio and speaker identity information from the first frame. In this process, the audio encoder is driven to produce useful speech features that correlate with lip movements (because accurate lip movement reconstruction reduces the loss). The audio features obtained by reconstructing lip movements are therefore likely to contain information about the speech content. Our proposed method is related to our prior work on visual self-supervision for learning audio features (Shukla et al., 2020c;b;a). The key difference in this work is that we use a raw audio encoder for end-to-end learning as opposed to the log mel spectrogram encoder used in (Shukla et al., 2020b;a). Also, instead of reconstructing the full face, we focus on the mouth region, which contains the visual information about the speech content; we hypothesized that this would lead to better representations for speech recognition.

\[ z(x, x_{im}) = \mathrm{cat}\big(f_a(x),\, f_{id}(x_{im})\big) \tag{1} \]

\[ \mathcal{L}_{video}(x, x_{im}) = \big\lvert f_d(z(x, x_{im})) - y_{video} \big\rvert \tag{2} \]
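The listing below is a minimal PyTorch-style sketch of this visual self-supervision pathway (Eqs. 1 and 2). The encoder and decoder bodies are simplified placeholders rather than the exact 1D Resnet18, 6-layer identity CNN and transposed-convolution decoder described above; layer widths, grayscale mouth crops, and the omission of skip connections are illustrative assumptions.

```python
# Sketch of the visual self-supervision model (Eqs. 1 and 2), assuming
# 1 s of 16 kHz audio (16000 samples) and 25 frames of 64x64 mouth crops.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Placeholder for the 1D Resnet18: raw waveform -> (25, 512) features."""
    def __init__(self):
        super().__init__()
        # One strided Conv1d stands in for the full residual stack (16000/640 = 25 steps).
        self.net = nn.Conv1d(1, 512, kernel_size=640, stride=640)

    def forward(self, wav):                      # wav: (B, 16000)
        z = self.net(wav.unsqueeze(1))           # (B, 512, 25)
        return z.transpose(1, 2)                 # (B, 25, 512)

class IdentityEncoder(nn.Module):
    """Simplified stand-in for the 6-layer CNN: 64x64 frame -> 64-D identity vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, frame):                    # frame: (B, 1, 64, 64)
        return self.net(frame)                   # (B, 64)

class FrameDecoder(nn.Module):
    """Simplified decoder: per-timestep latent (512 + 64) -> one 64x64 frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512 + 64, 4 * 4 * 64), nn.ReLU(),
            nn.Unflatten(1, (64, 4, 4)),
            nn.ConvTranspose2d(64, 32, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=4), nn.Sigmoid())

    def forward(self, z):                        # z: (B*25, 576)
        return self.net(z)                       # (B*25, 1, 64, 64)

def video_loss(wav, first_frame, target_video, f_a, f_id, f_d):
    """L_video = |f_d(cat(f_a(x), f_id(x_im))) - y_video| (Eqs. 1 and 2).
    target_video: (B, 25, 1, 64, 64) ground truth mouth-region frames."""
    B = wav.shape[0]
    z_aud = f_a(wav)                                          # (B, 25, 512)
    z_id = f_id(first_frame).unsqueeze(1).expand(-1, 25, -1)  # replicate over time
    z = torch.cat([z_aud, z_id], dim=-1).reshape(B * 25, 576)
    pred = f_d(z).reshape(B, 25, 1, 64, 64)
    return torch.abs(pred - target_video).mean()              # L1 reconstruction loss
```

In the full model, the identity encoder would additionally pass skip connections to the frame decoder to preserve subject identity; they are omitted here for brevity.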
Figure 1. An illustration of the encoder-decoder model we use for joint audiovisual self-supervision. From an unlabeled sample of audiovisual speech, we use the raw audio waveform and the first video frame to generate a talking lip video. Lip movement reconstruction offers visual self-supervision. We also use decoders to reconstruct salient audio attributes (MFCCs, log mel, waveform) for audio-only self-supervision. By jointly optimizing the reconstruction losses for both modalities, we get joint audiovisual self-supervision. The trained audio encoder can then be used for audio-only downstream tasks.
Table 1. Results for spoken word classification (accuracy in %) on the Speech Commands (SPC, 30 classes) (Warden, 2018) and Lip Reading in the Wild (LRW, 500 classes) (Chung & Zisserman, 2016) datasets. For evaluation, a 2 layer GRU model is used on the encoder outputs for each pretraining method, before finetuning on the downstream task. Columns give the dataset and the percentage of training labels used.

Pretraining method                Self-supervision  Input type    SPC 100%  SPC 10%  LRW 100%  LRW 10%
MFCC                              -                 -             94.33     87.08    90.16     37.56
PASE (Pascual et al., 2019)       Audio             Raw audio     95.61     83.81    93.40      1.88
APC (Chung et al., 2019)          Audio             Mel features  94.87     89.91    93.97     57.41
wav2vec (Schneider et al., 2019)  Audio             Raw audio     96.04     91.57    94.60     19.50
L1 (Shukla et al., 2020b)         Visual            Mel features  95.11     86.43    94.45     33.43
L1 + Odd (Shukla et al., 2020b)   Audiovisual       Mel features  95.77     90.16    94.72     67.98
Ours (A)                          Audio             Raw audio     95.06     90.56    94.14     69.70
Ours (V)                          Visual            Raw audio     94.38     88.31    92.18     52.99
Ours (AV)                         Audiovisual       Raw audio     95.21     90.63    95.37     77.13
Supervised 1D Resnet18            -                 Raw audio     93.79     81.12    90.34     13.72
2.3. Audio self-supervision

In prior work (Shukla et al., 2020b), we employed a temporal order based pretext task for audio-only self-supervision (predicting which of the inputs are jumbled or reversed). We wanted to examine whether it is possible to obtain better speech representations using a more refined pretext task. In this work, our methodology for audio-only self-supervision is inspired by PASE (Pascual et al., 2019). We predict three informative audio attributes: (i) MFCCs, (ii) log mel spectrograms, and (iii) the waveform. The key difference between our method and PASE is that we directly train a 1D Resnet18 encoder model on the raw audio waveform. PASE requires intermediate steps like adding speech distortions for data augmentation, SincNet filters, and a penultimate Quasi-RNN layer. We also adopt only 3 of the most informative predicted attributes from PASE for simplicity. Fig. 3 illustrates our method for audio-only self-supervision. The audio encoder (f_a) converts 1 second of 16 kHz input audio (x) into a 512 dimensional audio embedding (z_aud) with 25 timesteps (exactly as in the method for visual self-supervision). The audio representation is then used as input to three separate decoders (f_mfcc, f_logmel and f_wav) that reconstruct the desired audio attributes. We keep the decoder architectures as simple as possible in order to incentivize the audio encoder to capture the important information about the audio attributes. The MFCC and the log mel spectrogram decoders (Fig. 3 right) each comprise a single fully connected layer of 256 units. The waveform decoder (Fig. 3 top-left) is made of a transposed convolution layer followed by a convolution layer that outputs the reconstructed waveform (in an autoencoder-like fashion). We use an L1 loss between each reconstructed attribute and its ground truth (y_attrib) to train the model. The total loss is the sum of the MFCC loss, the log mel loss, and the waveform loss. For attrib ∈ {mfcc, logmel, wav}, the loss is:

\[ \mathcal{L}_{audio}(x) = \sum_{attrib} \big\lvert f_{attrib}(f_a(x)) - y_{attrib} \big\rvert \tag{3} \]

2.4. Joint audiovisual self-supervision

For joint audiovisual self-supervision (see Fig. 1), we simply combine the two proposed methods for visual-only and audio-only self-supervision. Since the same audio encoder architecture is used in both models, we can use the shared audio representation as input to each of the four decoders (frame decoder, MFCC decoder, log mel decoder, waveform decoder). The total loss is the sum of the audio-only and the visual-only losses. The audio encoder (f_a) is thus trained end-to-end and is driven to produce features that contain information about each of the predicted attributes from both the audio and the visual modalities.

\[ \mathcal{L}_{total}(x, x_{im}) = \mathcal{L}_{video}(x, x_{im}) + \mathcal{L}_{audio}(x) \tag{4} \]
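The sketch below illustrates the audio attribute decoders and the joint objective (Eqs. 3 and 4), reusing the placeholder AudioEncoder from the previous listing. The target dimensionalities for the MFCC and log mel streams (N_MFCC, N_MEL) and the waveform decoder channel width are assumptions made for illustration; the paper does not specify them.

```python
# Sketch of the audio attribute decoders and the joint loss (Eqs. 3 and 4).
import torch
import torch.nn as nn

N_MFCC, N_MEL = 20, 80   # illustrative target sizes, not taken from the paper

class AttributeDecoders(nn.Module):
    def __init__(self):
        super().__init__()
        # One 256-unit fully connected layer per spectral attribute, as in the paper.
        self.mfcc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, N_MFCC))
        self.logmel = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, N_MEL))
        # Waveform decoder: a transposed convolution followed by a convolution.
        self.wav = nn.Sequential(
            nn.ConvTranspose1d(512, 64, kernel_size=640, stride=640),  # 25 steps -> 16000 samples
            nn.Conv1d(64, 1, kernel_size=1))

    def forward(self, z_aud):                    # z_aud: (B, 25, 512)
        mfcc = self.mfcc(z_aud)                  # (B, 25, N_MFCC)
        logmel = self.logmel(z_aud)              # (B, 25, N_MEL)
        wav = self.wav(z_aud.transpose(1, 2))    # (B, 1, 16000)
        return mfcc, logmel, wav.squeeze(1)

def audio_loss(z_aud, decoders, y_mfcc, y_logmel, y_wav):
    """L_audio: sum of L1 losses between reconstructed attributes and targets (Eq. 3)."""
    mfcc, logmel, wav = decoders(z_aud)
    return (torch.abs(mfcc - y_mfcc).mean()
            + torch.abs(logmel - y_logmel).mean()
            + torch.abs(wav - y_wav).mean())

def total_loss(l_video, l_audio):
    """L_total = L_video + L_audio (Eq. 4)."""
    return l_video + l_audio
```

In joint training, z_aud is the same encoder output that also feeds the frame decoder, so gradients from all four reconstruction losses update the shared raw audio encoder.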
3. Experiments
Datasets
The LRW dataset (Chung & Zisserman, 2016) is a large, in-the-wild dataset of 500 different isolated words, primarily from BBC recordings. It is an audiovisual speech dataset and is thus appropriate for training our methods. We use a subset of LRW that contains only nearly frontal videos (with yaw, pitch and roll restricted to a maximum of 10 degrees), in order to have a cleaner supervisory signal from the visual modality. This filtering leaves us with a total of around 40 hours of usable data. We use this subset of the LRW dataset for self-supervised pretraining of our proposed methods. We also use it as a spoken word classification evaluation dataset. The SPC (Speech Commands v0.01) dataset (Warden, 2018) contains 64,727 total utterances of 30 different words by 1,881 speakers. We also use SPC as a spoken word classification evaluation dataset.
Baselines
We compare our methods against other self-supervised methods for learning speech representations. For all the baselines, we use the code (and pretrained models) provided by the authors. We compare against PASE (Pascual et al., 2019), APC (Chung et al., 2019) and wav2vec (Schneider et al., 2019). We also compare against our prior related work. L1 (Shukla et al., 2020b) is similar to our proposed method for visual-only self-supervision but is based on log mel spectrograms as opposed to raw audio. L1 + Odd (Shukla et al., 2020b) is an audiovisual self-supervised method. We use a more refined audio self-supervision approach in this work. We also compare our methods against two supervised learning baselines for audio. We use 39 dimensional MFCCs (13 coefficients, 13 deltas, and 13 delta-deltas) as the first supervised baseline. The second baseline is a fully supervised 1D Resnet18 model (same architecture as our pretrained encoders but trained from scratch directly on the evaluation datasets).
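For reference, a 39-dimensional MFCC front-end of this kind (13 static coefficients plus deltas and delta-deltas) can be computed along the following lines. This is a generic librosa-based sketch rather than the authors' exact extraction code; the 25 ms / 10 ms framing is an assumption chosen to roughly match the ~100 Hz feature rate reported in the appendix.

```python
# Generic sketch of a 39-D MFCC front-end (13 MFCCs + deltas + delta-deltas).
import librosa
import numpy as np

def mfcc_39(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)    # 25 ms windows, 10 ms hop
    delta = librosa.feature.delta(mfcc)                       # first-order deltas
    delta2 = librosa.feature.delta(mfcc, order=2)             # second-order deltas
    return np.concatenate([mfcc, delta, delta2], axis=0).T    # (num_frames, 39)
```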
Experimental setup
We evaluate all methods on isolated word classification on the Speech Commands (SPC) (Warden, 2018) and Lip Reading in the Wild (LRW) (Chung & Zisserman, 2016) datasets. We use a 2 layer BiGRU (with 256 units in each layer) on the encoder outputs, followed by a linear layer with as many units as the number of target classes (30 for SPC, 500 for LRW). This acts as the downstream classifier and remains the same for every method. For downstream classification, we finetune the models (as shown in the bottom of Fig. 1) for 50 epochs. The learning rate is 0.0001 for the first 40 epochs and 0.00001 for the last 10 epochs. We use the standard softmax + cross entropy loss for training. We opted to use a BiGRU for simplicity; however, it can be replaced by any model that can classify variable length sequences into discrete categories (such as LSTMs, TCNs, or LiGRUs (Ravanelli et al., 2018)). The results can be seen in Table 1.
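A minimal sketch of this classifier head is shown below. How the BiGRU outputs are pooled before the linear layer is not stated in the text, so taking the final hidden states of both directions is an assumption, as is the choice of optimizer for finetuning.

```python
# Sketch of the downstream word classifier: 2-layer BiGRU + linear output layer.
import torch
import torch.nn as nn

class WordClassifier(nn.Module):
    def __init__(self, num_classes, feat_dim=512, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):                      # feats: (B, T, 512) encoder outputs
        _, h = self.gru(feats)                     # h: (num_layers * 2, B, hidden)
        last = torch.cat([h[-2], h[-1]], dim=-1)   # final fwd/bwd states of the top layer
        return self.fc(last)                       # (B, num_classes) logits

# Finetuning schedule from the paper: 50 epochs, lr 1e-4 for 40 epochs then 1e-5 for 10,
# trained with softmax + cross entropy. The optimizer type below is an assumption.
# optimizer = torch.optim.Adam(params, lr=1e-4)
# criterion = nn.CrossEntropyLoss()
```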
Table 2. Results for spoken word classification (accuracy in %) under various levels of introduced noise (SNR in dB). Babble noise from the NOISEX database is used to perturb the audio samples in the LRW and SPC datasets.

Dataset  Model      -5 dB  0 dB   5 dB   10 dB  15 dB  20 dB  Clean
SPC      MFCC       76.31  84.97  90.56  91.98  93.05  94.19  94.33
SPC      Ours (A)   79.35  88.42  92.34  93.41  94.63  95.04  95.06
SPC      Ours (V)   77.92  86.92  91.01  92.80  93.47  93.88  94.38
SPC      Ours (AV)  79.79  88.69  92.21  93.57  94.65  95.02  95.21
LRW      MFCC       50.18  70.75  81.08  85.74  88.41  90.11  90.16
LRW      Ours (A)   58.84  79.13  89.14  91.72  92.87  93.84  94.14
LRW      Ours (V)   51.40  73.47  84.61  88.11  90.98  91.58  92.18
LRW      Ours (AV)  64.63  82.59  90.08  92.09  92.91  93.87  95.37
Results with all labels
With 100% of the training dataset used, all self-supervised methods achieve comparable performance and outperform fully supervised training. On the SPC dataset, the best overall performance is attained by wav2vec with an accuracy of 96.04%, followed by our prior work at 95.77%, PASE at 95.61% and our proposed method at 95.21%. On LRW, the best performance is achieved by our method with an accuracy of 95.37%.
Learning with fewer labels
The concept of self-supervision is especially relevant to situations where labeled data is scarce. To compare the methods in such situations, we perform the same word classification experiments on the SPC and LRW datasets but with only 10% of the samples being used in the training set (the validation and test sets remain unchanged). Note that we completely omit the remaining 90% of the training set (see Tables 6, 7, 8 for exact split details). This leaves us with around 170 training examples per class for the SPC dataset (30 classes) and only around 20 training examples per class for the LRW dataset (500 classes), which makes the problem significantly more challenging. On SPC, there is a slight degradation in the performance of all methods. Our method attains an accuracy of 90.63%, second only to wav2vec at 91.57%. On LRW, all other methods are severely affected and overfit to the small training set. Our method is the least affected and significantly outperforms all other methods with a best performance of 77.13%.
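For reference, such a label-stratified 10% training subset can be drawn along the following lines; the exact sampling procedure used by the authors (random seed, per-class rounding) is not stated, so this is only an illustrative sketch.

```python
# Generic sketch: keep ~10% of training samples per class; val/test splits are untouched.
import random
from collections import defaultdict

def subsample_per_class(train_items, fraction=0.1, seed=0):
    """train_items: list of (sample_id, label) pairs."""
    by_class = defaultdict(list)
    for sample_id, label in train_items:
        by_class[label].append(sample_id)
    rng = random.Random(seed)
    subset = []
    for label, ids in by_class.items():
        rng.shuffle(ids)
        k = max(1, int(len(ids) * fraction))   # keep at least one sample per class
        subset.extend((i, label) for i in ids[:k])
    return subset
```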
Noisy situations
We also compare the performance of the variations of our method under various levels of artificially induced noise. We introduce babble noise from the NOISEX (Varga & Steeneken, 1993) database to create noisy versions of the SPC and LRW datasets. We use six levels of noise, ranging from -5 dB SNR to 20 dB SNR in increments of 5 dB. The results for the noisy datasets can be seen in Table 2. All our methods outperform MFCCs at all noise levels on both datasets, and the joint audiovisual method is the best.
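Additive noise at a target SNR is typically mixed in as sketched below. This is a generic recipe rather than the authors' exact perturbation code, and it assumes the babble noise recording is at least as long as the speech clip.

```python
# Generic sketch: mix a noise recording into a speech clip at a target SNR (in dB).
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng=np.random):
    """speech, noise: 1-D float arrays at the same sample rate."""
    start = rng.randint(0, len(noise) - len(speech) + 1)
    noise_seg = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise_seg ** 2) + 1e-12
    # Scale the noise so that 10 * log10(p_speech / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise_seg
```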
4. Discussion
There are multiple interesting observations from our obtained results. Audio-only self-supervision yields better results than visual-only self-supervision. However, the model trained with joint audiovisual self-supervision performs better than the models trained with unimodal audio-only and visual-only self-supervision in almost all scenarios, including the noisy datasets. This highlights the utility of the complementary information encoded by visual self-supervision and demonstrates the potential of multimodal self-supervision as a useful tool in speech representation learning. Also notably, despite all tested methods being very similar in performance on the full datasets, there is a clear gap when using a small training set, and our method is the best at learning with fewer labels, which is very relevant to low resource domains. This can have significant impact on problems like low resource language ASR, emotion recognition and cross-cultural ASR. Our method also significantly outperforms fully supervised training from scratch, which further motivates the utility of self-supervised pretraining for speech.
Future work
This is a work in progress and there are many other speech related applications that we can evaluate our model on. In this work, we only focused on the classification of isolated words. We will also test the model on continuous CTC based speech recognition on datasets like Librispeech and TIMIT, and on other tasks like speaker identification and speech emotion recognition. An especially relevant application would be low resource language ASR. There are also interesting directions to explore to improve our method. In this work, we exhibit how joint audiovisual information can be used for audio representation learning. In a similar manner, we could also utilize this cross-modal information for visual representation learning (e.g. predicting speech attributes from the visual modality). Another interesting line of work is multimodal contrastive self-supervised learning, which has been demonstrated for generic audiovisual data but not for audiovisual speech.
References
Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., and Tran, D. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667, 2019.

Chung, J. and Zisserman, A. Lip reading in the wild. In ACCV, 2016.

Chung, Y., Hsu, W., Tang, H., and Glass, J. An unsupervised autoregressive model for speech representation learning. arXiv:1904.03240, 2019.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. CVPR, 2020.

Kumar, A. and Ithapu, V. K. SeCoST: Sequential co-supervision for weakly labeled audio event detection. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Misra, I. and van der Maaten, L. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019.

Morgado, P., Vasconcelos, N., and Misra, I. Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943, 2020.

Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Owens, A., Wu, J., McDermott, J., Freeman, W., and Torralba, A. Learning sight from sound: Ambient sound provides supervision for visual learning. IJCV, 126(10):1120–1137, 2018.

Pascual, S., Ravanelli, M., Serrà, J., Bonafonte, A., and Bengio, Y. Learning problem-agnostic speech representations from multiple self-supervised tasks. Interspeech, 2019.

Patrick, M., Asano, Y. M., Fong, R., Henriques, J. F., Zweig, G., and Vedaldi, A. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020.

Pham, H., Liang, P., Manzini, T., Morency, L., and Póczos, B. Found in translation: Learning robust joint representations by cyclic translations between modalities. In AAAI, volume 33, pp. 6892–6899, 2019.

Ravanelli, M., Brakel, P., Omologo, M., and Bengio, Y. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102, 2018.

Rivière, M., Joulin, A., Mazaré, P.-E., and Dupoux, E. Unsupervised pretraining transfers well across languages. arXiv preprint arXiv:2002.02848, 2020.

Schneider, S., Baevski, A., Collobert, R., and Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv:1904.05862, 2019.

Shukla, A., Petridis, S., and Pantic, M. Visual self-supervision by facial reconstruction for speech representation learning. Sight and Sound Workshop, CVPR, 2020a.

Shukla, A., Petridis, S., and Pantic, M. Does visual self-supervision improve the learning of speech representations? arXiv preprint arXiv:2005.01400, 2020b.

Shukla, A., Vougioukas, K., Ma, P., Petridis, S., and Pantic, M. Visually guided self supervised learning of speech representations. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020c.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.

Varga, A. and Steeneken, H. J. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, 1993.

Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
Appendix

A. Audio encoders
Table 3. Encoder type and number of trainable parameters in each of the compared methods.

Method    Encoder type        Parameters
PASE      SincNet + CNN + FC   5,818,020
APC       Log mel + GRU        4,105,296
wav2vec   CNN                 32,537,088
L1 + Odd  Log mel + GRU        4,065,282
Ours      1D Resnet18          3,848,576
Table 4. Feature dimensionality and sample rate of each of the compared methods.

Method    Dim.  Hz
MFCC       39   101
PASE      100   100
APC       512   101
wav2vec   512    98
L1        512   101
L1 + Odd  512   101
Ours      512    25
Table 5. Pretraining dataset and duration for each method.

Method    Pretraining dataset          Duration
PASE      Librispeech subset           10 hours
APC       Librispeech train-clean-360  360 hours
wav2vec   Full Librispeech + WSJ       1000 hours
L1        LRW frontal subset           36 hours
L1 + Odd  LRW frontal subset           36 hours
Ours      LRW frontal subset           36 hours
Pretraining datasets for baselines
The results in Table 1 for all the baseline methods (PASE, APC, wav2vec) have been computed using the public code and pretrained models provided by the authors. These baseline methods (and our method) have been pretrained on varying amounts and types of data. For a completely fair comparison, all methods would need to be pretrained on the same data. We experimented with pretraining all baseline methods on the same 36 hour LRW frontal subset that we use for our method. The results obtained with the baseline methods using this approach were either equivalent or worse than those with the public pretrained models. This suggests that our model may be able to learn better representations from the same amount of pretraining data. However, for the reported results we use the public pretrained models, which assists with reproducibility.
B. Dataset and split details
Table 6. The number of data samples in each split of each dataset.

Dataset - % labels  Train   Val   Test
SPC-100%            51088   6798  6835
SPC-10%              5097   6798  6835
LRW-100%           112812   5878  5987
LRW-10%             11054   5878  5987
Table 7. The duration (in hours) of each split of each dataset.

Dataset - % labels  Train  Val   Test
SPC-100%            14.19  1.89  1.90
SPC-10%              1.41  1.89  1.90
LRW-100%            36.35  1.89  1.92
LRW-10%              3.56  1.89  1.92
Table 8. The average number of samples (rounded to the nearest integer) and duration (in minutes) of each class in the training set.

Dataset   Classes  Samples/class  Minutes/class
SPC-100%  30       1703           28.38
SPC-10%   30        170            2.82
LRW-100%  500       225            4.36
LRW-10%   500        22            0.42