Representation Learning For Speech Recognition Using Feedback Based Relevance Weighting
Purvi Agrawal and Sriram Ganapathy
Learning and Extraction of Acoustic Patterns (LEAP) Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India.
ABSTRACT
In this work, we propose an acoustic embedding based approach for representation learning in speech recognition. The proposed approach involves two stages, comprising acoustic filterbank learning from the raw waveform, followed by modulation filterbank learning. In each stage, a relevance weighting operation is employed that acts as a feature selection module. In particular, the relevance weighting network receives embeddings of the model outputs from the previous time instants as feedback. The proposed relevance weighting scheme allows the respective feature representations to be adaptively selected before propagation to the higher layers. The application of the proposed approach to the task of speech recognition on the Aurora-4 and CHiME-3 datasets gives significant performance improvements over baseline systems on the raw waveform signal as well as over those based on mel representations, with consistent average relative improvements over the mel baseline on both datasets.

Index Terms: Speech representation learning, feedback of acoustic embeddings, raw speech waveform, 2-stage relevance weighting, speech recognition.
1. INTRODUCTION
Representation learning deals with the broad set of methods that enable the learning of meaningful representations from raw data. As in the rest of machine learning, representation learning can be carried out in an unsupervised fashion, for example with principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) [1], or in a supervised fashion, as with linear discriminant analysis (LDA). Recently, deep learning based representation learning has drawn substantial interest. While a lot of success has been reported for the text and image domains (e.g., word2vec embeddings [2]), representation learning for speech and audio remains challenging.

One of the research directions pursued for speech has been the learning of filter banks operating directly on the raw waveform [3–7], mostly in a supervised setting. Efforts at unsupervised filterbank learning have also been investigated: the work in [8] used a restricted Boltzmann machine, while the efforts in [9] used variational autoencoders. The wav2vec method [10] explores unsupervised pre-training for speech recognition by learning representations of raw audio. There have also been recent attempts to explore the interpretability of acoustic filterbanks, e.g., the SincNet filterbank [11] and self-supervised learning [12].
(This work was partly funded by grants from the Department of Atomic Energy (DAE/34/20/12/2018-BRNS/34088) project and the Ministry of Human Resource Development (MHRD), Government of India.)

However, compared to vector representations of text, which have been shown to embed meaningful semantic properties, the interpretability of speech representations from these approaches has often been limited.

Subsequent to acoustic filterbank processing, modulation filtering is the process of filtering the 2-D spectrogram-like representation using 2-D filters along the time (rate filtering) and frequency (scale filtering) dimensions. Several attempts have been made to learn the modulation filters from data as well. The earliest approaches, using LDA, explored the learning of temporal modulation filters in a supervised manner [13, 14]. Using deep learning, there have been recent attempts to learn modulation filters in an unsupervised manner [15, 16].

In this paper, we extend our previous work [17] on joint acoustic and modulation filter learning in the first two layers of a convolutional neural network (CNN) operating on the raw speech waveform. The novel contribution of our approach is the incorporation of acoustic embeddings as feedback in the relevance weighting approach. In particular, the relevance weighting network is driven by the acoustic/modulation filter outputs along with the embedding of the previous one-hot targets. The output of the relevance network is a relevance weight that multiplies the acoustic/modulation filter outputs [17]. The rest of the architecture performs the task of acoustic modeling for automatic speech recognition (ASR). The approach of feeding the model outputs back to the neural network has previously been reported as a form of recurrent neural network (RNN), the teacher forcing network [18]. However, in this work, the embeddings of the model outputs are fed back only to the relevance weighting network, not as an RNN architecture.

The ASR experiments are conducted on the Aurora-4 (additive noise with channel artifact) dataset [19], the CHiME-3 (additive noise with reverberation) dataset [20] and the VOiCES (additive noise with reverberation) dataset [21]. The experiments show that the learned representations from the proposed framework provide considerable improvements in ASR results over the baseline methods.
2. RELEVANCE BASED REPRESENTATION LEARNING
The block schematic of the senone embedding network is shown in Figure 1. The entire acoustic model using the proposed relevance weighting model is shown in Figure 3.
2.1. Senone Embedding Network

The embedding network (Figure 1) is similar to the skip-gram network of the word2vec models [2]. The one-hot encoded senone (context dependent triphone hidden Markov model (HMM) states modeled in the ASR) target vector at frame t, denoted h_t, is fed to a network whose first layer outputs the embedding, denoted e_t. This embedding predicts the one-hot target vectors for the preceding and succeeding time frames, h_{t-1} and h_{t+1}. The model is trained using the ASR labels for each task before the acoustic model training. Once the model is trained, only the embedding extraction part (the first layer outputs) is used in the final ASR model. During ASR testing, the embeddings are derived by feeding back the softmax outputs from the acoustic model (similar to the teacher forcing network [18]).

Fig. 1. Block schematic of the senone embedding network used in the proposed model.

For the analysis, the TIMIT test set [22], which is hand labelled for phonemes, is used. The t-SNE visualization of the embeddings is shown in Fig. 2 for phonemes from the TIMIT test set, for a group of vowels {/ao/, /aa/, /ae/, /ey/, /uw/}, a plosive {/t/}, fricatives {/sh/, /zh/}, and nasals {/em/, /eng/}. As seen in the t-SNE plot, the embeddings, while trained on one-hot senones, provide a segregation of the different phoneme types such as vowels, nasals, fricatives and plosives.

Fig. 2. t-SNE plot of the senone embeddings for the TIMIT dataset.

2.2. Acoustic Filterbank Learning

The input to the neural network is the raw waveform, windowed into S samples per frame with a contextual window of T frames. Each block of S samples is referred to as a frame. This input of S × T raw audio samples is processed with a 1-D convolution using F kernels (F denotes the number of sub-bands in the filterbank decomposition), each of size L. The kernels are modeled as cosine-modulated Gaussian functions [9, 23],

    g_i(n) = cos(2πµ_i n) × exp(−n²µ_i²/2),    (1)

where g_i(n) is the i-th kernel (i = 1, ..., F) at time n and µ_i is the center frequency of the i-th filter (in the frequency domain). The mean parameter µ_i is updated in a supervised manner for each dataset. The convolution with the cosine-modulated Gaussian filters generates F feature maps, which are squared, average pooled within each frame and log transformed. This generates x, an F-dimensional feature vector for each of the T contextual frames, as shown in Figure 3. The representation x can be interpreted as the "learned" time-frequency representation (spectrogram).
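As a concrete illustration, the parametric kernel of Eq. (1) can be written in a few lines of PyTorch (the toolkit used in Section 3). This is a minimal sketch and not the authors' released code: the normalized-frequency parameterization of µ_i, the initialization of µ, and the 25 ms / 10 ms frame settings are assumptions introduced for the example.

```python
import math

import torch
import torch.nn.functional as F


def cosine_gaussian_kernels(mu, length=129):
    """Cosine-modulated Gaussian kernels g_i(n) = cos(2*pi*mu_i*n) * exp(-n^2 * mu_i^2 / 2).

    mu: (F,) tensor of learnable center frequencies, assumed normalized to [0, 0.5].
    Returns an (F, 1, length) tensor usable as 1-D convolution weights.
    """
    n = torch.arange(length, dtype=torch.float32) - (length - 1) / 2.0  # centered time axis
    kernels = torch.cos(2 * math.pi * mu[:, None] * n) * torch.exp(-0.5 * (mu[:, None] * n) ** 2)
    return kernels.unsqueeze(1)


# Learned spectrogram: convolve, square, average-pool within each frame, log-compress.
mu = torch.nn.Parameter(torch.linspace(0.01, 0.5, 80))     # F = 80 sub-band center frequencies
wave = torch.randn(1, 1, 16000)                            # (batch, channel, samples) dummy input
maps = F.conv1d(wave, cosine_gaussian_kernels(mu), padding=64) ** 2
x = torch.log(F.avg_pool1d(maps, kernel_size=400, stride=160) + 1e-6)  # (1, 80, num frames)
```

Since µ is the only learnable parameter of each kernel, gradients from the senone classification loss flow directly into the center frequencies, matching the supervised update of µ_i described above.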
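Similarly, the senone embedding network of Section 2.1 (Figure 1) can be sketched as below, assuming the skip-gram-style structure described there; the embedding dimension and layer sizes are placeholders, since the exact values are not given here.

```python
import torch
import torch.nn as nn


class SenoneEmbedding(nn.Module):
    """Skip-gram style senone embedding network (sizes are placeholder assumptions).

    A one-hot senone vector h_t is mapped by the first layer to the embedding e_t,
    which predicts the senone targets of the preceding and succeeding frames.
    After training, only the embedding layer is retained for the ASR model.
    """

    def __init__(self, num_senones, embed_dim=256):
        super().__init__()
        self.embed = nn.Linear(num_senones, embed_dim)       # first layer: h_t -> e_t
        self.pred_prev = nn.Linear(embed_dim, num_senones)   # predicts h_{t-1}
        self.pred_next = nn.Linear(embed_dim, num_senones)   # predicts h_{t+1}

    def forward(self, h_t):
        e_t = self.embed(h_t)
        return e_t, self.pred_prev(e_t), self.pred_next(e_t)
```

Training would minimize the cross entropy between the two predictions and the neighboring senone labels; at test time, h_t would be replaced by the softmax output of the acoustic model, as in teacher forcing [18].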
Fig. 3. (a) Block diagram of the proposed representation learning approach from the raw waveform, (b) expanded acoustic FB relevance sub-network, where x_t(f) denotes the sub-band trajectory of band f for all frames centered at time t and e_{t-1} denotes the acoustic embedding vector for the previous time step, (c) expanded modulation filterbank relevance sub-network.

2.3. Step-1: Relevance Weighting of Acoustic Filterbank Representation

The relevance weighting for the acoustic FB layer is implemented using a relevance sub-network fed with the F × T time-frequency representation x and the embedding e of the previous time step. Let x_t(f) denote the vector containing the sub-band trajectory of band f for all T frames centered at t (shown in Figure 3(b)). Then, x_t(f) is concatenated with the embedding of the previous time step, e_{t-1}, after a tanh() non-linearity. This is fed to a two-layer deep neural network (DNN) with a sigmoid non-linearity at the output, which generates a scalar relevance weight w_a(t, f) corresponding to the input representation at time t for sub-band f. This operation is repeated for all F sub-bands, giving an F-dimensional weight vector w_a(t) for the input x_t. The F-dimensional weights w_a(t) multiply each column of the "learned" spectrogram representation x_t to obtain the relevance weighted filterbank representation y_t. The relevance weights in the proposed framework differ from a typical attention mechanism [24]: relevance weighting is applied on the representation as soft feature selection weights, without performing a linear combination. We also process the first layer outputs (y) using instance norm [25, 26].

In our experiments, we use T = 101 frames, where the center frame carries the senone target for the acoustic model. We also use F = 80 sub-bands and an acoustic filter length L = 129, which corresponds to 8 ms for a 16 kHz sampled signal. The value of S corresponds to a 25 ms window length with a 10 ms frame shift.

2.4. Step-2: Relevance Weighting of Modulation Filtered Representation

The representation z from the acoustic filterbank layer is fed to the second convolutional layer, which is interpreted as a modulation filtering layer (shown in Figure 3). The kernels of this convolutional layer are 2-D spectro-temporal modulation filters that learn the rate-scale characteristics from the data. The modulation filtering layer generates K parallel streams, corresponding to the K modulation filters w_k. The modulation filtered representations p are max-pooled, leading to feature maps of size F′ × T′. These are weighted using a second relevance weighting sub-network (referred to as the modulation filter relevance sub-network in Figure 3, expanded in Figure 3(c)).

The modulation relevance sub-network is fed with the feature maps p_k, where k = 1, ..., K, and the embedding e of the previous time step. The embedding e is linearly transformed and concatenated with the input feature map. This is fed to a two-layer DNN with a softmax non-linearity at the output, which generates a scalar relevance weight w_m(k) corresponding to the input representation at time t (with t as the center frame) for the k-th feature map. The weights w_m multiply the representation p to obtain the weighted representation q, which is fed to a batch normalization layer [27]. We use K = 40 in this work.
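The acoustic FB relevance sub-network of Figure 3(b) could be sketched as follows. The hidden width and the activation between the two fully connected layers are assumptions; the tanh/concatenate/FC/FC/sigmoid structure follows the description above.

```python
import torch
import torch.nn as nn


class AcousticRelevance(nn.Module):
    """Stage-1 relevance sub-network (sketch; hidden size and inner activation are assumptions)."""

    def __init__(self, T=101, embed_dim=256, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(T + embed_dim, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, x, e_prev):
        # x: (B, F, T) learned spectrogram; e_prev: (B, embed_dim) embedding e_{t-1}.
        e = torch.tanh(e_prev).unsqueeze(1).expand(-1, x.shape[1], -1)  # repeat per sub-band
        z = torch.cat([torch.tanh(x), e], dim=-1)             # join trajectory and embedding
        w = torch.sigmoid(self.fc2(torch.tanh(self.fc1(z))))  # w_a(t, f) in (0, 1)
        return w * x                                          # soft selection, no linear mixing
```

Note that the final line multiplies the weights onto the representation instead of forming an attention-style weighted sum, which is exactly the soft feature selection behavior distinguishing relevance weighting from attention [24].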
Following the acoustic filterbank layer and the modulation filtering layer (including the relevance sub-networks), the acoustic model consists of a series of CNN and DNN layers with sigmoid non-linearity.
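Putting the pieces together, a simplified sketch of the two-stage front-end could look as follows, reusing AcousticRelevance from above. Only F = 80, K = 40 and the 5×5 modulation kernels come from the text; the 2×2 max-pool window, the pooled per-map summary fed to the stage-2 network, and all hidden sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn


class TwoStageFrontEnd(nn.Module):
    """Simplified sketch of the two-stage relevance-weighted front-end of Figure 3."""

    def __init__(self, K=40, T=101, embed_dim=256):
        super().__init__()
        self.stage1 = AcousticRelevance(T, embed_dim)      # acoustic FB relevance weighting
        self.inorm = nn.InstanceNorm2d(1)                  # instance norm on stage-1 output
        self.modulation = nn.Conv2d(1, K, kernel_size=5)   # 2-D rate-scale (modulation) filters
        self.pool = nn.MaxPool2d(2)
        self.summarize = nn.AdaptiveAvgPool2d(4)           # fixed-size summary of each map
        self.e_proj = nn.Linear(embed_dim, 16)             # linear transform of the embedding
        self.fc1 = nn.Linear(16 + 16, 32)
        self.fc2 = nn.Linear(32, 1)
        self.bnorm = nn.BatchNorm2d(K)

    def forward(self, x, e_prev):
        y = self.stage1(x, e_prev)                         # (B, F, T)
        z = self.inorm(y.unsqueeze(1))                     # (B, 1, F, T)
        p = self.pool(self.modulation(z))                  # (B, K, F', T')
        s = self.summarize(p).flatten(2)                   # (B, K, 16) per-map summary
        e = self.e_proj(e_prev).unsqueeze(1).expand(-1, p.shape[1], -1)
        logits = self.fc2(torch.tanh(self.fc1(torch.cat([s, e], dim=-1)))).squeeze(-1)
        w_m = torch.softmax(logits, dim=-1)                # (B, K) modulation relevance weights
        q = p * w_m[:, :, None, None]                      # weight each feature map
        return self.bnorm(q)                               # passed on to the CNN-DNN backend
```

With F = 80 and T = 101, x has shape (B, 80, 101) and the output has shape (B, 40, 38, 48) under these assumptions; the subsequent CNN and DNN layers of Figure 3 then map this to senone posteriors.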
3. EXPERIMENTS AND RESULTS
The speech recognition systems are trained using PyTorch [28], while the Kaldi toolkit [29] is used for decoding and language modeling. The models are discriminatively trained on the training data with the cross entropy loss and the Adam optimizer [30]. A hidden Markov model - Gaussian mixture model (HMM-GMM) system is used to generate the senone alignments for training the CNN-DNN based model. The ASR results are reported with a tri-gram language model or with a recurrent neural network language model (RNN-LM).

For each dataset, we compare the ASR performance of the proposed approach of learning acoustic representations from the raw waveform, with the acoustic FB (A) plus relevance weighting (A-R) and the modulation FB (M) plus relevance weighting (M-R), denoted as (A-R,M-R), against traditional log mel filterbank energy (MFB) features (80 dimensions), power normalized filterbank energy (PFB) features [31], mean Hilbert envelope (MHE) features [32], and excitation based (EB) features [33]. We also compare performance with the SincNet method proposed in [11]. Note that the modulation filtering layer (M) is part of the baseline model, and hence the notation M is not explicitly mentioned in the discussion. The neural network architecture shown in Figure 3 (except for the acoustic filterbank layer, the acoustic FB relevance sub-network and the modulation filter relevance sub-network) is used for all the baseline features.
3.1. Aurora-4 ASR

This database consists of read speech recordings from a 5000-word vocabulary corpus, recorded under clean and noisy conditions (street, train, car, babble, restaurant, and airport) at 10-20 dB SNR. The training data has 7138 multi condition recordings (83 speakers) with a total of about 15 hours of speech; a validation set of multi condition recordings is also used. The test data has 330 recordings (8 speakers) for each of the clean and noise conditions. The test data are classified into group A (clean data), B (noisy data), C (clean data with channel distortion), and D (noisy data with channel distortion).

Table 1. Word error rate (%) for different configurations of the proposed model for the ASR task on the Aurora-4 dataset.
Features                                     A      B      C      D      Avg.
Baseline:
Raw Waveform (A,M)                           4.1    6.8    7.3    16.2   10.7
Acoustic Relevance:
A-R,M [Softmax, no embedding] [17]           3.6    6.4    8.1    15.1   10.0
A-R,M [Sigmoid, no embedding]                3.4    6.4    6.7    15.5    9.9
A-R,M [Sigmoid, with senone embedding]       3.4    6.2    6.7    14.5
Acoustic Relevance & Mod. Relevance:
A-R,M-R [Softmax, no embedding] [17]         3.6    6.1
Table 2. Word error rate (%) on the Aurora-4 database with various feature extraction schemes, decoding with a trigram LM (and RNN-LM in parentheses).
Cond       MFB    PFB    MHE    EB     Sinc   MFB-R  S-R,M-R  A-R,M-R
A: Clean with same Mic
Clean      4.2    4.0    3.8    3.7    4.0    3.9    3.8
B: Noisy with same Mic
Airport    6.8    7.1    7.3    -      6.9    6.7    6.2
Babble     6.6    7.4    7.4    -      6.7    6.5    6.1
Car        4.0    4.5    4.3    -      4.0    4.1    3.9
Rest.      9.4    9.6    9.1    -      9.4    9.6    8.4
Street     8.1    8.1    7.6    -      8.4    8.4    7.5
Train      8.4    8.6    8.6    -      8.3    8.2    7.4
Avg.       7.2    7.5    7.4    6.0    7.3    7.2    6.6
C: Clean with diff. Mic
Clean      7.2    7.3    7.3
D: Noisy with diff. Mic
Car        8.6    11.2   9.6    -      9.0    8.9    7.9
Rest.      18.8   21.0   20.1   -      19.0   18.8   19.2
Street     17.3   19.5   18.8   -      17.3   17.8   16.6
Train      17.6   18.8   18.7   -      18.1   17.9   16.6
Avg.       15.9   17.9   17.3   15.8   16.2   16.1   15.1
Avg. of all conditions
Avg.       10.7   11.7   11.4   9.9    10.8   10.8   9.9
The ASR performance on the Aurora-4 dataset is shown in Table 1 for various configurations of the proposed approach, and in Table 2 for the different baseline features. In order to observe the impact of the different components of the proposed model, we tease apart the components and measure the ASR performance (Table 1). The fifth row (A-R,M-R [Softmax, no embedding]) refers to the previous attempt using the 2-stage filter learning reported in [17]. In this paper, we explore variants of the proposed model: the softmax non-linearity instead of the sigmoid in both relevance weighting sub-networks, the sigmoid in both sub-networks, each without and with senone embeddings, and the 2-stage approach (both relevance weighting sub-networks). Among the variants with acoustic relevance weighting alone, A-R [Sigmoid, with senone embedding] improves over the softmax non-linearity. For the joint A-R,M-R case, again the sigmoid with senone embeddings provides the best result.

Comparing the different baseline features in Table 2, it can be observed that most of the noise-robust front-ends do not improve over the baseline mel filterbank (MFB) performance. The raw waveform acoustic FB performs similar to the MFB baseline features on average, while performing better than the baseline for Cond. A and B. The ASR system with MFB-R features, which denote the application of the acoustic FB relevance weighting over the fixed mel filterbank features, also does not yield improvements over the system with baseline MFB features. We hypothesize that learning the relevance weighting jointly with learnable filters allows more freedom in learning the model compared to learning with fixed mel filters.

Table 3. Word error rate (%) on the CHiME-3 Challenge database for multi-condition training.
Test Cond    MFB    PFB    RAS    MHE    A-R    A-R,M-R
Sim dev      12.9   13.3   14.7   13.0   12.4
Real dev     9.9    10.7   11.4   10.2   9.9
Avg.         11.4   12.0   13.0   11.6   11.2
Sim eval     19.8   19.4   22.7   19.7   19.0
Real eval    18.3   19.2   20.5   18.5   17.2
Avg.         19.1   19.3   21.6   19.1   18.1
Table 4. WER (%) for cross-domain ASR experiments.

Filters learned on    ASR trained and tested on
                      Aurora-4    CHiME-3
Aurora-4              9.1         14.3
CHiME-3               9.2         14.2
The proposed (A-R,M-R) representation learning (two-stage relevance weighting) provides considerable improvements in ASR performance over the baseline system, with consistent average relative improvements over the baseline MFB features. Furthermore, the improvements in ASR performance are seen across all the noisy test conditions and also hold with a sophisticated RNN-LM. In addition, the performance achieved is considerably better than results such as the excitation based (EB) features reported in [33].

For comparison with the SincNet method [11], our cosine-modulated Gaussian filterbank is replaced with the sinc filterbank as the kernels in the first convolutional layer (the acoustic FB layer in Fig. 3). The ASR system with the sinc FB (Sinc) is trained jointly without any relevance weighting, keeping the rest of the architecture the same as shown in Fig. 3. From the results, it can be observed that the parametric sinc FB (without relevance weighting) performs similar to MFB and to our learned filterbank A. In addition, the results with relevance weighting on the sinc filterbank (S-R,M-R) show that relevance weighting is also applicable to other prior works on learnable front-ends.

3.2. CHiME-3 ASR

The CHiME-3 corpus for ASR contains multi-microphone tablet device recordings from everyday environments, released as a part of the 3rd CHiME challenge [20]. Four varied environments are present: cafe (CAF), street junction (STR), public transport (BUS) and pedestrian area (PED). For each environment, two types of noisy speech data are present: real and simulated. The real data consists of 6-channel recordings of sentences from the WSJ corpus spoken in the environments listed above. The simulated data was constructed by artificially mixing clean utterances with environment noises. The training data has 1600 real noisy recordings and 7138 simulated noisy utterances. We use the beamformed audio in our ASR training and testing. For the development (dev) and evaluation (eval) sets, the sentences are read by four different talkers in the four CHiME-3 environments, resulting in 1640 (4 × 410) real development and 1320 (4 × 330) real evaluation utterances.

The results for the CHiME-3 dataset are reported in Table 3. The ASR system with SincNet performs similar to the baseline MFB features. The initial approach of raw waveform filter learning with acoustic FB relevance weighting (A-R) improves over the baseline system as well as over the other noise-robust front-ends considered here. The proposed approach of 2-stage relevance weighting over the learned acoustic and modulation representations provides significant improvements over the baseline features on the eval set.
Fig. 4. ASR performance in WER (%) on the VOiCES database for the baseline MFB features and the proposed approach on the dev and eval sets.
In a subsequent analysis, we perform a cross-domain ASR experiment, i.e., we use the acoustic filterbank learned on one of the datasets (either Aurora-4 or CHiME-3) to train/test ASR on the other dataset. The results of these cross-domain filter learning experiments are reported in Table 4. The rows of the table show the database used to learn the acoustic FB, and the columns show the dataset used to train and test the ASR (all other layers in Figure 3 are learned on the ASR task). The performance reported in this table is the average WER on each of the datasets. The results in Table 4 illustrate that the filter learning process is relatively robust to the domain of the training data, suggesting that the proposed approach can generalize to other "matched" tasks.
3.3. VOiCES ASR

The Voices Obscured in Complex Environmental Settings (VOiCES) corpus is a Creative Commons speech dataset used as part of the VOiCES Challenge [21]. The training set consists of read speech utterances sampled at 16 kHz, with each utterance comprising short segments of read speech. We performed a 1-fold reverberation and noise augmentation of the data using Kaldi [29]. The ASR development set consists of distant recordings from the VOiCES dev speakers, collected over multiple microphones; the evaluation set consists of distant recordings from the VOiCES eval speakers, also collected over multiple microphones. The ASR performance on the VOiCES dataset with the baseline MFB features and our proposed approach (A-R,M-R) of 2-step relevance weighting is reported in Figure 4. These results suggest that the proposed model also scales to relatively larger ASR tasks, where consistent improvements are obtained with the proposed approach.
4. SUMMARY
The summary of the work is as follows.

• Extending the previous efforts on the 2-stage relevance weighting approach with the use of embedding feedback from past predictions.
• Incorporating the feedback in the form of word2vec style senone embeddings for the task of learning representations.
• Performance gains in terms of word error rate on multiple ASR tasks.
5. REFERENCES

[1] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," Proc. of International Conference on Learning Representations (ICLR), arXiv preprint arXiv:1301.3781, 2013.
[3] Dimitri Palaz, Ronan Collobert, and Mathew Magimai Doss, "Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks," Proceedings of Interspeech, pp. 1766–1770, 2013.
[4] Tara N Sainath, Brian Kingsbury, Abdel Rahman Mohamed, and Bhuvana Ramabhadran, "Learning filter banks within a deep neural network framework," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 297–302.
[5] Zoltán Tüske, Pavel Golik, Ralf Schlüter, and Hermann Ney, "Acoustic modeling with deep neural networks using raw time signal for LVCSR," in Proc. of Interspeech, 2014, pp. 890–894.
[6] Yedid Hoshen, Ron J Weiss, and Kevin W Wilson, "Speech acoustic modeling from raw multichannel waveforms," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4624–4628.
[7] Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Proc. of Interspeech, 2015, pp. 1–5.
[8] Hardik B Sailor and Hemant A Patil, "Filterbank learning using convolutional restricted Boltzmann machine for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5895–5899.
[9] Purvi Agrawal and Sriram Ganapathy, "Unsupervised raw waveform representation learning for ASR," Proc. of Interspeech, pp. 3451–3455, 2019.
[10] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli, "wav2vec: Unsupervised pre-training for speech recognition," Proc. of Interspeech, pp. 3465–3469, 2019.
[11] Mirco Ravanelli and Yoshua Bengio, "Interpretable convolutional filters with SincNet," in Proc. of Neural Information Processing Systems (NIPS), 2018.
[12] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," Proc. of Interspeech, pp. 161–165, 2019.
[13] Sarel Van Vuuren and Hynek Hermansky, "Data-driven design of RASTA-like filters," in Eurospeech, 1997, vol. 1, pp. 1607–1610.
[14] Jeih-Weih Hung and Lin-Shan Lee, "Optimization of temporal filters for constructing robust features in speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 808–832, 2006.
[15] Hardik B Sailor and Hemant A Patil, "Unsupervised learning of temporal receptive fields using convolutional RBM for ASR task," in European Signal Processing Conference (EUSIPCO), IEEE, 2016, pp. 873–877.
[16] Purvi Agrawal and Sriram Ganapathy, "Modulation filter learning using deep variational networks for robust speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 244–253, 2019.
[17] Purvi Agrawal and Sriram Ganapathy, "Interpretable representation learning for speech and audio signals based on relevance weighting," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2823–2836, 2020.
[18] Ronald J Williams and David Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
[19] Hans-Günter Hirsch and David Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in ASR2000 - Automatic Speech Recognition: Challenges for the New Millennium, ISCA Tutorial and Research Workshop (ITRW), 2000.
[20] Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in IEEE Workshop on ASRU, 2015, pp. 504–511.
[21] Mahesh Kumar Nandwana, Julien Van Hout, Colleen Richey, Mitchell McLaren, Maria Alejandra Barrios, and Aaron Lawson, "The VOiCES from a distance challenge 2019," Proc. of Interspeech, pp. 2438–2442, 2019.
[22] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus," NASA STI/Recon Technical Report, vol. 93, 1993.
[23] Purvi Agrawal and Sriram Ganapathy, "Robust raw waveform speech recognition using relevance weighted representations," in Proc. of Interspeech, 2020, pp. 1649–1653.
[24] Yu Zhang, Pengyuan Zhang, and Yonghong Yan, "Attention-based LSTM with multi-task learning for distant speech recognition," Proc. of Interspeech, pp. 3857–3861, 2017.
[25] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[26] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[27] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," Proc. of ICML, pp. 448–456, 2015.
[28] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan, "PyTorch," computer software, vers. 0.3, 2017.
[29] Daniel Povey et al., "The Kaldi speech recognition toolkit," in IEEE ASRU, IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
[30] Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," Proc. of ICLR, arXiv preprint arXiv:1412.6980, 2015.
[31] Chanwoo Kim and Richard M Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," in ICASSP, 2012, pp. 4101–4104.
[32] Seyed Omid Sadjadi, Taufiq Hasan, and John HL Hansen, "Mean Hilbert envelope coefficients (MHEC) for robust speaker recognition," in Proc. of Interspeech, 2012.
[33] Thomas Drugman, Yannis Stylianou, Langzhou Chen, Xie Chen, and Mark JF Gales, "Robust excitation-based features for automatic speech recognition," in ICASSP, 2015.