Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski
Abstract—We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature.
Index Terms—Speech recognition, speaker adaptation, speaker embeddings, structured linear transforms, regularization, data augmentation, domain adaptation, accent adaptation, semi-supervised learning
Authors have made equal contributions and are listed alphabetically. Peter Bell, Joachim Fainberg, Ondrej Klejch, and Steve Renals are with the Centre for Speech Technology Research, University of Edinburgh, Edinburgh EH8 9AB, UK (email: [email protected], [email protected], [email protected], [email protected]). Jinyu Li is with the Microsoft Corporation, Redmond, WA 98052 USA (email: [email protected]). Pawel Swietojanski did this work at the School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia. He is now with Apple, Cambridge, UK (email: [email protected]). This work was partially supported by the EPSRC Project EP/R012180/1 (SpeechWave), a PhD studentship funded by Bloomberg, and the EU H2020 project ELG (grant agreement 825627). This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Air Force Research Laboratory (AFRL) contract.

I. INTRODUCTION

THE performance of automatic speech recognition (ASR) systems has improved dramatically in recent years thanks to the availability of larger training datasets, the development of neural network based models, and the computational power to train such models on these datasets [1]–[4]. However, the performance of ASR systems can still degrade rapidly when their conditions of use (test conditions) differ from the training data. There are several causes for this, including speaker differences, variability in the acoustic environment, and the domain of use.

Adaptation algorithms attempt to alleviate the mismatch between the test data and an ASR system's training data. Adapting an ASR system is a challenging problem since it requires the modification of large and complex models, typically using only a small amount of target data and without explicit supervision. Speaker adaptation – adapting the system to a target speaker – is the most common form of adaptation, but there are other important adaptation targets such as the domain of use, and the spoken accent. Much of the work in the area has focused on speaker adaptation: it is the case that many approaches developed for speaker adaptation do not explicitly model speaker characteristics, and can be applied to other adaptation targets. Thus our core treatment of adaptation algorithms is in the context of speaker adaptation, with a later discussion of particular approaches for domain adaptation and accent adaptation.

Adaptation algorithms require a data set for adaptation, which should be well-matched to the target test data. In the ideal case, the adaptation data would be labeled with a gold-standard transcription, to enable supervised learning algorithms to be used for adaptation. However, supervised data is rarely available: small amounts may be available for some domain adaptation tasks (for example, adapting a system trained on typical speech to disordered speech [5]). In the usual case, where supervised adaptation data is not available, supervised training algorithms can still be used with "pseudo-labels" obtained from a trained (non-adapted) system by semi-supervised training [6] or by teacher-student training [7]. Alternatively, unsupervised training can be applied to learn embeddings for the different adaptation classes, such as i-vectors [8] or bottleneck features extracted from an auto-encoder neural network [9].

A second aspect of annotation required for adaptation is labeling of the adaptation class. Adaptation to the speaker can only reliably take place if there is metadata containing this information. In some cases – for example lecture recordings and telephony – this may be available. In other cases potentially inaccurate metadata is available, for instance in the transcription of television or online broadcasts. In many cases (for instance, anonymous voice search) speaker metadata is not available. In the absence of speaker metadata, the adaptation can take place at the utterance level [10], or automatic clustering approaches can be used to define the adaptation classes [11], [12]. This is discussed in Sec. II.

This overview focuses on the adaptation of neural network (NN) based speech recognition systems, although we briefly discuss earlier approaches to speaker adaptation in Sec. III.
Speaker adaptation algorithms for hidden Markov model (HMM) based systems are reviewed in more detail by Woodland [13] and Shinoda [14]. As we discuss, some of the algorithms developed for HMM-based systems, in particular
feature transformation approaches, have been successfully applied to NN-based systems.

NN-based systems [1], [15], [16] have revolutionized the field of speech recognition, and there has been intense activity in the development of adaptation algorithms for such systems. Adaptation of NN-based speech recognition is an exciting research area for at least two reasons: from a practical point of view, it is important to be able to adapt state-of-the-art systems; and from a theoretical point of view the fact that NNs require fewer constraints on the input than a Gaussian-based system, along with the gradient-based discriminative training which is at the heart of most NN-based speech recognition systems, opens a range of possible adaptation algorithms.

Neural networks were first applied to speech recognition as so-called NN/HMM hybrid systems, in which the neural network is used to estimate (scaled) likelihoods that act as the HMM state observation probabilities [15] (Fig. 1a). During the 1990s both feed-forward networks [15] and recurrent neural networks (RNNs) [17] were used in such hybrid systems and close to state-of-the-art results were obtained [18]. These systems were largely context-independent, although context-dependent NN-based acoustic models were also explored [19]. The modeling power of neural network systems at that time was computationally limited, and they were not able to achieve the precise levels of modeling obtained using context-dependent GMM-based HMM systems, which became the dominant approach. However, increases in computational power enabled deeper neural network models to be learned, along with context-dependent modeling using the same number of context-dependent HMM tied states (senones) as GMM-based systems [1], [2]. This led to the development of systems surpassing the accuracy of GMM-based systems. This increase in computational power also enabled more powerful neural network models to be employed, in particular time-delay neural networks (TDNNs) [20], [21], convolutional neural networks (CNNs) [22], [23], long short-term memory (LSTM) RNNs [24], [25], and bidirectional LSTMs [26], [27].

Since 2015, there has been a significant trend in the field moving from hybrid HMM/NN systems to end-to-end (E2E) NN modeling [4], [16], [28]–[34] for ASR. E2E systems are characterized by the use of a single model transforming the input acoustic feature stream to a target stream of output tokens, which might be constructed of characters, subwords, or even words. E2E models are optimized using a single objective function, rather than comprising multiple components (acoustic model, language model, lexicon) that are optimized individually. Currently, the most widely used E2E models are connectionist temporal classification (CTC) [35], [36], the RNN Transducer (RNN-T) model [31], [37], and the attention-based encoder-decoder (AED) model [16], [28].

CTC and the RNN-T both map an input speech feature sequence to an output label sequence, where the label sequence (typically characters) is considerably shorter than the input sequence. Both of these architectures use an additional blank output token to deal with the sequence length differences, with an objective function which sums over all possible alignments using the forward-backward algorithm [38].
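The following minimal PyTorch sketch (not taken from the paper; all dimensions and the symbol inventory are illustrative) shows how such an alignment-marginalizing objective is typically used in practice: the encoder emits per-frame label posteriors including a blank symbol, and the loss internally sums over all valid alignments.

# Minimal sketch (PyTorch, illustrative only): a CTC objective maps a long
# acoustic frame sequence to a shorter label sequence by marginalizing over
# all alignments that use an extra blank symbol (index 0 here).
import torch
import torch.nn as nn

num_classes = 30          # e.g. 29 characters + 1 blank (assumed inventory)
T, N, U = 100, 4, 12      # frames, batch size, target length (illustrative)

encoder_out = torch.randn(T, N, num_classes, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, num_classes, (N, U))     # label indices, excluding blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)   # forward-backward summation over alignments is internal
loss = ctc(encoder_out, targets, input_lengths, target_lengths)
loss.backward()             # gradients flow back into the acoustic encoder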
CTC is an earlier, and simpler, method which assumes frame independence and functions similarly to the acoustic model in hybrid systems without modeling the linguistic dependency across words; its architecture is similar to that of the neural network in the hybrid system (Fig. 1a).

An RNN-T (Fig. 1b) combines an additional prediction network with the acoustic encoder. The prediction network is an RNN modeling linguistic dependencies whose input is the previously output symbol. Since the prediction network does not use the speech data, it is possible to train it on additional text data. The acoustic encoder and the prediction network are combined using a feed-forward joint network followed by a softmax to predict the next output token given the speech input and the linguistic context. Together, the RNN-T's prediction and joint networks may be regarded as a decoder, and we can view the RNN-T as a form of encoder-decoder system. The AED architecture (Fig. 1c) enriches this model with an additional attention network which interfaces the acoustic encoder with the decoder. The attention network operates on the entire sequence of encoder representations for an utterance, offering the decoder considerably more flexibility. A detailed comparison of popular E2E models in both streaming and non-streaming modes with large scale training data was conducted by Li et al [39].

We present a general framework for adaptation of NN-based speech recognition systems (both hybrid and E2E) in Sec. IV, where we organize adaptation algorithms into three general categories: embedding-based approaches (discussed in Sec. V), model-based approaches (discussed in Secs. VI–VIII), and data augmentation approaches (discussed in Sec. IX). As mentioned above, our treatment of adaptation algorithms is in the context of speaker adaptation. In Secs. X and XI we discuss specific approaches to accent adaptation and domain adaptation respectively. Our primary focus is on the adaptation of acoustic models and end-to-end models. In Sec. XII we provide a summary of work in language model (LM) adaptation, mentioning both n-gram and neural network language models, and the use of LM adaptation in E2E systems.

Adaptation and transfer learning have become important and intensively researched topics in other areas related to machine learning, most notably computer vision and natural language processing (NLP). In both these cases the motivation is to train powerful base models using large amounts of training data, then to adapt these to specific tasks or domains, for which considerably less training data is available. In computer vision, the base model is typically a large convolutional network trained to perform image classification or object recognition using the ImageNet database [40], [41]. The ImageNet model is then adapted to a lower resource task, such as computer-aided detection in medical imaging [42]. Kornblith et al [43] have investigated empirically how well ImageNet models transfer to different tasks and datasets.

Transfer learning in NLP differs from computer vision, and from the speech recognition approaches discussed in this paper, in that the base model is trained in an unsupervised fashion to perform language modeling or a related task, typically using web-crawled text data. Base models used for NLP include the bidirectional LSTM [44] and Transformers
which make use of self-attention [45], [46]. These models are then trained on specific NLP tasks, with supervised training data, which is specified in a common format (e.g. text-to-text transfer [46]), often trained in a multi-task setting. Earlier adaptation approaches in NLP focused on feature adaptation (e.g. [47]), but more recently better results have been obtained using model-based adaptation, for instance "adapter layers" [46], [48], in which trainable transform layers are inserted into the pretrained base model.

More broadly there has been extensive work on domain adaptation and transfer learning in machine learning, reviewed by Kouw and Loog [49]. This includes work on few-shot learning [50]–[52] and normalizing flows [53], [54]. Normalizing flows, which provide a probabilistic framework for feature transformations, were first developed for speech recognition as Gaussianization [55], and more recently have been applied to speech synthesis [56] and voice transformation [57].

Finally we provide a meta-analysis of experimental studies using the main adaptation algorithms that we have discussed (Sec. XIII). The meta-analysis is based on experiments reported in 45 papers, carried out using 33 datasets, and is primarily based on the relative error rate reduction arising from adaptation approaches. In this section we analyze the performance of the main adaptation algorithms across a variety of adaptation target types (for instance speaker, domain, and accent), in supervised and unsupervised settings, in six different languages, and using six different NN model types in both hybrid and end-to-end settings.

Fig. 1: NN architectures used for hybrid NN/HMM and end-to-end (CTC, RNN-T, AED) speech recognition systems: (a) scheme of NN architecture used for NN/HMM hybrid systems and for connectionist temporal classification (CTC); (b) architecture for the RNN Transducer (RNN-T); (c) architecture for attention-based encoder-decoder (AED) end-to-end systems.

II. IDENTIFYING ADAPTATION TARGETS
Adaptation aims to reduce the mismatch between training and test conditions. For an adaptation algorithm to be effective, the distribution of the adaptation data should be close to that encountered in test conditions. For the task of acoustic adaptation this requirement is typically satisfied by forming the adaptation data from one or more speech segments from given testing conditions (i.e. the same speaker, accent, domain, or acoustic environment). While for some tasks labels ascribed to speech segments may exist, allowing segments to be grouped into larger adaptation clusters, it is unrealistic to assume the availability of such metadata in general. However, depending on the application and the operating regime of the ASR system, it may be possible to derive reasonable proxies.
Utterance-level adaptation derives adaptation statistics using a single speech segment. This waives the requirement to carry information about speaker identity between utterances, which may simplify deployment of a recognition system – in terms of both engineering and privacy – as one does not need to estimate and store offline speaker-specific information. On the other hand, owing to the small amounts of data available for adaptation, the gains are usually lower than one could obtain with speaker-level clusters. While many approaches use utterances to directly extract corresponding embeddings to use as an auxiliary input for the acoustic model [8], [58]–[60], one can also build a fixed inventory of speaker, domain, or topic codes [61] or embeddings [62], [63] when learning the acoustic model or acoustic encoder, and then use the test utterance to select a combination of these at the test stage. The latter approach alleviates the necessity of estimating an
accurate representation from small amounts of data. It may be possible to relax the utterance-level constraint by iteratively re-estimating adaptation statistics using a number of preceding segment(s) [58]. Extra care usually needs to be taken to handle silence and speech uttered by different speakers, as failing to do so may deteriorate the overall ASR performance [63]–[65].
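As a schematic illustration of the silence-handling point above, the following NumPy sketch (not from the paper; the energy gate, thresholds and feature layout are assumptions standing in for a proper VAD or alignment-based mask) derives an utterance-level statistic from frame-level vectors while skipping putative silence frames.

# Schematic sketch (NumPy, assumptions noted): compute an utterance-level
# embedding as the mean of frame-level vectors, skipping putative silence
# frames so that non-speech does not dominate the adaptation statistics.
import numpy as np

def utterance_embedding(frames: np.ndarray, log_energy: np.ndarray,
                        energy_floor_db: float = -40.0) -> np.ndarray:
    """frames: (T, d) frame-level features or frame-level embeddings.
    log_energy: (T,) per-frame log-energies (assumed available from the front end).
    A crude energy gate stands in for a proper VAD / alignment-based speech mask."""
    voiced = log_energy > (log_energy.max() + energy_floor_db)
    if not np.any(voiced):           # fall back to all frames if the gate is empty
        voiced = np.ones(len(frames), dtype=bool)
    return frames[voiced].mean(axis=0)

# Example: 300 frames of 100-dimensional frame-level vectors
emb = utterance_embedding(np.random.randn(300, 100), np.random.randn(300))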
Speaker-level adaptation aggregates statistics across two or more segments uttered by the same talker, requiring a way to group adaptation utterances produced by different talkers. The generic approach to this problem relies on a speaker diarization system [66], which can identify speakers and accordingly assign their identities to the corresponding segments in the recordings. This is often used in the offline transcription of meetings or broadcast media. Some transcription tasks, such as lectures or telephone conversations, allow the assumption of speaker identities across a whole recording or conversation side. Such an approach is likely to result in the estimation of many adaptation transforms for the same (physical) speaker appearing across multiple recordings, but this is not an issue if there is enough acoustic material in each recording.
Domain-level adaptation broadens the speaker-level cluster by including speech produced by multiple talkers characterized by some common characteristic such as accent, age, medical condition, topic, etc. This typically results in more adaptation material and an easier annotation process (cluster labels need to be assigned at batch rather than segment level). As such, domain adaptation can usually leverage adaptation transforms with greater capacity, and thus offer better adaptation gains.

Depending on whether adaptation transforms are estimated on held out data, or adaptation is iteratively derived from test segments, we will refer to these as enrolment or online modes, respectively. Semi-supervised techniques refer to unsupervised learning that requires training targets which are automatically produced from a seed model. A two-pass system is a special case for which the necessary statistics are estimated from test data using a first pass decoding with a speaker-independent model in order to obtain adaptation labels, followed by a second pass with the speaker-adapted model. Finally, the enrolment approach can be estimated in either supervised or unsupervised modes, depending on whether the adaptation targets on held-out data were derived in a manual or automatic way. For semi-supervised approaches, it is possible to further filter out regions with low confidence to avoid the reinforcement of potential errors [67]–[69]. There is some evidence in the literature that, for some limited-in-capacity transforms estimated in a semi-supervised manner, the first pass transcript quality has a small impact on the adapted accuracy as long as these are obtained with the corresponding speaker-independent model [70], [71]. In lattice supervision multiple possible transcriptions are used in a semi-supervised setting by generating a lattice or graph, rather than the one-best transcription [72]–[75].

III. ADAPTATION ALGORITHMS FOR HMM-BASED ASR
Speaker adaptation of speech recognition systems has been investigated since the 1960s [76], [77]. In the mid-1990s, the influential maximum likelihood linear regression (MLLR) [78] and maximum a posteriori (MAP) [79] approaches to speaker adaptation for HMM/GMM systems were introduced. These methods, described below, stimulated the field, leading to intense activity in algorithms for the adaptation of HMM/GMM systems, reviewed by Woodland [13] and in section 5 of Gales and Young's broader review of HMM-based speech recognition [80].

In this section we review MAP, MLLR, and related approaches to the adaptation of HMM/GMM systems, along with earlier approaches to speaker adaptation. Many of these early approaches were designed to normalize speaker-specific characteristics, such as vocal tract length, building on linguistic findings relating to speaker normalization in speech perception [81], often casting the problem as one of spectral normalization. This work included formant-based frequency warping approaches [76], [77], [82], and the estimation of linear projections to normalize the spectral representation to a speaker-independent form [83], [84].

Vocal tract length normalization (VTLN) was introduced by Wakita [85] (and again by Andreou [86]) as a form of frequency warping with the aim of compensating for vocal tract length differences across speakers. VTLN was extensively investigated for speech recognition in the 1990s and 2000s [87]–[90], and is discussed further in Sec. V.

In model based adaptation, the speech recognition model is used to drive the adaptation. In work prefiguring subspace models, Furui [91] showed how speaker specific models could be estimated from small amounts of target data in a dynamic time warping setting, learning linear transforms between pre-existing speaker-dependent phonetic templates and templates for a target speaker. Similar techniques were developed in the 1980s by adapting the vector quantization (VQ) used in discrete HMM systems. Shikano, Nakamura, and Abe [92] showed that mappings between speaker dependent codebooks could be learned to model a target speaker (a technique widely used for voice conversion [93]); Feng et al [94] developed a VQ-based approach in which speaker-specific mappings were learned between codewords in a speaker-independent codebook, in order to maximize the likelihood of the discrete HMM system. Rigoll [95] introduced a related approach in which the speaker-specific transform took the form of a Markov model. A continuous version of this approach, referred to as probabilistic spectrum fitting, which aimed to adjust the parameters of a Gaussian phonetic model, was introduced by Hunt [96] and further developed by Cox and Bridle [97].

These probabilistic spectral modeling approaches can be viewed as precursors to maximum likelihood linear regression (MLLR), introduced by Leggetter and Woodland [78] and generalized by Gales [98]. MLLR applies to continuous probability density HMM systems, composed of Gaussian probability density functions. In MLLR, linear transforms are estimated to adapt the mean vectors and covariance matrices of the Gaussian components. If μ and Σ are the mean vector and covariance matrix of a Gaussian, then MLLR adapts the parameters as follows, where A_s, b_s, and H_s are the MLLR parameters for speaker s:

μ̂_s = A_s μ − b_s        (1)
Σ̂_s = H_s Σ [H_s]⊺ .        (2)

The MLLR parameters are estimated using maximum likelihood.
For efficient computation, the likelihood can be computed using the following [98]:

L_MLLR(x; μ, Σ, A_s, b_s, H_s) = log (1 / det H_s) · N([H_s]⁻¹ x; [H_s]⁻¹ A_s μ + [H_s]⁻¹ b_s, Σ) .        (3)

MLLR is a compact adaptation technique since the transforms are shared across Gaussians: for instance all Gaussians corresponding to the same monophone might share mean and covariance transforms. Very often, especially when target data is sparse, a greater degree of sharing is employed – for instance two shared adaptation transforms, one for Gaussians in speech models and one for Gaussians in non-speech models.

Constrained MLLR [98], [99] is an important variant of MLLR, in which the same transform is used for both the mean and covariance:

μ̂ = A′ μ − b′        (4)
Σ̂ = A′ Σ A′⊺        (5)
A′ = A⁻¹        (6)
b′ = A⁻¹ b .        (7)

In this case, the likelihood is given by

L_fMLLR = N(A x_n + b; μ, Σ) + log(det A) .        (8)

It can be seen that this transform of the model parameters is equivalent to applying a linear transform to the data – hence constrained MLLR is often referred to as feature-space MLLR (fMLLR), although it is not strictly feature-space adaptation unless a single transform is shared across all Gaussians in the system. MLLR and its variants have been used extensively in the adaptation of Gaussian mixture model (GMM)-based HMM speech recognition systems [13], [80].

The above model-based adaptation approaches have aimed to estimate transforms between a speaker independent model and a model adapted to a target speaker. An alternative Bayesian approach attempts to perform the adaptation by using the speaker independent model to inform the prior of a speaker-adapted model. If the set of parameters of a speech recognition model are denoted by θ, then maximum likelihood estimation sets θ to maximize the likelihood p(X | θ). In MAP training, the estimation procedure maximizes the posterior of the parameters given the data:

P(θ | X) ∝ p(X | θ) p(θ)^r ,        (9)

where p(θ) is the prior distribution of the parameters, which can be based on speaker independent models, and r is an empirically determined weighting factor. Gauvain and Lee [79] presented an approach using MAP estimation as an adaptation approach for HMM/GMM systems. A convenient choice of function for p(θ) is the conjugate to the likelihood – the function which ensures the posterior has the same form as the prior. For a GMM, if it is assumed that the mixture weights c_i and the Gaussian parameters (μ_i, Σ_i) are independent, then the conjugate prior may take the form of a mixture model p_D(c_i) ∏_i p_W(μ_i, Σ_i), where p_D() is a Dirichlet distribution (conjugate to the multinomial) and p_W() is the normal-Wishart density (conjugate to the Gaussian). This results in the following intuitively understandable parameter estimate for the adapted mean of a Gaussian:

μ̂ = (τ μ + Σ_n γ(n) x_n) / (τ + Σ_n γ(n)) ,        (10)

where μ is the unadapted (speaker-independent) mean, x_n is the n-th adaptation acoustic vector, γ(n) is the component occupation probability (responsibility) for the Gaussian component at time n (estimated by the forward-backward algorithm), and τ is a positive scalar-valued parameter of the normal-Wishart density, which is typically set to a constant empirically (although Gauvain and Lee [79] also discuss an empirical Bayes estimation approach for this parameter). The re-estimated means of the Gaussian components take the form of a weighted interpolation between the speaker independent mean and data from the target speaker.
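To make the two classical updates concrete, the following NumPy sketch (illustrative only; variable names and dimensions are not from the paper) implements the MAP interpolation of Eq. (10) for a single Gaussian mean, and the feature-space application of a constrained MLLR transform as used in Eq. (8).

# Illustrative NumPy sketch of two classical GMM-HMM adaptation updates.
import numpy as np

def map_adapt_mean(mu_si, adapt_frames, gamma, tau=10.0):
    """MAP re-estimation of a Gaussian mean, cf. Eq. (10).
    mu_si: (d,) speaker-independent mean; adapt_frames: (N, d) adaptation vectors;
    gamma: (N,) per-frame occupation probabilities for this component; tau: prior weight."""
    num = tau * mu_si + (gamma[:, None] * adapt_frames).sum(axis=0)
    den = tau + gamma.sum()
    return num / den   # interpolation between the SI mean and target-speaker data

def apply_cmllr(frames, A, b):
    """Constrained MLLR applied in feature space: x' = A x + b, cf. Eq. (8)."""
    return frames @ A.T + b

d = 39
mu_hat = map_adapt_mean(np.zeros(d), np.random.randn(50, d), np.random.rand(50))
x_adapted = apply_cmllr(np.random.randn(100, d), np.eye(d), np.zeros(d))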
When there is no target speaker data for a Gaussian component, the parameters remain speaker-independent; as the amount of target speaker data increases, the Gaussian parameters approach the target speaker maximum likelihood estimate.

In feature-based adaptation approaches, it is usual to adapt or normalize the acoustic features for each speaker in both the training and test sets. For example, in the case of cepstral mean and variance normalization (CMVN), statistics are computed for each speaker and the features normalized accordingly, during both training and test. Likewise, VTLN is also carried out for all speakers, transforming the acoustic features to a canonical form, with the variation from changes in vocal tract length being normalized away. However, in the model-based approaches discussed above (MLLR and MAP), we have implicitly assumed that adaptation takes place at test time: speaker independent models are trained using recordings of multiple speakers in the usual way, with only the test speakers used for adaptation. In contrast to this, it is possible to employ a model-based adaptive training approach.

In speaker adaptive training [100], a transform is estimated for each speaker in the training set, as well as for each speaker in the test set. In the case of MLLR, at each iteration of training, the adaptation transforms are updated for each speaker, followed by an estimate of the canonical speaker independent model given the set of the speaker transforms. Hence adapted mean vectors and covariance matrices are computed for each speaker in the training set. At test time, the target speaker transforms are estimated as usual. Multiple types of adaptive training can be profitably combined – for example performing fMLLR adaptive training as a form of feature normalization, together with MLLR adaptive training for model adaptation [101].

Speaker space approaches represent a speaker-adapted model as a weighted sum of a set of individual models which may represent individual speakers or, more commonly, speaker clusters. In cluster-adaptive training (CAT) [11], the mean for a Gaussian component for a specific speaker s is given by:

μ̂ = Σ_{c=1}^{C} w_c μ_c        (11)

where μ_c is the mean of the particular Gaussian component for speaker cluster c, and w_c is the cluster weight. This expresses the speaker-adapted mean vector as a point in a speaker space. Given a set of canonical speaker cluster models, CAT is efficient in terms of parameters, since only the set of cluster weights need to be estimated for a new speaker. Eigenvoices [102] are an alternative way of constructing speaker spaces, with a speaker model again represented as a weighted sum of canonical models. In the Eigenvoices technique, principal component analysis of "supervectors" (concatenated mean vectors from the set of speaker-specific models) is used to create a basis of the speaker space.

A number of variants of cluster-adaptive training have been presented, including representing a speaker by combining MLLR transforms from the canonical models [11], and using sequence discriminative objective functions such as minimum phone error (MPE) [103]. Techniques closely related to CAT have been used for the adaptation of neural network based systems (Sec. VI).

IV. ADAPTATION ALGORITHMS FOR NN-BASED ASR
The literature describing methods for adaptation of NNs has tended to inherit terminology from the algorithms used to adapt HMM-GMM systems, for which there is an important distinction between feature space and model space formulations of MLLR-type approaches [98], as discussed in the previous section. In a 2017 review of NN adaptation, Sim et al [104] divide adaptation algorithms into feature normalisation, feature augmentation and structured parameterization. (They also use a further category termed constrained adaptation, discussed further below.)

The task of an ASR model is to map a sequence of acoustic feature vectors, X = (x_1, . . . , x_t, . . . , x_T), x_t ∈ R^d, to a sequence of words W. Although – as we discuss below – most techniques described in this paper apply equally to end-to-end models and hybrid HMM-NN models, we generally treat the model to be adapted as an acoustic model. That is, we ignore aspects of adaptation that affect only P(W), independently of the acoustics X (LM adaptation is discussed in Sec. XII). Further, with only a small loss of generality, in what follows we will assume that the model operates in a framewise manner; thus we can define the model as:

y_t = f(x_t; θ)        (12)

where f(x; θ) is the NN model with parameters θ and y_t is the output label at frame t. In a hybrid HMM-NN system, for example, y_t is taken to be a vector of posterior probabilities over a senone set. In a CTC model, y_t would be a vector of posterior probabilities over the output symbol set, plus the blank symbol. Note that NN models often operate on a wider windowed set of input features, x_t(w) = [x_{t−c}, x_{t−c+1}, . . . , x_{t+c−1}, x_{t+c}], with the total window size w = 2c + 1. For reasons of notational clarity, we generally ignore the distinction between x_t and x_t(w), unless it is specifically relevant to a particular topic.

In this framework, we can define feature normalisation approaches as acting to transform the features in a speaker-dependent manner, on which the speaker-independent model operates. For each speaker s, a transformation function g: R^d → R^d computes:

x′_t = g(x_t; φ_s)        (13)

where φ_s is a set of speaker-dependent parameters. This family is closely related to feature space methods used in GMM systems described above in Sec. III, including fMLLR (when only a single affine transform is used), VTLN, and CMVN.

Structured parameterization approaches, in contrast, introduce a speaker-dependent transformation of the acoustic model parameters:

θ′_s = h(θ; ϕ_s)        (14)

In this case, the function h would typically be structured so as to ensure that the number of speaker-dependent parameters ϕ_s is sufficiently smaller than the number of parameters of the original model. Such methods are closely related to model-based adaptation of GMMs such as MLLR.

Finally, feature-augmentation approaches extend the feature vector x_t with a speaker-dependent embedding λ_s, which we can write as

x′_t = [x_t ; λ_s]        (15)

Close variants of this approach use the embedding to augment the input to higher layers of the network. Note that the incorporation of an embedding requires the addition of further parameters to the acoustic model controlling the manner in which the embedding acts to adapt the model, which can be written f(x_t; θ, θ_E). The embedding parameters θ_E are themselves speaker-independent.

We argue that the distinctions described above are not particularly helpful in the field of NN adaptation.
In ML-estimated generative acoustic models such as GMMs, the distinction between feature-space and model adaptation is important (as noted by Gales [98]) because in the former case, different feature space transformations can be carried out per senone class if the appropriate scaling by a Jacobian is performed; in the latter case, it is necessary for the adapted probability density functions to be re-normalized. In NN adaptation, however, all three approaches can be seen to be closely related or even special cases of each other. For example, the normalisation function g can generally be formulated as a shallow NN, possibly without a non-linearity. If there is a set of "identity transform" parameters φ_I such that

g(x_t; φ_I) = x_t, ∀ x_t        (16)

then we have

y_t = f(x_t; θ) = f(g(x_t; φ_I); θ) = f′(x_t; θ, φ_I)        (17)

where f′ is a new network comprising a copy of the original network f with the layers of g prepended. Applying feature normalisation (13) leads to:

y_t = f(x′_t; θ) = f(g(x_t; φ_s); θ) = f′(x_t; θ, φ_s)        (18)

which we can write as a structured parameter transformation of f′, as defined in (14):

θ′_s = {θ, φ_s} = h({θ, φ_I}; ϕ_s)        (19)

where the transformation h(·; ϕ_s) is simply set to replace the parameters pertaining to g with the original normalisation parameters, φ_s = ϕ_s, leaving the other parameters unchanged.

Feature augmentation approaches may be readily seen to be a further special case of structured adaptation. In the simple case of input feature augmentation (15), we see that the output of the first layer, prior to the non-linearity, can be written as

z = W x′ + b = W [x ; λ_s] + b        (20)

where W and b are the weight and bias of the first layer respectively. By introducing a decomposition of W, W = (U V), we can write this as

z = (U V) [x ; λ_s] + b = U x + b + V λ_s        (21)

with U ∈ θ and V ∈ θ_E being weight matrices pertaining to the input features and speaker embedding, respectively. This can be expressed as a structured transformation of the bias:

θ′_s = {U′, b′} = h({U, b}; ϕ_s) = {U, b + V λ_s}        (22)

with ϕ_s = V λ_s. Similar arguments apply to embeddings used in other network layers.

Certain types of feature normalisation approaches can be expressed as feature augmentation. For example, cepstral mean normalisation, given by

x′_t = g(x_t; φ_s) = x_t − μ_s        (23)

can be expressed as

z = W(x − μ_s) + b = (W W) [x ; −μ_s] + b        (24)

with augmented features λ_s = −μ_s.

Approaches to NN adaptation under the traditional categorization of feature augmentation, structured parameterization and feature normalization can usually be seen as special cases of one another.
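The identity in (20)–(22) is easy to verify numerically. The following NumPy sketch (illustrative dimensions only, not from the paper) checks that appending a speaker embedding to the input produces exactly the same pre-activation as adding the speaker-dependent offset V λ_s to the first-layer bias.

# Numerical check (NumPy, illustrative dimensions) of Eqs. (20)-(22):
# augmenting the input with an embedding equals adapting the first-layer bias.
import numpy as np

d, k, h = 40, 10, 64                  # feature dim, embedding dim, hidden width
rng = np.random.default_rng(0)
U = rng.standard_normal((h, d))       # weights acting on the acoustic features
V = rng.standard_normal((h, k))       # weights acting on the speaker embedding
b = rng.standard_normal(h)
x = rng.standard_normal(d)
lam_s = rng.standard_normal(k)        # speaker embedding

W = np.concatenate([U, V], axis=1)    # W = (U V)
z_augmented = W @ np.concatenate([x, lam_s]) + b        # Eqs. (20)-(21)
z_bias_adapted = U @ x + (b + V @ lam_s)                # Eq. (22): b' = b + V lam_s
assert np.allclose(z_augmented, z_bias_adapted)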
Therefore, in the remainder of this paper, we adopt an alternative categorization:

• Embedding-based approaches, in which any speaker-dependent parameters are estimated independently of the model, with the model f(x_t; θ) itself being unchanged between speakers, other than the possible need to add additional embedding parameters θ_E;
• Model-based approaches, in which the model parameters θ are directly adapted to data from the target speaker according to the primary objective function;
• Data augmentation approaches, which attempt to synthetically generate additional training data with a close match to the target speaker, by transforming the existing training data.

This distinction is, we believe, particularly important in speaker adaptation of NNs because in ASR it has become standard to perform adaptation in a semi-supervised manner, with no transcribed adaptation data for the target speaker. In this setting, as we will discuss, standard objective functions such as cross-entropy, which may be very effective in supervised training or adaptation, are particularly susceptible to transcription errors in semi-supervised settings.

We describe the model-independent approaches as embedding-based because any set of speaker-dependent parameters can be viewed as an embedding. Embedding-based approaches are discussed in Sec. V. Well-known examples of speaker embeddings include i-vectors [8], [105] and x-vectors [106], but they can also include parameter sets more classically viewed as normalizing transforms such as CMVN statistics and global fMLLR transforms (see Sec. III above). However, for the reasons mentioned above, we exclude from this category methods where the embedding is simply a subset of the primary model parameters and estimated according to the model's objective function. Note that methods using a one-hot encoding for each speaker are also excluded, since it would be impossible to use these with a speaker-independent model, without each test speaker having been present in the training data; such methods might however be useful for closely related tasks such as domain adaptation, discussed in Sec. XI.

The primary benefit of speaker adaptive approaches over simply using speaker-dependent models is the prevention of over-fitting to the adaptation data (and its possibly errorful transcript). A large number of model-based adaptation techniques have been proposed to achieve this; in this paper, we sub-divide them into:

• Structured transforms: methods in which a subset of the parameters are adapted, with many instances structuring the model so as to permit a reduced number of speaker-dependent parameters, as in LHUC [71], [107]. These can be viewed as an analogy to MLLR transforms for GMMs. They are discussed in Sec. VI.
• Regularization: methods with explicit regularization of the objective function to prevent over-fitting to the adaptation data, examples including the use of L2 loss or KL divergence terms to penalize the divergence from the speaker-independent parameters [108], [109]. Such methods can be viewed as related to the MAP approach for GMM adaptation. They are discussed in Sec. VII.
• Variant objective functions: methods which adopt variants of the primary objective function to overcome the problems of noise in the target labels, with examples including the use of lattice supervision [75] or multi-task learning [110]. They are discussed in Sec. VIII.

The second two categories above are collectively termed constrained adaptation in the review by Sim et al [104]. Within this, multi-task learning is labeled by Sim et al as attribute aware training; however, we do not believe that all multi-task learning approaches to adaptation can be labeled in this way.

Data augmentation methods have proved very successful in adaptation to other sources of variability, particularly those – such as background noise conditions – where the required model transformations are hard to explicitly estimate, but where it is easy to generate realistic data. In the case of speaker
adaptation, it is significantly harder to generate sufficiently good-quality synthetic data for a target speaker, given only limited data from the speaker in question. However, there is a growing body of work in this area using, for example, techniques from the field of speech synthesis [111]. Approaches in this area are discussed in Sec. IX.

Most works suitable for adapting hybrid acoustic models can be leveraged to adapt acoustic encoders in E2E models. Both Kullback-Leibler divergence (KLD) regularization (Sec. VII) and multi-task learning (MTL) methods (Sec. VIII) have been used for speaker adaptation of CTC and AED models [112], [113]. Sim et al [114] updated the acoustic encoder of RNN-T models using speaker-specific adaptation data. Furthermore, by generating text-to-speech (TTS) audio from the target speaker, more data can be used to adapt the acoustic encoder. Such data augmentation adaptation (discussed in Sec. IX) was shown to be an effective way for the speaker adaptation of E2E models [115] even with very limited raw data from the target speaker. Embeddings have also been used to train a speaker-aware Transformer AED model [116].

Because AED and RNN-T also have components corresponding to the language model, there are also techniques specific to adapting the language modeling aspect of E2E models, for instance using a text embedding instead of an acoustic embedding to bias an E2E model in order to produce outputs relevant to the particular recognition context [117]–[119]. If the new domain differs from the source domain mainly in content instead of acoustics, domain adaptation of E2E models can be performed by either interpolating the E2E model with an external language model (Sec. XII) or updating language model related components inside the E2E model with text-to-speech audio generated from text in the new domain [120], [121], as discussed in Sec. XI.

V. SPEAKER EMBEDDINGS
Speaker embeddings map speakers to a continuous space. In this section we consider embeddings that may be extracted in a manner independent of the model. They can therefore also be useful in a standalone manner for other tasks such as speaker recognition. When used with an acoustic model, the model learns how to incorporate the embedding information by, in effect, speaker-aware training. Speaker embeddings may encode speaker-level variations that are otherwise difficult for the AM to learn from short-term features [65], and may be included as auxiliary features to the network. Specifically, let x ∈ R^d denote the acoustic features, and λ_s ∈ R^k a k-dimensional speaker embedding. The speaker embeddings may be concatenated with the acoustic input features, as previously seen in (15):

x′_t = [x_t ; λ_s]        (25)

Alternatively they may be concatenated with the activations of a hidden layer. In either case the result is bias adaptation of the next hidden layer, as discussed in Sec. VI. As noted by Delcroix et al. [122], the auxiliary features may equivalently be added directly to the features using a learned projection matrix P, with the benefit that the downstream architecture can remain unchanged:

x′_t = x_t + P λ_s        (26)

There are many other ways to incorporate embeddings into the AM: for example, they may be used to scale neuron activations as in LHUC [71]. More generally we may consider embeddings applied to either biases or activations through context-adaptive [123] or control networks [124]. It is possible to limit connectivity from the auxiliary features to the rest of the network in order to improve robustness at test time or to better incorporate static features [125]–[127]. Later in this section we shall discuss embeddings used as label targets, as well as embeddings as transformations of the input features themselves.

Since embeddings are estimated independently of the AM, there is a large variety of extraction methods, which are typically unsupervised with respect to the transcript. Many types of embeddings stem from research in speaker verification and speaker recognition. One such approach is identity vectors, or i-vectors [8], [105], [128], which are estimated using means from GMMs trained on the acoustic features. Specifically, the extraction of a speaker i-vector, λ_s ∈ R^k, assumes a linear relationship between the global means from a background GMM (or universal background model, UBM), m_g ∈ R^m, and the speaker-specific means, m_s ∈ R^m:

m_s = m_g + T λ_s        (27)

where T ∈ R^{m×k} is a matrix that is shared across all speakers, sometimes called the total variability matrix from its relation to joint factor analysis [129]. An i-vector thus corresponds to coordinates in the column space of T. T is estimated iteratively using the EM algorithm. It is possible to replace the GMM means with posteriors or alignments from the AM [125], [130], [131], although this is no longer independent of the AM and requires transcriptions. The i-vectors are usually concatenated with the acoustic features as discussed above, but have also been used in more elaborate architectures to produce a feature mapping of the input features themselves [132], [133]. Some approaches extract low-dimensional embeddings from bottleneck layers in neural network models trained to distinguish between speakers [65], [126], or across multiple layers followed by dimensionality reduction in a separate AM (e.g. CNN embeddings [134]).
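Whichever extraction method is used, the resulting embedding is typically consumed by the acoustic model as in (25) or (26). The following PyTorch sketch (assumed, illustrative dimensions; not from the paper) shows both options for speaker-aware training: concatenating the embedding with the input features, or adding a learned projection of it to the features so that the downstream architecture is unchanged.

# Sketch (PyTorch, assumed dimensions): two common ways of feeding a speaker
# embedding such as an i-vector to the acoustic model.
import torch
import torch.nn as nn

feat_dim, emb_dim, hidden = 40, 100, 512

# (a) Concatenation, Eq. (25): the first layer sees [x_t ; lambda_s].
concat_layer = nn.Linear(feat_dim + emb_dim, hidden)

# (b) Learned projection added to the features, Eq. (26): x_t + P lambda_s,
#     which leaves the downstream architecture unchanged.
proj = nn.Linear(emb_dim, feat_dim, bias=False)

x = torch.randn(8, 200, feat_dim)          # batch of 200-frame utterances
lam = torch.randn(8, emb_dim)              # one embedding per utterance/speaker
lam_t = lam.unsqueeze(1).expand(-1, x.size(1), -1)   # broadcast over frames

h_a = torch.relu(concat_layer(torch.cat([x, lam_t], dim=-1)))
x_b = x + proj(lam_t)                      # adapted features, same dimensionality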
One such approach, using Bottleneck Speaker Vector (BSV) embeddings [65], trains a feed-forward network to predict speaker labels (and silence) from spliced MFCCs (Fig. 2a). Tan et al [126] proposed to add a second objective to predict monophones in a multi-task setup. The bottleneck layer dimension is typically set to values commonly used for i-vectors. In fact, Huang and Sim [65] note that if the speaker label targets are replaced with speaker deviations from a UBM, then the bottleneck features may be considered frame-level i-vectors. The extracted features are averaged across all speech frames, T_s, of a given speaker by a simple average:

λ_s = (1 / T_s) Σ_{t=1}^{T_s} λ_{s,t}        (28)

There are a number of later approaches that we may collectively refer to as ⋆-vectors. Like bottleneck features, these approaches typically extract embeddings from neural networks trained to discriminate between speakers, but not necessarily using a low-dimensional layer. For instance, deep vectors, or d-vectors [135], [136], extract embeddings from feed-forward or LSTM networks trained on filterbank features to predict speaker labels. The activations from the last hidden layer are averaged over time. X-vectors [106], [124] use TDNNs with a pooling layer that collects statistics over time, and the embeddings are extracted following a subsequent affine layer. A related approach called r-vectors [137] uses the architecture of x-vectors, but predicts room impulse response (RIR) labels rather than speaker labels. In contrast to the above approaches, label embeddings, or l-vectors [138], are designed to be used as soft output targets for the training of an AM. Each label embedding represents the output distribution for a particular senone target. In this way they are, in effect, uncoupled from the individual data points and can be used for domain adaptation without a requirement of parallel data. We will discuss this idea further in Sec. XI. For completeness we also mention h-vectors [139], which use a hierarchical attention mechanism to produce utterance-level embeddings, but have only been applied to speaker recognition tasks.

X-vector embeddings are not widely used for adapting ASR systems in practice – especially in comparison to commonly used i-vectors – as experiments have not shown consistent improvements in recognition accuracy. One reason for this is related to the speaker identification training objective for the x-vector network, which implicitly factors out channel information that might be beneficial for adaptation. The optimal objective for speaker embeddings used in ASR differs from the objective used in speaker verification.

Summary networks [60], [122] produce sequence-level summaries of the input features and are closely related to ⋆-vectors (cf. Fig. 2b). Auxiliary features are produced by a neural network that takes as input the same features as the AM, and produces embeddings by taking the time-average of the output. By incorporating the averaging into the graph, the network can be trained jointly with the AM in an end-to-end fashion [122]. A related approach is to produce LHUC feature vectors (Sec. VI) from an independent network with embedded averaging [140].

We also consider speaker-level transformations of the acoustic features as speaker embeddings.
These include methods traditionally viewed as normalisation, such as CMVN and fMLLR, which produce affine transformations of the features:

x′_s = A_s x + b_s        (29)

CMVN derives its name from the application to cepstral features, but corresponds to the standardization of the features to zero mean and unit variance (z-score):

x′_s = (x − μ) / √(σ + ε)        (30)

where μ is the cepstral mean, σ is the cepstral variance, and ε is a small constant for numerical stability.

fMLLR belongs to the family of Maximum Likelihood Linear Regression (MLLR) speaker adaptation methods originally developed for HMM-GMM models [98], but which have later been used with success to transform features for hybrid models [141], [142]. fMLLR obtains feature-space affine transforms by maximising the likelihood of the data, typically using the EM algorithm and HMM-GMM models. The transform may also be estimated using a neural network trained to estimate fMLLR features [143] (structurally similar transforms estimated using the main objective function are discussed in Sec. VI). Instead of transforming the input features, some work has explored fMLLR features as an additional, auxiliary, feature stream alongside the standard features in order to improve robustness to mismatched transforms [127], or to obtain speaker-adapted features derived from GMM log-likelihoods [144], otherwise known as GMM-derived features.

VTLN is a physiologically motivated feature transformation technique [85], [86], [88], [145] which aims to control for varying vocal tract lengths between speakers by adjusting the filterbank in feature extraction. Typically, a piecewise linear warping function is used, which requires a single warping factor parameter. This can be estimated using any AM with a line search. Alternatively there are a range of techniques called linear-VTLN which obtain a corresponding affine transform similar to fMLLR, but choosing from a fixed set of transforms at test time (e.g. [89]). A related idea is that of the exponential transform [146], which forgoes any notion of vocal tract length, but akin to VTLN is controlled by a single parameter. More recently, adaptation of learnable filterbanks, operating as the first layer in a deep network, has resulted in updates which compensate for vocal tract length differences between speakers [147].

The embedding method is also helpful for the adaptation of E2E systems. Fan et al [116] generated a soft embedding vector by combining a set of i-vectors from multiple speakers with the combination weight calculated from the attention mechanism. The soft embedding vector is appended to the acoustic encoder output of the E2E model, helping the model to normalize speaker variations. In addition to acoustic embeddings, E2E models can also leverage text embeddings to improve their modeling accuracy. For example, E2E models can be optimized to produce outputs relevant to the particular recognition context, for instance user contacts or device location. One solution is to add a context bias encoder in addition to the original audio encoder in E2E models [117]–[119]. This bias encoder takes a list of biasing phrases as the input. The context vector of the biasing list is generated using the attention mechanism, and is then concatenated with the context vector of the acoustic encoder and fed into the decoder.

VI. STRUCTURED TRANSFORMS
Fig. 2: (a) Bottleneck feature extraction that uses a pretrained speaker classifier. (b) Summary network extracting speaker embeddings which is trained jointly with the acoustic model.

Methods to adapt the parameters θ of a neural network based acoustic model f(x; θ) can be split into two groups. The first group adapts the whole acoustic model or some of its layers [108], [109], [148]. The second group employs structured transformations [104] to transform input features x, hidden activations h or outputs y of the acoustic model. Such transformations include the linear input network (LIN) [149], linear hidden network (LHN) [150] and the linear output network (LON) [151]. These transforms are parameterized with a transformation matrix A_s ∈ R^{n×n} and a bias b_s ∈ R^n. The transformation matrix A_s is initialized as an identity matrix and the bias b_s is initialized as a zero vector prior to speaker adaptation. The adapted hidden activations then become

h′ = A_s h + b_s .        (31)

However, even a single transformation matrix A_s can contain many speaker dependent parameters, making adaptation susceptible to overfitting to the adaptation data. It also limits its practical usage in real world deployment because of memory requirements related to storing speaker dependent parameters for each speaker. Therefore there has been considerable research into how to structure the matrix A_s and the bias b_s to reduce the number of speaker dependent parameters.

The first set of approaches restricts the adaptation matrix A_s to be diagonal. If we denote the diagonal elements as r_s = diag(A_s), then the adapted hidden activations become

h′_i = r_s ⊙ h_i + b_s .        (32)

There are several methods that belong to this set of adaptation methods. Learning Hidden Unit Contributions (LHUC) [71], [107] adapts only the parameters r_s:

h′_i = r_s ⊙ h_i .        (33)

Speaker Codes [152], [153] prepend an adaptation neural network to an existing SI model in place of the input features. The adaptation network – which operates somewhat similarly to control networks, described below – uses the acoustic features as inputs, as well as an auxiliary low-dimensional speaker code which essentially adapts speaker dependent biases within the adaptation network:

h′_i = h_i + b_s .        (34)

The network and speaker codes are learned by back-propagating through the frozen SI network with transcribed training data. At test time the speaker codes are derived by freezing all but the speaker code parameters and back-propagating on a small amount of adaptation data.

Similarly, Wang and Wang [154] proposed a method that adapts both r_s and b_s as parameters β_s and γ_s of a batch normalization layer, adapting both the scale and the offset of the hidden layer activations with mean μ and standard deviation σ:

h′ = γ_s (h − μ) / σ + β_s .        (35)

Mana et al [155] showed that batch normalization layers can also be updated by recomputing the statistics μ and σ in an online fashion.

A similar approach with a low memory footprint adapts the activation functions instead of the scale r_s and offset b_s. Zhang and Woodland [156] proposed the use of parameterised sigmoid and ReLU activation functions.
With the parameterised sigmoid function, hidden activations h_i are computed from hidden pre-activations z_i as

h_i = η_s · 1 / (1 + e^{−γ_s z_i}) + ζ_s ,        (36)

where η_s, γ_s and ζ_s are speaker dependent parameters: |η_s| controls the scale of the hidden activations, γ_s controls the slope of the sigmoid function and ζ_s controls the midpoint of the sigmoid function. Similarly, the parameterised ReLU activation is defined as

h_i = α_s z_i if z_i > 0;  β_s z_i if z_i ≤ 0 ,        (37)

where α_s and β_s are speaker dependent parameters that correspond to slopes for positive and negative pre-activations, respectively.

Other approaches factorize the transformation matrix A_s into a product of low-rank matrices to obtain a compact set of speaker dependent parameters. Zhao et al [157] proposed the Low-Rank Plus Diagonal (LRPD) method, which reduces the number of speaker dependent parameters by approximating the linear transformation matrix A_s ∈ R^{n×n} as

A_s ≈ D_s + P_s Q_s ,        (38)

where D_s ∈ R^{n×n}, P_s ∈ R^{n×k} and Q_s ∈ R^{k×n} are treated as speaker dependent matrices (k < n) and D_s is a diagonal matrix. This approximation was motivated by the assumption that the adapted hidden activations should not be very different from the unadapted hidden activations when only a limited amount of adaptation data is available; hence the adaptation linear transformation should be close to a diagonal matrix. In fact, for k = 0 LRPD reduces to LHUC adaptation. LRPD adaptation can be implemented by inserting two hidden linear layers and a skip connection as illustrated in Fig. 3b.

Zhao et al [158] later presented an extension to LRPD called
Zhao et al. [158] later presented an extension of LRPD, called Extended LRPD (eLRPD), which removes the dependency of the number of speaker-dependent parameters on the hidden layer size by performing a different approximation of the linear transformation matrix $A_s$,
$$A_s \approx D_s + P T_s Q, \quad (39)$$
where the matrices $D_s \in \mathbb{R}^{n \times n}$ and $T_s \in \mathbb{R}^{k \times k}$ are treated as speaker-dependent, and the matrices $P \in \mathbb{R}^{n \times k}$ and $Q \in \mathbb{R}^{k \times n}$ are treated as speaker-independent. Thus the number of speaker-dependent parameters depends mostly on $k$, which can be chosen arbitrarily.

Instead of factorizing the transformation matrix, a technique typically known as feature-space discriminative linear regression (fDLR) [141], [159], [160] imposes a block-diagonal structure such that each input frame shares the same linear transform. This is, in effect, a tied variant of LIN with a reduction in the number of speaker-dependent parameters.

Another set of approaches uses the speaker-dependent parameters as mixing coefficients $\alpha_s$ for a set of bases $B_i$ which factorize the transformation matrix $A_s$. Samarakoon and Sim [161], [162] proposed factorized hidden layers (FHL) that allow both speaker-independent and speaker-dependent modelling. With this approach, the activations of a hidden layer $h$ with activation function $\sigma$ are computed as
$$h = \sigma\Big(\big(W + \sum_{i=0}^{k} [\alpha_s]_i B_i\big) x + b_s + b\Big). \quad (40)$$
Note that when $\alpha_s = 0$ and $b_s = 0$, the activations correspond to a standard speaker-independent model. If the bases $B_i$ are rank-1 matrices, $B_i = \gamma_i \psi_i^T$, then (40) can be reparameterized as [162]:
$$h = \sigma\Big(\big(W + \sum_{i=0}^{k} [\alpha_s]_i \gamma_i \psi_i^T\big) x + b_s + b\Big) = \sigma\big((W + \Gamma D \Psi^T) x + b_s + b\big), \quad (41)$$
where $D = \mathrm{diag}(\alpha_s)$. This approach is very similar to the factorization of hidden layers used for Cluster Adaptive Training of DNN networks (CAT-DNN) [12], which uses full-rank bases instead of rank-1 bases. Similarly, Delcroix et al. [123] proposed to adapt the activations of a hidden layer using a mixture of experts [163]. The adapted hidden unit activations are then
$$h' = \sum_{i=0}^{k} [\alpha_s]_i B_i h. \quad (42)$$
There have also been approaches that further reduce the number of speaker-dependent parameters, removing the dependency on the hidden layer width, by using control networks that predict the speaker-dependent parameters:
$$r_s = c_r(z_s, \theta_r), \quad (43)$$
$$b_s = c_b(z_s, \theta_b). \quad (44)$$
In contrast to the adaptation network used in the Speaker Codes scheme, the control networks themselves are speaker-independent, taking as input a lower-dimensional speaker-dependent representation $z_s \in \mathbb{R}^{k}$, typically a speaker embedding. As such, they form a link between structured transforms and the embedding-based approaches of Sec. V. The control networks $c_*(z_s, \theta_*)$ can be implemented as a single linear transformation or as a multi-layer neural network. These control networks are similar to the conditional affine transformations referred to as Feature-wise Linear Modulation (FiLM) [164]. For example, Subspace LHUC [165] uses a control network to predict the LHUC parameters $r_s$ from i-vectors $\lambda_s$, resulting in a reduced memory footprint compared to standard LHUC adaptation. Cui et al. [166] used auxiliary features to adapt both the scale $r_s$ and offset $b_s$.
Other approaches adapt the scale $r_s$ or the offset $b_s$ by leveraging information extracted with summary networks instead of auxiliary features [167]–[169]. Finally, the number of speaker-dependent parameters in all the aforementioned linear transformations can be reduced by applying them to bottleneck layers with much lower dimensionality than the standard hidden layers. These bottleneck layers can be obtained directly by training a neural network with bottleneck layers, or by applying Singular Value Decomposition (SVD) to the hidden layers [170], [171].
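As a rough illustration of the control-network idea linking structured transforms with embeddings (Eqs. 43–44), the sketch below predicts FiLM-like scale and offset parameters from a speaker embedding. The network shape and the residual-style parameterization around an identity scale are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ControlNetwork(nn.Module):
    """Speaker-independent control network predicting r_s and b_s from an embedding z_s."""
    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(emb_dim, hidden_dim)   # c_r(z_s; theta_r)
        self.to_offset = nn.Linear(emb_dim, hidden_dim)  # c_b(z_s; theta_b)

    def forward(self, h: torch.Tensor, z_s: torch.Tensor) -> torch.Tensor:
        # Predict a perturbation around an identity scale so that z_s = 0 leaves h unchanged.
        r_s = 1.0 + self.to_scale(z_s)
        b_s = self.to_offset(z_s)
        return r_s * h + b_s
```

Because the control network is shared across speakers, only the low-dimensional embedding needs to be stored per speaker, which is the memory advantage noted above.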
Fig. 3: Structured transforms of an adaptation matrix $A_s$: (a) Learning Hidden Unit Contributions (LHUC) adapts only the diagonal elements of the transformation matrix, $r_s = \mathrm{diag}(A_s)$; (b) Low-Rank Plus Diagonal factorizes the adaptation matrix as $A_s \approx D_s + P_s Q_s$; (c) Extended LRPD factorizes the adaptation matrix as $A_s \approx D_s + P T_s Q$.

VII. REGULARIZATION METHODS

Even with the small number of speaker-dependent parameters required by structured transformations, speaker adaptation can still overfit to the adaptation data.
One way to prevent this overfitting is through the use of regularization methods that prevent the adapted model from diverging too far from the original model. This can be achieved using early stopping and appropriate learning rates, which can be obtained with a hyper-parameter grid search or by meta-learning [172], [173]. Another way is to limit the distance between the original and the adapted model. Liao [108] proposed to use an L2 regularization loss on the distance between the original parameters $\theta_s$ and the adapted speaker-dependent parameters $\theta'_s$:
$$\mathcal{L}_{L2} = \|\theta_s - \theta'_s\|_2^2. \quad (45)$$
Yu et al. [109] proposed to use the Kullback-Leibler (KL) divergence to measure the distance between the senone distributions of the adapted model and the original model:
$$\mathcal{L}_{KL} = D_{KL}\big(f(x; \theta) \,\|\, f(x; \theta'_s)\big). \quad (46)$$
If we consider the overall adaptation loss using cross-entropy,
$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{xent} + \lambda\,\mathcal{L}_{KL}, \quad (47)$$
it can be shown that this loss equals the cross-entropy with the target distribution
$$P(Y|X) = (1 - \lambda)\,\hat{P}(Y|X) + \lambda\, f(x; \theta), \quad (48)$$
where $\hat{P}(Y|X)$ is the distribution corresponding to the provided labels $y_{adapt}$. Although initially proposed for adapting hybrid models, the KLD regularization method may also be used for speaker adaptation of E2E models [112], [113], [174].
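The KLD-regularized adaptation loss of Eqs. (46)–(48) can be written compactly as follows. The tensor shapes (frame-by-senone logits and integer frame labels) and the interpolation weight are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def kld_regularised_loss(adapted_logits, si_logits, targets, rho=0.5):
    """Cross-entropy adaptation loss with KL regularisation towards the SI model.

    adapted_logits, si_logits: [num_frames, num_senones]; targets: [num_frames] integer labels.
    Equivalent to cross-entropy against (1 - rho) * one-hot + rho * SI posteriors.
    """
    log_probs = F.log_softmax(adapted_logits, dim=-1)
    si_probs = F.softmax(si_logits, dim=-1).detach()     # SI model is kept frozen
    ce = F.nll_loss(log_probs, targets)
    kl = F.kl_div(log_probs, si_probs, reduction="batchmean")
    return (1.0 - rho) * ce + rho * kl
```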
Meng et al. [175] noted that the KL divergence is not a distance metric between distributions because it is asymmetric, and therefore proposed to use adversarial learning, which guarantees that the local minimum of the regularization term is reached only if the senone distributions of the speaker-independent and speaker-dependent models are identical. They achieve this by adversarially training a discriminator $d(x; \phi)$ whose task is to discriminate between speaker-dependent deep features $h'$ and speaker-independent deep features $h$, obtained by passing the input adaptation frames through the speaker-dependent and speaker-independent feature extractors respectively. This process is illustrated in Fig. 4.

Fig. 4: Adversarial speaker adaptation.

The regularization loss of the discriminator is
$$\mathcal{L}_{disc} = -\log d(h; \phi) - \log\big[1 - d(h'; \phi)\big], \quad (49)$$
where $h$ are the hidden layer activations of the speaker-independent model and $h'$ are the hidden layer activations of the adapted model. The discriminator is trained in a minimax fashion during adaptation by minimizing $\mathcal{L}_{disc}$ with respect to $\phi$ and maximizing $\mathcal{L}_{disc}$ with respect to $\theta_s$. Consequently, the distribution of activations of the $i$-th hidden layer of the speaker-dependent model will be indistinguishable from that of the speaker-independent model, which ought to result in more robust speaker adaptation.
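A common way to realize this minimax training is a gradient reversal layer. The following is a generic PyTorch sketch of that building block, not code from [175]; the scaling factor lam is a tunable assumption.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradientReversal.apply(x, lam)

# In adversarial speaker adaptation the discriminator would consume grad_reverse(h'),
# so that minimizing the discrimination loss pushes the adapted features h'
# towards the speaker-independent feature distribution.
```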
Other approaches aim to prevent overfitting by leveraging the uncertainty of the speaker-dependent parameter space. Huang et al. [176] proposed Maximum A Posteriori (MAP) adaptation of neural networks, inspired by MAP adaptation of GMM-HMM models [79] (Sec. III). MAP adaptation estimates the speaker-dependent parameters as a mode of the distribution
$$\hat{\theta}_s = \arg\max_{\theta_s} P(Y|X, \theta_s)\, p(\theta_s), \quad (50)$$
where $p(\theta_s)$ is a prior density of the speaker-dependent parameters. In order to obtain this prior density, Huang et al. [176] employed an empirical Bayes approach (following Gauvain and Lee [79]) and treated each speaker in the training data as a data point. They performed speaker adaptation for each speaker and observed that the adapted parameters across speakers resemble Gaussians. Therefore they parameterised the prior density $p(\theta_s)$ as
$$p(\theta_s) = \mathcal{N}(\theta_s; \mu, \Sigma), \quad (51)$$
where $\mu$ is the mean of the adapted speaker-dependent parameters across different speakers, and $\Sigma$ is the corresponding diagonal covariance matrix. With this parameterisation the regularization term arising from the prior density $p(\theta_s)$ is
$$\mathcal{L}_{MAP} = \frac{1}{2} (\theta_s - \mu)^T \Sigma^{-1} (\theta_s - \mu), \quad (52)$$
which for the prior density $p(\theta_s) = \mathcal{N}(\theta_s; 0, I)$ degenerates to the L2 regularization loss. Huang et al. investigated their proposed MAP approach with LHN structured transforms, but noted that it may be used in combination with other schemes.

Xie et al. [177] proposed a fully Bayesian way of dealing with the uncertainty inherent in the speaker-dependent parameters $\theta_s$, in the context of estimating the LHUC parameters $r_s$ (see Sec. VI). In this method, known as BLHUC, the posterior distribution of the adapted model is approximated as
$$P(Y|X, \mathcal{D}_{adapt}) = \int P(Y|X, r_s)\, p(r_s | \mathcal{D}_{adapt})\, dr_s \approx P\big(Y|X, \mathbb{E}[r_s | \mathcal{D}_{adapt}]\big). \quad (53)$$
Xie et al. propose to use a distribution $q(r_s)$ as a variational approximation of the posterior distribution of the LHUC parameters, $p(r_s | \mathcal{D}_{adapt})$. For simplicity, they assume that both $q(r_s)$ and $p(r_s)$ are normal, such that $q(r_s) = \mathcal{N}(r_s; \mu_s, \gamma_s)$ and $p(r_s) = \mathcal{N}(r_s; \mu, \gamma)$, which results in the expectation for the speaker-dependent parameters in (53) being given by
$$\mathbb{E}[r_s | \mathcal{D}_{adapt}] = \mu_s. \quad (54)$$
The parameters are computed using gradient descent with a Monte Carlo approximation. Similarly to MAP adaptation, the effect is to force the adaptation to stay close to the speaker-independent model when adapting with a small amount of adaptation data.
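For reference, the MAP regularization term of Eq. (52) under a diagonal-Gaussian empirical prior reduces to a weighted squared distance; the helper below is a simple sketch with assumed tensor shapes.

```python
import torch

def map_prior_penalty(theta_s, mu, sigma_diag):
    """Negative log of a diagonal-Gaussian prior over speaker-dependent parameters (Eq. 52),
    up to an additive constant. With mu = 0 and sigma_diag = 1 this reduces to plain L2."""
    return 0.5 * torch.sum((theta_s - mu) ** 2 / sigma_diag)
```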
VIII. VARIANT OBJECTIVE FUNCTIONS

Another challenge in speaker adaptation is overfitting to the targets seen in the adaptation data and to errors in semi-supervised transcriptions. This issue can be mitigated by an appropriate choice of objective function. Gemello et al. [150] proposed Conservative Training, which modifies the target distribution to ensure that labels not seen in the adaptation data will not be catastrophically forgotten.
Fig. 5: Multi-task learning speaker adaptation.
The adjusted target distribution is defined as
$$p(y_i | x) = \begin{cases} f(x; \theta)_i & \text{if } y_i \in U \\ 1 - \sum_{y_j \in U} f(x; \theta)_j & \text{if } y_i \in S \wedge \text{correct} \\ 0 & \text{if } y_i \in S \wedge \neg\text{correct}, \end{cases} \quad (55)$$
where $S$ is the set of labels seen in the adaptation data and $U$ is the set of labels not seen in the adaptation data.

To mitigate errors in semi-supervised transcriptions we can replace the transcriptions with a lattice of supervision, which encodes the uncertainty arising from the first-pass decoding. Lattice supervision has previously been used in work on unsupervised adaptation [72] and training [73] of GMMs, as well as discriminative [178] and semi-supervised training [74], and adaptation [75], of neural network models. For instance, lattice supervision can be used with the MMI criterion:
$$\mathcal{F}_{MMI}(\lambda) = \sum_{r=1}^{R} \log \frac{p_\lambda(X_r \mid \mathcal{M}^{num}_r)}{p_\lambda(X_r \mid \mathcal{M}^{den}_r)}, \quad (56)$$
where $\mathcal{M}^{num}_r$ is a numerator lattice containing multiple hypotheses from a first-pass decoding and $\mathcal{M}^{den}_r$ is a denominator lattice containing all possible word sequences.

Another family of methods prevents overfitting to the adaptation targets by performing adaptation through a lower entropy task, such as monophone or senone-cluster targets. This has the advantage that the unsupervised targets may be less noisy, and that the targets have higher coverage even with small amounts of adaptation data. Price et al. [179] proposed to append a new output layer predicting monophone targets on top of the original output layer predicting senones. The layer can be either full rank or sparse, leveraging knowledge of the relationships between monophones and senones. Its parameters are trained on the training data with a fixed speaker-independent model. Only the monophone targets are used for the adaptation of the speaker-dependent parameters. Huang et al. [110] presented an approach that used multi-task learning [180] to leverage both senone and monophone/senone-cluster targets. It worked by having multiple output layers, each on top of the last hidden layer, predicting the corresponding targets. These additional output layers were also trained after a complete training pass of the speaker-independent model, with its parameters fixed. Thus, the adaptation loss was a weighted sum of the individual losses, for example the monophone and senone losses (Fig. 5). Swietojanski et al. [181] combined these two approaches and used multi-task learning for speaker adaptation through a structured output layer, which predicts both monophone and senone targets. Unlike the approach of Price et al. [179], the monophone predictions are used for the prediction of senones.

Li et al. [112] and Meng et al. [113] applied multi-task learning to speaker adaptation of CTC and AED models. These E2E models typically use subword units, such as word-piece units, as the output targets in order to achieve high recognition accuracy. The number of subword units is usually on the scale of thousands or more. Given very limited speaker-specific adaptation data, these units may not be fully covered. Multi-task learning using both character and subword units can significantly alleviate such sparseness issues.
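A hedged sketch of the multi-task adaptation loss used in these approaches (a weighted sum of senone-level and monophone-level cross-entropy, cf. Fig. 5) is shown below; the branch names and the weighting scheme are illustrative assumptions.

```python
import torch.nn.functional as F

def multitask_adaptation_loss(senone_logits, mono_logits, senone_targets, mono_targets, alpha=0.5):
    """Weighted sum of senone-level and monophone-level cross-entropy losses.

    The monophone branch provides denser targets when only a little adaptation
    data is available; alpha balances the two tasks.
    """
    senone_loss = F.cross_entropy(senone_logits, senone_targets)
    mono_loss = F.cross_entropy(mono_logits, mono_targets)
    return alpha * senone_loss + (1.0 - alpha) * mono_loss
```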
IX. DATA AUGMENTATION

Data augmentation has proven to be an effective way to decrease the acoustic mismatch between training and testing conditions. Data augmentation approaches supplement the training data with distorted or synthetic variants of speech whose characteristics resemble the target acoustic environment, for instance with reverberation or interfering sound sources. Thanks to realistic room acoustic simulators [182] one can generate large numbers of room impulse responses and reuse clean corpora to create multiple copies of the same sentence under different acoustic conditions [183]–[185].

Similar approaches have been proposed for increasing robustness in the speaker space by augmenting training data with, typically label-preserving, speaker-related distortions or transforms. Examples include creating multiple copies of clean utterances with perturbed VTL warp factors [186], [187], augmenting related properties such as volume or speaking rate [21], [188], [189] (a toy example is sketched at the end of this section), or voice-conversion [190] inspired transformations of speech uttered by one speaker into another speaker using stochastic feature mapping [187], [191], [192]. While voice conversion does not create any new data with respect to unseen acoustic / linguistic complexity (just replicas of the utterances with different voices, often from the same dataset), recent advances in text-to-speech (TTS) allow the rapid building of new multi-speaker TTS voices [193] from small amounts of data. TTS may then be used to arbitrarily expand the adaptation set for a given speaker, possibly to cover unseen acoustic domains [111], [115]. If TTS is coupled with a related natural language generation module, it is possible to generate speech for domain-related texts. In this way, speaker adaptation uses more data, not only from the speaker's original speech but also from the TTS speech. Because the transcription used for TTS generation is also used for model adaptation, this approach also circumvents the obstacle of hypothesis errors in unsupervised adaptation. Moreover, TTS-generated data can also help to adapt E2E models to a new domain whose content differs more markedly from the source domain, as discussed in Sec. XI.

Finally, for unbalanced data sets the acoustic models may under-perform for certain demographics that are not sufficiently represented in the training data. There is an ongoing effort to address this using generative adversarial networks (GANs). For example, Hosseini-Asl et al. [194] used GANs with a cycle-consistency constraint [195] to balance the speaker ratios with respect to gender representation in the training set.
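As a toy example of the label-preserving perturbations of speaking rate and volume mentioned above, the following NumPy sketch modifies a waveform. Real systems typically use dedicated tools (e.g. sox-style resampling or VTLP), so this is purely illustrative.

```python
import numpy as np

def perturb_utterance(waveform: np.ndarray, speed: float = 1.1, gain_db: float = -3.0) -> np.ndarray:
    """Create a label-preserving copy of an utterance with altered speaking rate and volume.

    Speed perturbation is done by naive linear-interpolation resampling (which also
    shifts the pitch, as in common speed-perturbation recipes); gain is applied in dB.
    """
    n_in = len(waveform)
    n_out = int(round(n_in / speed))
    positions = np.linspace(0, n_in - 1, num=n_out)
    resampled = np.interp(positions, np.arange(n_in), waveform)
    return resampled * (10.0 ** (gain_db / 20.0))
```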
X. ACCENT ADAPTATION
Although there is a significant literature on automatic dialect identification from speech (e.g. [196]), there has been less work on accent- and dialect-adaptive speech recognition systems. The MGB-3 [197] and MGB-5 [198] evaluation challenges have used dialectal Arabic test sets, with a modern standard Arabic (MSA) training set, using broadcast and internet video data. The best results reported on these challenges have used a straightforward model-based transfer learning approach in an LF-MMI framework, adapting MSA-trained baseline systems to specific Arabic dialects [199], [200].

Much of the reported work on accent adaptation has taken approaches developed for speaker adaptation and applied them using an adaptation set of utterances from the target accent. For instance, Vergyri et al. [201] used MAP adaptation of a GMM/HMM system. Zheng et al. [202] used both MAP and MLLR adaptation, together with features selected to be discriminative towards accent, with the accent adaptation controlled using hard decisions made by an accent classifier. Earlier work on accent adaptation focused on automatic adaptation of the pronunciation dictionary [203], [204]. These approaches resemble approaches for acoustic adaptation of VQ codebooks (discussed in Sec. III), in that they learn an accent-specific transition matrix between the phonemic symbols in the dictionary. Selection of utterances for accent adaptation has also been explored, with Nallasamy et al. [205] proposing an active learning approach.

Approaches to accent adaptation of neural network-based systems have typically employed accent-dependent output layers and shared hidden layers [206], [207], based on an approach similar to the multilingual training of deep neural networks [208]–[210]. Huang et al. [206] combined this with KL regularization (Sec. VII), and Chen et al. [207] used accent-dependent i-vectors (Sec. V); Yi et al. [211] used accent-dependent bottleneck features in place of i-vectors; and Turan et al. [212] used x-vector accent embeddings in a semi-supervised setting.

Multi-task learning approaches, where the secondary task is accent/dialect identification, have been explored by a number of researchers [213]–[217] in the context of both hybrid and end-to-end models. Improvements with multi-task training were observed in some instances, but the evidence indicates that it gives only a small adaptation gain. Sun et al. [218] replaced multi-task learning with domain adversarial learning (Sec. VIII), in which the objective function treated accent identification as an adversarial task, finding that this improved accented speech recognition over multi-task learning.
More successfully, Li et al. [219] explored learning multi-dialect sequence-to-sequence models using one-hot dialect information as an additional input. Grace et al. [220] also used one-hot dialect codes, and additionally explored a family of cluster adaptive training and hidden layer factorization approaches. In both cases, using one-hot dialect codes as an input augmentation (corresponding to bias adaptation) proved to be the best approach, and cluster-adaptive approaches did not result in a consistent gain. These approaches were extended by Yoo et al. [221] and Viglino et al. [217], who both explored the use of dialect embeddings for multi-accent end-to-end speech recognition. Ghorbani et al. [222] used accent-specific teacher-student learning, and Jain et al. [223] explored a mixture of experts (MoE) approach, using mixtures of experts at both the phonetic and accent levels.

Yoo et al. [221] also applied a method of feature-wise affine transformations on the hidden layers (FiLM) that are dependent both on the network's internal state and on the dialect/accent code (discussed in Sec. VI). This approach, which can be viewed as conditioned normalization, differs from the previous use of one-hot dialect codes and multi-task learning in that its goal is to learn a single normalized model rather than an implicit combination of specialist models. A related approach is gated accent adaptation [224], although this focused on a single transformation conditioned on accent. Winata et al. [225] experimented with a meta-learning approach for few-shot adaptation to accented speech, where the meta-learning algorithm learns a good initialization and hyperparameters for the adaptation.
XI. DOMAIN ADAPTATION
The performance of automatic speech recognition (ASR) always drops significantly when the recognition model is evaluated in a mismatched new domain. Domain adaptation is the technology used to adapt a well-trained source-domain model to the new domain. The most straightforward way is to collect and label data in the new domain and fine-tune the model. Most adaptation technologies discussed in this paper can also be applied to domain adaptation [148], [226]–[228]. In the following, we focus on technologies more specific to domain adaptation.

While conventional adaptation techniques require large amounts of labeled data in the target domain, the teacher-student (T/S) paradigm [229], [230] can better take advantage of large amounts of unlabeled data and has been widely used for industrial-scale tasks [231], [232]. The most popular T/S learning strategy was proposed in 2014 by Li et al. [229] to minimize the KL divergence between the output posterior distributions of the teacher network and the student network. This can also be considered as learning the soft targets generated by a teacher model instead of 1-hot hard targets:
$$-\sum_{t=1}^{T} \sum_{y=1}^{N} P_T(s_t = y \mid x_t) \log P_S(s_t = y \mid x_t), \quad (57)$$
where $P_T$ and $P_S$ are the posteriors of the teacher and student networks, $x_t$ and $s_t$ are the input speech and senone at time $t$, respectively, $T$ is the total number of speech frames in an utterance, and $N$ is the number of senones in the network output layer. Later, Hinton et al. [230] proposed knowledge distillation by introducing a temperature parameter (as in chemical distillation) to scale the posteriors. This has been applied to speech by, e.g., Asami et al. [233]. There are also variations such as learning an interpolation of soft and hard targets [230] and conditional T/S learning [234]. Although initially proposed for model compression, T/S learning is also widely used for model adaptation if the source and target signals are frame-synchronized, which can be realized by simulation. The loss function is [7]
$$-\sum_{t=1}^{T} \sum_{y=1}^{N} P_T(s_t = y \mid x_t) \log P_S(s_t = y \mid \hat{x}_t), \quad (58)$$
where $x_t$ is the source speech signal and $\hat{x}_t$ is the frame-synchronized target signal.

The biggest advantage of T/S learning is that it can leverage large amounts of unlabeled data by using the soft labels $P_T(s_t = y \mid x_t)$. This is particularly useful in industrial setups where effectively unlimited unlabeled data is available [231], [232]. Furthermore, the soft labels produced by the teacher network carry knowledge learned by the teacher about the difficulty of classifying each sample, which hard labels do not contain. Such knowledge helps the student to generalize better, especially when the adaptation data size is small.
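Eq. (58) corresponds to a simple frame-level distillation loss; a sketch in PyTorch is given below, where logits for parallel source- and target-domain frames are assumed to be available from the (frozen) teacher and the student respectively.

```python
import torch.nn.functional as F

def teacher_student_loss(student_logits, teacher_logits):
    """Frame-level teacher-student loss: cross-entropy of student posteriors against the
    teacher's soft labels; no transcription is needed.

    teacher_logits are computed on the source-domain signal x_t and
    student_logits on the parallel (simulated) target-domain signal x̂_t.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```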
One constraint on T/S adaptation is that it requires paired source and target domain data. While the paired data can be obtained by simulation in most cases, there are scenarios in which it is hard to simulate the target domain data from the source domain data. For example, simulation of children's speech or accented speech remains challenging. In [138], a neural label embedding scheme was proposed for domain adaptation with unpaired data. A label embedding, or l-vector, represents the output distribution of the deep network trained in the source domain for each output token, e.g., senone. To adapt the deep network model to the target domain, the l-vectors learned from the source domain are used as the soft targets in the cross-entropy criterion.

It is usually hard to obtain transcriptions in the target domain, and therefore unsupervised adaptation is critical. Although transcriptions can be generated by decoding the target-domain data with the source-domain model, the generated hypothesis quality is often poor given the domain mismatch. Recently, adversarial training has been applied to unsupervised domain adaptation in the form of multi-task learning [235], without the need for transcriptions in the target domain. Unsupervised adaptation is achieved by learning deep intermediate representations that are both discriminative for the main task on the source domain and invariant with respect to the mismatch between source and target domains. Domain invariance is achieved by adversarial training of the domain classification objective function using a gradient reversal layer (GRL) [235]. This GRL approach has been applied to acoustic models for unsupervised adaptation in [236]–[238].

There is also increasing interest in the use of GANs with cycle-consistency constraints for domain adaptation [239]–[241]. This enables the use of non-parallel data without labels in the target domain, by learning to map the acoustic features into the style of the target domain for training. The cycle-consistency constraint also provides the possibility of mapping features from the target to the source style for, in effect, test-time adaptation or speech enhancement.
Fig. 6: Adversarial T/S learning.

Meng et al. [242] combine adversarial learning and T/S learning as adversarial T/S learning, shown in Fig. 6, to improve robustness against condition variability during adaptation. When only the left side of the figure is kept, adversarial T/S learning reduces to T/S learning. If the teacher network is removed and the main network consumes source-domain data and its ground-truth labels, then adversarial T/S learning reduces to adversarial learning.

E2E models tend to memorize the training data well, and therefore may not generalize well to a new domain. Meng et al. [243] proposed T/S learning for the domain adaptation of E2E models. The loss function is
$$-\sum_{l=1}^{L} \sum_{y=1}^{N} P_T(u_l = y \mid U_{l-1}, X) \log P_S(u_l = y \mid U_{l-1}, \hat{X}), \quad (59)$$
where $X$ and $\hat{X}$ are the source and target domain speech sequences, and $U$ is the label sequence of length $L$, which is either the ground truth in the supervised adaptation setup or the hypothesis generated by decoding with the teacher model on $X$ in the unsupervised adaptation setup. Note that in the unsupervised case there are two levels of knowledge transfer: the teacher's token posteriors (used as soft labels) and the one-best predictions used as decoder guidance.

While domain adaptation most often focuses on adaptation to a new acoustic environment, there are scenarios in which the new domain differs from the source domain mainly in content. In such situations, adapting the language model (LM) is more effective. Because E2E models usually contain a sub-network that plays the role of the LM in a traditional hybrid system, it is possible to adapt E2E models to a new domain using only domain-specific text data. In [120], [121], RNN-T models were adapted to a new domain with TTS data generated from domain-specific text. Because the prediction network in RNN-T works similarly to an LM, adapting it without updating the acoustic encoder was shown to be more effective than interpolating the RNN-T model with an external LM trained on the domain-specific text [121].

XII. LANGUAGE MODEL ADAPTATION
LM adaptation typically involves updating an LM estimatedfrom a large general corpus, with data from a target domain.Many approaches to LM adaptation were developed in the con-text of n-gram models, and are reviewed by Bellegarda [244].Hybrid NN/HMM speech recognition systems still make useof n-gram language models and a finite state structure, at leastin the first pass; it is difficult to use neural network LMs (withinfinite context) directly in first pass decoding in such systems.Neural network LMs are typically used to rescore lattices inhybrid systems, or may be combined (in a variety of ways) inend-to-end systems.The main techniques for n-gram language model adaptationinclude interpolation of multiple language models [245]–[247], updating the model using a cache of recently observed(decoded) text [245], [248]–[250], or merging or interpolatingn-gram counts from decoded transcripts [251]. There is alsoa large body of work incorporating longer scale context, forinstance modelling the topic and style of the recorded speech[252]–[255]. LM adaptation approaches making use of widercontext have often built on approaches using unigram statisticsor bag-of-words models, and a number of approaches forcombination with n-gram models have been proposed, forexample dynamic marginals [256].Neural network language modelling [257] has become state-of-the-art, in particular recurrent neural network languagemodels (RNNLMs) [258]. There has been a range of work onadaptation of RNNLMs, including the use of topic or genreinformation as auxiliary features [259], [260] or combinedas marginal distributions [261], domain specific embeddings[262], and the use of curriculum learning and fine-tuning totake account of shifting contexts [263], [264]. Approachesbased on acoustic model adaptation, such as LHUC [264] andLHN [260], have also been explored.There have a been a number of approaches to apply theideas of cache language model adaptation to neural networklanguage models [261], [265], [266], along with so-calleddynamic evaluation approaches in which the recent contextis used for fine tuning [261], [267].E2E models are trained with paired speech and text data.The amount of text data in such a paired setup is much smallerthan the amount of text data used in training a separate externalLM. Therefore, it is popular to adjust E2E models by fusingthe external LM trained with a large amount of text data. Thesimplest and most popular approach is shallow fusion [268],in which the external LM is interpolated log-linearly with theE2E model at inference time only.However, shallow fusion does not have a clear probabilisticinterpretation. McDermott et al [269] proposed a density ratioapproach based on Bayes’ rule. An LM is built on texttranscripts from the training set which has paired speech and
text data, and a second LM is built on the target domain. When decoding on the target domain, the output of the E2E model is modified by the ratio of the target and training LMs. While it is well grounded in Bayes' rule, the density ratio method requires the training of two separate LMs, from the training and target data respectively. Variani et al. [270] proposed the hybrid autoregressive transducer (HAT) model to improve the RNN-T model. The HAT model builds a training-set LM internally, and the label distribution is derived by normalizing the score functions across all labels excluding blank. Therefore, it is mathematically justified to integrate the HAT model with an external or target LM using the density ratio formulation. Other domain adaptation methods for E2E models were discussed in Sec. XI.
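The log-linear combinations used in shallow fusion and in the density ratio method can be summarized by the following sketch. The interface and the weight values are assumptions; a real decoder would apply such a score per partial hypothesis during beam search.

```python
def rescore_hypothesis(e2e_score, target_lm_score, source_lm_score=None,
                       lm_weight=0.3, source_lm_weight=0.3):
    """Log-linear hypothesis scoring for external LM integration with an E2E model.

    With source_lm_score=None this is plain shallow fusion; supplying a training-set
    LM score subtracts it, giving the density-ratio correction. All scores are
    log-probabilities of the same hypothesis; the weights are tuning knobs.
    """
    score = e2e_score + lm_weight * target_lm_score
    if source_lm_score is not None:
        score -= source_lm_weight * source_lm_score
    return score
```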
XIII. META-ANALYSIS
In this section we present an aggregated review of pub-lished results in experiments applying adaptation algorithmsto speech recognition. This differs from typical experimentalreporting that focuses on one-to-one system comparisons typ-ically using a small fixed set of systems and benchmark tasksand data. The proposed meta-analysis approach offers insightsinto the performance of adaptation algorithms that are difficultto capture from individual experiments.We divide this section into four main parts. The first,Sec. XIII-A, explains the protocol and overall assumptions ofthe meta-analysis, followed by a top-level summary of findingsin Sec. XIII-B, with a more detailed analysis in Sec. XIII-C.The final part, Sec. XIII-D, aims to quantify the adaptationperformance across languages, speaking styles and data-sets.
A. Protocol and Literature
The meta-analysis is based on 45 peer-reviewed studies, selected such that they cover a wide range of systems, architectures, and adaptation tasks. Each study was required to compare adaptation results against a baseline, enabling the configurations of interest to be compared quantitatively. Note that the meta-analysis spans several model architectures, languages, and domains; although most studies use word error rate (WER) as the evaluation metric, some used character error rate (CER) or phone error rate (PER). Since we are interested in the relative improvement brought by adaptation, we report Relative Error Rate Reductions (RERR).

The meta-analysis is based on the following studies: [8], [58], [62], [63], [70], [71], [108], [112]–[114], [122], [124], [126], [134], [136], [140], [141], [144], [147], [148], [153], [155], [162], [172]–[174], [189], [206], [207], [211], [222], [224], [225], [231], [243], [271]–[280].

The analysis spans 33 data-sets (more than 50 unique {train, test} pairings), 23 of which are public and 10 proprietary. These cover different speaking styles, domains, acoustic conditions, applications and languages (though the study is strongly biased towards English resources). The public corpora used include the following: AISHELL2 [281], AMI [282], APASCI [283], Aurora4 [284], CASIA [285], ChildIt [286], Chime4 [287], CSJ [288], ETAPE [289], HKUST [290], MGB [291], RASC863 [292], SWBD [293], TED [294], TED-LIUM [295], TED-LIUM2 [296], TIMIT [297], WSJ [298], PF-STAR [299], Librispeech [300], Intel Accented Mandarin Speech Recognition Corpus [207], and UTCRSS-4EnglishAccent [271].

Fig. 7: Aggregated summary of adaptation RERR from all studies (top), considering a single method only (middle), and two or more methods stacked (bottom). The top graph is annotated to explain the information presented in each of the boxplot graphs in this section.

Overall, the meta-analysis is based on ASR systems trained on datasets of combined duration of over 30,000 hours, while baseline acoustic models were estimated from as little as 5 hours to over 10,000 hours of speech. Adaptation data varies from a few seconds per speaker to over 25,000 hours of acoustic material used for domain adaptation.
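For clarity, the RERR statistic aggregated throughout this section is simply the relative reduction of the error rate with respect to the unadapted baseline, as in the sketch below.

```python
def relative_error_rate_reduction(baseline_er: float, adapted_er: float) -> float:
    """Relative Error Rate Reduction (RERR) in percent.

    Works for WER, CER or PER alike; e.g. a baseline of 20.0% reduced to 18.0%
    gives an RERR of 10.0%.
    """
    return 100.0 * (baseline_er - adapted_er) / baseline_er
```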
B. Overall findings
Fig. 7 (Top) presents the average adaptation gains forall considered systems, adaptation methods, and adaptationclasses. The overall RERR is 9.96% . Since grouping dataacross attributes of interest may result in an unbalanced (orvery sparse) sample sizes, we also report additional statisticssuch as number of samples, datasets and studies the givenstatistic is based on. As can be seen in the right part of theFig. 7 (Top), the results in this review were derived from 337samples produced using 33 datasets reported in 45 studies. Asingle sample is defined as a 1:1 system comparison for whichone can unambiguously state RERR. Likewise, a dataset refersto a particular training corpus configuration. Note that theremay be some data-level overlap between different corporaoriginating from same source ( e.g. TED talks) and we make adistinction for acoustic condition ( e.g.
AMI close-talking anddistant channels are counted as two different data-sets whenthey are used to estimate separate acoustic models). A studyrefers to a single peer-reviewed publication. Although we do not report exact numbers in tabular form due to spacelimitations, both raw data and aggregated statistics for each figure in thisreview will be made available on github and IEEEDataPort prior to publicationof the final version of the paper.
Fig. 8: Comparison of feature, embedding, and model-level adaptation approaches. Speaker (middle) and domain (bottom) adaptations are based on {utterance, speaker} and {accent, child, domain, disordered} clusters, respectively.

Depending on which property we want to measure, the analysis set can be split into smaller subsets, such as the ones shown in the lower part of Fig. 7. The majority of analyses in this review are reported for models adapted using a single method, with some additional groupings used to better capture details such as the complementarity of adaptation methods or their performance in different operating regimes.

As mentioned in Sec. IV, adaptation methods were historically categorized based on the level at which they operated in the speech processing pipeline. Fig. 8 (top) quantifies the ASR performance along this attribute, showing that model-based adaptation obtains the best average improvement of 11.8%, followed by embedding and feature levels at 7.2% and 5.0% RERR, respectively. This is not surprising, as model-level adaptation allows large amounts of adaptation data to be leveraged by allowing the update of large portions of the model (including re-training the whole model). In more data-constrained regimes, such as utterance or speaker-level adaptation, where only a limited amount of adaptation data is typically available, the differences are less pronounced: model-based speaker adaptation obtains 8.9% RERR while adapting to domains gives 15.5% RERR (cf. middle and bottom plots in Fig. 8). Embedding approaches stay at a similar level for speaker adaptation, improving to 9.2% RERR for domain adaptation (although based on only two studies). Feature-space domain adaptation was used in only one study, which reported a small deterioration of -0.3% RERR.

The results for different adaptation clusters, introduced in Sec. II, are shown in Fig. 9. Models benefit more when adapting to accent, from adult to child speech, to the domain, and to disordered speech conditions (such as those arising from speech motor disorders), as opposed to speaker or utterance adaptation.

Fig. 9: Adaptation results for different adaptation clusters.
Fig. 10: Comparison of adaptation results for hybrid and E2E systems.

This is expected, since domain adaptation usually has more adaptation data, and the acoustic mismatch introduced by unseen domains is greater than the mismatch caused by unseen speakers – unless these are substantially mismatched to the training data, as is often the case for child or disordered speech recognition. But in the latter case the adaptation is typically not carried out at the speaker level, but at the domain level (i.e. tailoring the acoustic model to better handle dysarthric speech, not a single dysarthric speaker).

Fig. 10 aggregates the adaptation results along the two main neural network-based ASR approaches – hybrid and E2E. It is interesting to observe that E2E systems benefit more from adaptation (12.8% RERR) than hybrid systems (9.2% RERR) in both the overall and speaker-based regimes. This reverses for domain adaptation, with E2E and hybrid improving by 12.2% and 14.9% RERR, respectively. This is somewhat expected, as hybrid systems benefit from strong inductive biases – such as access to pronunciation dictionaries and hand-engineered modelling constraints – whereas E2E models must learn these from data. Given limited amounts of training data, one may expect E2E models to struggle to learn these as well as hybrid models, and as such adaptation brings greater gains. These results suggest that adaptation for E2E is a promising direction for future investigation that remains under-investigated as of now – there are 10 studies in total on this topic in this meta-review.

Next we compare feed-forward (FF) and recurrent neural network (RNN) architectures in both hybrid and E2E models.
Fig. 11: Comparison of adaptation results for FF and RNN architectures.

Hybrid models can leverage either FF or RNN architectures, while most E2E systems use some form of RNN. (Note that transformer-based E2E models [301] are built from FF (CNN) modules; however, due to their relative novelty in ASR, there is only one accent adaptation study included in our meta-analysis [225].) Fig. 11 reports similar adaptation gains of 9.8% RERR for both FF and RNN architectures. RNNs seem to benefit more when adapting to speakers (9.2% vs 7.4% RERR for RNN and FF, respectively), and less when adapting to the domain (10.4% vs 17.0% RERR for RNN and FF, respectively). When controlling for the system paradigm (E2E vs. hybrid), RNNs mostly benefit through adapting E2E models
(cf. Fig. 12: 6.6% vs 15.7% RERR for Hybrid (RNN) and E2E (RNN), respectively). We observed a similar trend for the speaker and domain clusters separately (figure not shown).

Fig. 13 compares the RERR for unsupervised and supervised modes of adaptation. Overall, deriving the adaptation transform with manually annotated targets results in an average 12.8% RERR, whereas unsupervised methods result in 8% RERR. Fig. 13 also shows results specifically for semi-supervised adaptation, which are captured by the 2pass and enrol (Unsup.) conditions. Fig. 14 shows further analysis on the modes of deriving adaptation statistics (Sec. II). Both online and two-pass adaptation are unsupervised, while the enrolment mode may be either supervised or unsupervised. The supervised approach offers the most accurate adaptation, as expected. Unsupervised enrolment outperforms the other two unsupervised methods, mainly due to the T/S domain adaptation study [243] (Sec. XI) that leverages large amounts of data. When considering speaker adaptation only, the two-pass approach obtains 8.2% RERR and is more effective than enrol (Unsup.) (7.3% RERR) and online adaptation (6.5% RERR).

Finally, we consider the overall trends for the considered systems and their operating regions. Fig. 15 reports results obtained with different amounts of adaptation data. Fig. 16 further shows regression trends when splitting by adaptation type, hybrid or E2E, and adaptation clusters. These are in line with the observations so far: i) more adaptation data brings (on average) larger improvements; ii) model-based adaptation is more powerful and gives better results than embedding or feature-based approaches; and iii) adaptation is particularly effective in large-mismatch scenarios where obtaining matched training data is difficult.
Fig. 12: Comparison of adaptation results for FF and RNN architectures split by hybrid and E2E systems.
Fig. 13: Comparison of adaptation results for supervision modes.
Fig. 14: Comparison of adaptation results for different adaptation targets: online adaptation, supervised and unsupervised enrolment, and two-pass decoding.
Fig. 15: Comparison of adaptation results for different amounts of adaptation data.
Fig. 16: Regression analysis (RERR vs. amount of adaptation data) for the three major control variables: (a) adaptation type, (b) hybrid vs. E2E, (c) adaptation clusters.
Fig. 17: Comparison of results for different adaptation approaches.

Since this meta-analysis combines results across many different studies with many reference systems, the results are not necessarily comparable at the sample level; rather, they should be read in aggregated form, to outline dominant trends and the typical data regimes in which each category has been tried. For the purpose of plotting, the data amounts for some systems were assumed to be at a given level: e.g. two-pass systems, unless stated otherwise, were assumed to use 10 minutes per speaker, and embedding approaches 30 seconds.
C. Detailed findings
In this subsection we investigate the effect of the specific approach to adaptation, beyond the broad categories discussed above. Fig. 17 reorganizes the earlier split into feature, embedding, and model-level adaptation (Fig. 8) into embedding-based (cf. Sec. V) and model-based transformations (cf. Sec. VI).

For the embeddings, we introduce three sub-categories referred to as GMMEmb, NNEmb and NNTransformEmb. GMMEmb comprises GMM-related embedding extractors, primarily based on i-vectors [8], [58], [105], but also includes adaptation results for other GMM-derived (GMMD) features [144]. NNEmb denotes neural network-based embedding extractors that estimate speaker/utterance statistics from speaker-independent acoustic features. Examples of NNEmb approaches include ⋆-vector techniques, such as d-vectors [135] and x-vectors [106], discussed in Sec. V, sentence-level embeddings [60], [122], and other bottleneck approaches [124], [126]. NNTransformEmb denotes transformed embeddings which typically rely on i-vectors as input instead of acoustic features. These have been proposed to help alleviate issues related to inconsistent DNN adaptation performance when using raw i-vectors [58], [59], [302]. The NNTransformEmb group includes studies performing standard i-vector transformations with NNEmb [70], [133], [211], but also more recent memory-based approaches in which an embedding is selected via attention from an embedding inventory fixed at the training stage [62], [63]. As shown in Fig. 17, GMMEmb, NNEmb and NNTransformEmb obtain 8.1%, 5.2% and 9.2% average RERR, respectively.

The second group in Fig. 17 comprises model-based approaches split into Linear Transform (LT), Activation, and Finetuning-based methods. LT methods introduce new speaker-dependent affine transformations in the model, either in the form of new LIN/LHN/LON layers (i.e. [141], [149], [151], [274]) or transforms estimated using a GMM system, such as fMLLR [8], [107], [141], [303]. Finetune refers to approaches in which the adaptation is carried out by altering a subset of the existing model parameters. This is often done in a similar manner to an LT approach, by adapting an input, output and/or one or more hidden layers that are already present in the model [108], [147], [206], [207]. Finally, activation methods perform adaptation by introducing speaker-dependent parameters in the activation functions of the neural network [107], [156], [304]–[306]. Note that, as outlined in Sec. VI, some of the activation-based methods can be expressed as constrained LT methods. LT, Activation and Finetune-based methods score 6.7%, 9.0% and 13.9% average RERR, respectively. Fig. 18 (a) shows the regression trends with respect to the amount of adaptation data for each of the six considered categories.
Fig. 18: Regression analysis for adaptation families, speaker-adaptive training and adaptation losses.
Fig. 19: Comparison of adaptation results for SAT vs. test-only modes.

The use of embeddings implies that the acoustic model is trained in a speaker-adaptive manner, whereas the majority of model-based techniques are carried out in a test-only manner – meaning that speaker-level information is not used during training – though some methods offer SAT variants [161], [307]. Fig. 19 shows that SAT-trained systems offer a small advantage (8% vs. 7.6% RERR) when adapted with limited amounts of data (up to around 10 minutes). When looking at the average performance across all data points, however, test-only approaches obtain 10.8% RERR, primarily because of greater adaptation gains for larger amounts of data. See also Fig. 18 (b) for the operating regions of SAT and non-SAT systems.

Fig. 20 quantifies the gains for different adaptation objectives and regularization approaches – results for the online condition are given only for reference, as in this case the adaptation information is obtained via an embedding extractor (which is usually, although not always [211], left un-updated). The second group depicts approaches where the adaptation information is derived by adapting a GMM in model space using an MLE or MAP criterion when extracting speaker-adapted auxiliary features for NN training [144], [308], or by estimating fMLLR transforms with MLE under a GMM to obtain speaker-adapted acoustic features [8], [141], [303].

The third group comprises methods which aim to explicitly match the model's output distribution to the one found in the adaptation data. CE is a non-regularized frame-level cross-entropy baseline obtaining 8.7% average RERR. This can be improved to 14.8% average RERR by penalizing the adapted model's predictions so that they do not deviate too much from the speaker-independent variant, using KL regularization (CE-KL) [109]. KL regularization can be applied to either CE or sequential objective functions [148], although most models estimated in a sequential discriminative manner can be successfully adapted with a CE (or CE-KL) criterion [71], [162], [189], [279] (see also Fig. 21). Teacher-student (T/S) learning [229] is a special case (see Sec. XI) where the adaptation is carried out with targets directly produced by a teacher model, rather than targets obtained from first-pass decodes (possibly KL-regularized with the SI model). T/S makes it possible to leverage large amounts of unsupervised data, and in this analysis was found to offer an average 28.2% RERR when adapting to domains [222], [231], [243].

The final group in Fig. 20 includes objectives that try to leverage auxiliary information at the objective function level. Meta-learning [172], [173], [225] estimates the adaptation hyper-parameters jointly with the adaptation transform, while multi-task learning [110], [113], [181], [271] leverages additional phonetic priors to circumvent the (potential) sparsity of senones when adapting with small amounts of data. Meta-learning and multi-task adaptation obtain 6.8% and 7.6% average RERR, respectively. See also Fig. 18 (c).

Fig. 21 further summarizes the adaptability of acoustic models trained in a frame-based (CE) or sequential (Seq) manner. The results indicate that sequential models benefit more from adaptation than frame-based systems (11.6% vs. 9.8% average RERR). However, when controlling for the same data-set and baseline (reference systems were expected to exist for both CE and Seq), the difference decreases to around 0.6% RERR in favor of the frame-based systems.

Fig. 22 compares the adaptation gains obtained using various model architectures.
LSTM benefits the most (15.4% average RERR). The feed-forward TDNN, DNN, and ResNet architectures all improve by around 10.5% RERR. Smaller gains were observed for Transformer, CNN and BLSTM models, improving by 7.6%, 6.5% and 4.9% average RERR, respectively. This result is somewhat expected, as the last three architectures either normalize some of the variability by design, or have access to a larger speech context during recognition.

In Fig. 23 we study the complementarity of the different adaptation techniques.
Fig. 20: Comparison of results for different adaptation loss functions.
Fig. 21: Comparison of adaptation results for acoustic models trained with CE and sequence-level objectives (all samples, and controlling for data-set and baseline).
Fig. 22: Comparison of adaptation results for different architectures.
Fig. 23: Complementarity of selected adaptation techniques.

These results are based on 22 samples and 6 studies for which there was a complete set of baseline experiments allowing improvements to be quantified when adapting an SI model with Method1, and then measuring further gains when adding Method2. Fig. 23 shows that, on average, stacking adaptation techniques improved the adaptation performance by an additional 4 percentage points, from 8% to 12% RERR.

Finally, in Fig. 24 we report results for all techniques included in the meta-analysis. These are based on samples where only a single method was used to adapt the acoustic model (cf. Fig. 7, middle), spanning results for all adaptation clusters (cf. Fig. 9). These should not be directly compared owing to differences in operating regions, but they offer an indication of the performance of the individual methods.
D. Speech styles, applications, languages
In this subsection, we analyze the efficacy of adaptationmethods across acoustic and linguistic dimensions by reportingadaptation gains for different types of speech styles, applica-tions (including ones with a large mismatch to the trainingconditions), and languages.Fig. 25 compares gains as obtained for different speechstyles. At the top we report three special cases spanningdisordered, children’s, and accented speech (these are similarto the adaptation clusters from Fig. 9). As expected, acousticmodels estimated largely from adult speech of healthy indi-viduals perform poorly in these highly mismatched domains,especially for disordered and children’s speech, and domainadaptation improves ASR by over 50% average RERR.Performance gains from adapting models with accentedspeech are similar to that obtained on other speech tasks.
Fig. 24: Comparison of adaptation results for the standalone techniques.

Note that the presence of non-native speakers in (English) training corpora is fairly common, so the underlying acoustic models may learn to better normalize this variability at the training stage. Interestingly, adaptation brings relatively larger gains in commercial applications such as VoiceSearch and Dictation tasks (14% RERR on average). This is also visible in Fig. 26, comparing performance on public and proprietary data. We hypothesize that commercial data is more likely to contain a mix of speech from a diverse set of speakers (including non-native speakers and children) and thus benefits more from adaptation. Another explanation could be that the public benchmarks have been around for some time, and systems built on these are likely to be more over-fitted in general.

Finally, Fig. 27 summarizes adaptation performance for several languages. Note that speaker adaptation was performed on English, French, Japanese, and Mandarin, while for Korean and Italian we only report adaptation gains for disordered and children's speech recognition. The overall improvements for non-English languages when adapting to speakers are similar to the gains obtained for English when controlling for the adaptation method (i.e. improvements are between 6 and 10% average RERR), giving some evidence that adaptation helps to a similar degree for different languages, and that some of these primarily English-based findings generalize across languages.
Fig. 25: Comparison of adaptation results for different speech styles.
XIV. SUMMARY AND DISCUSSION
The rapid developments in speech recognition over the pastdecade have been driven by deep neural network modelsof acoustics, deployed in both hybrid and E2E systems.
Fig. 26: Performance of adaptation techniques as obtained on public and proprietary data-sets.
Fig. 27: Adaptation gains for different languages.

Compared to the previous state-of-the-art approaches based on GMMs, for which adaptation relied largely on linear transforms of the model parameters and acoustic features, neural network-based systems are less constrained and more flexible, and are open to a richer set of adaptation algorithms.

In this overview article we have surveyed approaches to the adaptation of neural network-based speech recognition systems. We structured the field into embedding-based, model-based, and data augmentation adaptation approaches, arguing that this organization gives a more coherent understanding of the field compared with the usual split into feature-based and model-based approaches. We presented these adaptation algorithms in the context of speaker adaptation, with a discussion of their application to accent and domain adaptation.

A key aspect of this overview was a meta-analysis of recent published results for the adaptation of speech recognition systems. The meta-analysis indicates that adaptation algorithms apply successfully to both hybrid and E2E systems, across different corpora and adaptation classes.

E2E modeling is less mature than the hybrid approach, and much of the research focus on E2E modeling is on improving the general modeling technology. Therefore, in this overview paper, many more adaptation methods were introduced in the context of hybrid systems. However, most adaptation technologies successfully applied to hybrid models by adapting the acoustic model or language model should also work well for E2E models, because E2E models usually contain sub-networks corresponding to the acoustic model and language model of hybrid systems; this is supported by the findings in our meta-analysis.

Different from hybrid models, in which components are optimized separately, E2E models are optimized using a single objective function. Therefore, E2E models tend to memorize the training data more, and hence generalization or robustness to unseen data [185] is challenging for E2E models. Consequently, adaptation to new environments or new domains is very important for the large-scale application of E2E models. We expect more research in this direction as E2E modeling becomes increasingly mainstream in ASR.

Because the size of E2E models is much smaller than that of hybrid models, E2E models have clear advantages when deployed on device. Therefore, personalization or adaptation of E2E models [114], [115], [120], [121] is a rapidly growing area. While it is possible to adapt every user's model in the cloud and then push it back to the device, it is more reasonable to adapt the model on device, which requires adjusting the adaptation algorithm to overcome the challenges of limited memory and computation power [114]. Another interesting direction for the adaptation of E2E models is how to leverage unpaired data, especially text-only data, in a new domain. In [121], several methods have been explored in this direction, but we expect more innovations there.

Adaptation algorithms are often deployed for conditions in which there is very limited labeled data, or none at all. In this case unsupervised and semi-supervised learning approaches are central, and indeed many current adaptation approaches strongly leverage such algorithms. However, there are significant open research challenges in this area, particularly relating to unsupervised and semi-supervised training of E2E systems, using methods which are able to propagate uncertainty.
Domain adaptation has become central to work in computer vision and image processing, as discussed in Sec. I, with large-scale base models (typically trained on ImageNet) being adapted to specific tasks. The closest analogies to this in speech recognition are some of the domain adaptation approaches discussed in Sec. XI, and multilingual speech recognition. The idea of shared multilingual representations with language-specific or language-adaptive output layers was proposed in 2013 [208]–[210] and has become a standard architectural pattern (see the sketch at the end of this section). More recently, several authors have proposed highly multilingual E2E systems with a shared multilingual output layer [309]–[312], with the potential to be adapted to new languages.
State-of-the-art NLP systems are characterized by an unsupervised, large-scale base model [45], [301], which may then be adapted to specific domains and tasks [46]. An analogous approach for speech recognition would be based on the unsupervised learning of speech representations from diverse and potentially multilingual speech recordings. Initial work in this direction includes unsupervised learning from large-scale multilingual speech data [313], [314]. More generally, deep probabilistic generative modeling has become a highly active research area, in particular through approaches such as normalizing flows [53], [54], [56], [57]. Such deep generative models offer different ways of addressing the problem of adaptation, including powerful approaches to data augmentation and the development of rich adaptation algorithms building on a base model with a joint distribution over acoustics and symbols. This offers the possibility of fine-tuning general encoders to specific acoustic domains, and adapting the decoder to specific tasks (such as speech recognition, speaker identification, language recognition, or emotion recognition), noting that classic adaptation to speakers can bring further gains [315], [316].
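To make the shared-encoder pattern referred to above concrete, the following is a minimal PyTorch-style sketch with illustrative layer choices and hypothetical names (it does not reproduce the configuration of any cited system): a shared encoder feeds language-specific output layers, and adapting to a new language amounts to attaching and training a new head, optionally with light fine-tuning of the shared encoder.

import torch.nn as nn

class MultilingualASRModel(nn.Module):
    # Shared multilingual encoder with one output layer per language.
    def __init__(self, feat_dim, hidden_dim, vocab_sizes):
        super().__init__()
        # shared acoustic encoder (illustrative choice: a deep bidirectional LSTM)
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=4,
                               batch_first=True, bidirectional=True)
        # language-specific output layers over each language's output units
        self.heads = nn.ModuleDict({
            lang: nn.Linear(2 * hidden_dim, vocab)
            for lang, vocab in vocab_sizes.items()
        })

    def forward(self, feats, lang):
        enc, _ = self.encoder(feats)      # (batch, time, 2 * hidden_dim)
        return self.heads[lang](enc)      # per-frame logits for that language

    def add_language(self, lang, vocab_size):
        # adaptation to a new language: attach a new head to be trained,
        # while the shared encoder is frozen or lightly fine-tuned
        self.heads[lang] = nn.Linear(2 * self.encoder.hidden_size, vocab_size)

When adaptation data for the new language is scarce, only the new head (a small fraction of the parameters) needs to be trained.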
REFERENCES
[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,”
IEEE Signal ProcessingMagazine , vol. 29, no. 6, pp. 82–97, November 2012.[2] F. Seide, G. Li, and D. Yu, “Conversational speech transcription usingcontext-dependent deep neural networks,” in
Interspeech , 2011.[3] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis,X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall,“English conversational telephone speech recognition by humans andmachines,” in
Interspeech , 2017.[4] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen,Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li,J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognitionwith sequence-to-sequence models,” in
IEEE ICASSP , 2018.[5] H. Christensen, S. Cunningham, C. Fox, P. Green, and T. Hain, “Acomparative study of adaptive, automatic recognition of disorderedspeech,” in
Interspeech , 2012.[6] H. Liao, E. McDermott, and A. Senior, “Large scale deep neuralnetwork acoustic modeling with semi-supervised training data forYouTube video transcription,” in
IEEE ASRU , 2013, pp. 368–373.[7] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, “Large-scaledomain adaptation via teacher-student learning,” in
Interspeech , 2017.[8] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptationof neural network acoustic models using i-vectors,” in
IEEE ASRU ,2013.[9] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptationfor robust speech recognition via variational autoencoder-based dataaugmentation,” in
IEEE ASRU , 2017, pp. 16–23.[10] S. Tan and K. C. Sim, “Learning utterance-level normalisation usingvariational autoencoders for robust automatic speech recognition,” in
IEEE SLT , 2016, pp. 43–49.[11] M. J. Gales, “Cluster adaptive training of hidden Markov models,”
IEEE Transactions on Speech and Audio Processing , vol. 8, no. 4, pp.417–428, 2000.[12] T. Tan, Y. Qian, and K. Yu, “Cluster adaptive training for deep neuralnetwork based acoustic model,”
IEEE/ACM Transactions on Audio,Speech and Language Processing , vol. 24, no. 3, pp. 459–468, 2016.[13] P. C. Woodland, “Speaker adaptation for continuous density HMMs:A review,” in
ISCA Workshop on Adaptation Methods for SpeechRecognition , 2001.[14] K. Shinoda, “Speaker adaptation techniques for automatic speechrecognition,”
APSIPA ASC , 2011.[15] N. Morgan and H. A. Bourlard, “Neural networks for statisticalrecognition of continuous speech,”
Proceedings of the IEEE , vol. 83,no. 5, pp. 742–772, 1995.[16] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: Aneural network for large vocabulary conversational speech recognition,”in
IEEE ICASSP , 2016, pp. 4960–4964.[17] T. Robinson, M. Hochberg, and S. Renals, “The use of recurrent neuralnetworks in continuous speech recognition,” in
Automatic Speech andSpeaker Recognition , C.-H. Lee, F. K. Soong, and K. K. Paliwal, Eds.Kluwer, 1996, pp. 233–258.[18] A. J. Robinson, G. D. Cook, D. P. W. Ellis, E. Fosler-Lussier, S. J.Renals, and D. A. G. Williams, “Connectionist speech recognition ofbroadcast news,”
Speech Communication , vol. 37, pp. 27–45, 2002.[19] D. J. Kershaw, A. J. Robinson, and M. Hochberg, “Context-dependentclasses in a hybrid recurrent network-HMM speech recognition sys-tem,” in
Advances in Neural Information Processing Systems , 1996.[20] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang,“Phoneme recognition using time-delay neural networks,”
IEEE Trans-actions on Acoustics, Speech, and Signal Processing , vol. 37, no. 3,pp. 328–339, 1989. [21] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural networkarchitecture for efficient modeling of long temporal contexts.” in
Interspeech , 2015.[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-basedlearning applied to document recognition,”
Proceedings of the IEEE ,vol. 86, no. 11, pp. 2278–2324, 1998.[23] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn,and D. Yu, “Convolutional neural networks for speech recognition,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing ,vol. 22, no. 10, pp. 1533–1545, 2014.[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
NeuralComputation , vol. 9, no. 8, pp. 1735–1780, 1997.[25] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition withdeep recurrent neural networks,” in
IEEE ICASSP , 2013.[26] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-works,”
IEEE transactions on Signal Processing , vol. 45, no. 11, pp.2673–2681, 1997.[27] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognitionwith deep bidirectional LSTM,” in
IEEE ASRU , 2013.[28] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,“Attention-based models for speech recognition,” in
Advances in NeuralInformation Processing Systems , 2015, pp. 577–585.[29] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speechrecognition using deep RNN models and WFST-based decoding,” in
IEEE ASRU , 2015, pp. 167–174.[30] L. Lu, X. Zhang, K. Cho, and S. Renals, “A study of the recurrent neu-ral network encoder-decoder for large vocabulary speech recognition,”in
Interspeech , 2015.[31] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data andunits for streaming end-to-end speech recognition with rnn-transducer,”in
IEEE ASRU , 2017, pp. 193–199.[32] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu,S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers forend-to-end speech recognition,” in
IEEE ASRU , 2017, pp. 206–213.[33] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao,D. Rybach, A. Kannan, Y. Wu, R. Pang et al. , “Streaming end-to-endspeech recognition for mobile devices,” in
IEEE ICASSP , 2019.[34] J. Li, R. Zhao, H. Hu, and Y. Gong, “Improving RNN transducermodeling for end-to-end speech recognition,” in
IEEE ASRU , 2019.[35] A. Graves, S. Fern´andez, F. Gomez, and J. Schmidhuber, “Connection-ist temporal classification: labelling unsegmented sequence data withrecurrent neural networks,” in
ICML , 2006, pp. 369–376.[36] A. Hannun, “Sequence modeling with CTC,”
Distill , 2017,https://distill.pub/2017/ctc.[37] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv:1211.3711 , 2012.[38] L. R. Rabiner, “A tutorial on hidden markov models and selectedapplications in speech recognition,”
Proceedings of the IEEE , vol. 77,no. 2, pp. 257–286, 1989.[39] J. Li, Y. Wu, Y. Gaur, C. Wang, R. Zhao, and S. Liu, “On the compar-ison of popular end-to-end models for large scale speech recognition,”in
Interspeech , 2020.[40] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,“ImageNet: A large-scale hierarchical image database,” in
CVPR , 2009.[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,”
InternationalJournal of Computer Vision , vol. 115, no. 3, pp. 211–252, 2015.[42] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao,D. Mollura, and R. M. Summers, “Deep convolutional neural networksfor computer-aided detection: CNN architectures, dataset characteris-tics and transfer learning,”
IEEE Transactions on Medical Imaging ,vol. 35, no. 5, pp. 1285–1298, 2016.[43] S. Kornblith, J. Shlens, and Q. V. Le, “Do better ImageNet modelstransfer better?” in
CVPR , 2019.[44] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee,and L. Zettlemoyer, “Deep contextualized word representations,” in
NAACL/HLT , 2018, pp. 2227–2237.[45] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,”in
NAACL/HLT , 2019, pp. 4171–4186.[46] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learningwith a unified text-to-text transformer,”
Journal of Machine LearningResearch , vol. 21, no. 140, pp. 1–67, 2020.
[47] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, and R. Kurzweil, “Universal sentence encoder for English,” in
EMNLP , 2018, pp. 169–174.[48] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe,A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transferlearning for NLP,” in
ICML , 2019, pp. 2790–2799.[49] W. M. Kouw and M. Loog, “A review of domain adaptation withouttarget labels,”
IEEE Transactions on Pattern Analysis and MachineIntelligence , 2019, early access.[50] S. Ravi and H. Larochelle, “Optimization as a model for few-shotlearning,” in
ICLR , 2016.[51] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in
Advances in Neural Information Processing Systems ,2017, pp. 4077–4087.[52] X. Li, S. Dalmia, D. R. Mortensen, J. Li, A. W. Black, and F. Metze,“Towards zero-shot learning for automatic phonemic transcription.” in
AAAI , 2020, pp. 8261–8268.[53] D. Rezende and S. Mohamed, “Variational inference with normalizingflows,” in
ICML , 2015, pp. 1530–1538.[54] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, andB. Lakshminarayanan, “Normalizing flows for probabilistic modelingand inference,” arXiv:1912.02762 , 2019.[55] S. S. Chen and R. A. Gopinath, “Gaussianization,” in
Advances inNeural Information Processing Systems , 2001, pp. 423–429.[56] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-basedgenerative network for speech synthesis,” in
IEEE ICASSP , 2019, pp.3617–3621.[57] J. Serr`a, S. Pascual, and C. S. Perales, “Blow: a single-scale hypercon-ditioned flow for non-parallel raw-audio voice conversion,” in
Advancesin Neural Information Processing Systems , 2019, pp. 6793–6803.[58] A. Senior and I. Lopez-Moreno, “Improving DNN speaker indepen-dence with i-vector inputs,” in
IEEE ICASSP , 2014, pp. 225–229.[59] P. Karanasou, Y. Wang, M. J. Gales, and P. C. Woodland, “Adaptationof deep neural network acoustic models using factorised i-vectors,” in
Fifteenth Annual Conference of the International Speech Communica-tion Association , 2014.[60] K. Vesel´y, S. Watanabe, K. ˇZmol´ıkov´a, M. Karafi´at, L. Burget, andJ. H. ˇCernock´y, “Sequence summarizing neural network for speakeradaptation,” in
IEEE ICASSP , 2016.[61] M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Latent Dirichletallocation based organisation of broadcast media archives for deepneural network adaptation,” in
IEEE ASRU , 2015, pp. 130–136.[62] J. Pan, G. Wan, J. Du, and Z. Ye, “Online speaker adaptation usingmemory-aware networks for speech recognition,”
IEEE/ACM Transac-tions on Audio, Speech, and Language Processing , vol. 28, pp. 1025–1037, 2020.[63] L. Sarı, N. Moritz, T. Hori, and J. Le Roux, “Unsupervised speakeradaptation using attention-based speaker memory for end-to-end ASR,”in
IEEE ICASSP , 2020, pp. 7384–7388.[64] Z.-P. Zhang, S. Furui, and K. Ohtsuki, “On-line incremental speakeradaptation with automatic speaker change detection,” in
IEEE ICASSP ,2000, pp. II.961–II.964.[65] H. Huang and K. C. Sim, “An investigation of augmenting speaker rep-resentations to improve speaker normalisation for DNN-based speechrecognition,” in
IEEE ICASSP , 2015.[66] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, andO. Vinyals, “Speaker diarization: A review of recent research,”
IEEETransactions on Audio, Speech, and Language Processing , vol. 20,no. 2, pp. 356–370, 2012.[67] L. Mathias, G. Yegnanarayanan, and J. Fritsch, “Discriminative trainingof acoustic models applied to domains with unreliable transcripts[speech recognition applications],” in
IEEE ICASSP , 2005.[68] S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investigating dataselection for minimum phone error training of acoustic models,” in
IEEE ICME , 2007.[69] S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-supervisedmodel training for unbounded conversational speech recognition,” arXiv:1705.09724 , 2017.[70] Y. Miao, H. Zhang, and F. Metze, “Speaker adaptive training ofdeep neural network acoustic models using i-vectors,”
IEEE/ACMTransactions on Audio, Speech and Language Processing , vol. 23,no. 11, pp. 1938–1949, 2015.[71] P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contribu-tions for unsupervised acoustic model adaptation,”
IEEE Transactionson Audio, Speech, and Language Processing , vol. 14, pp. 1450–1463,2016. [72] M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervisedMLLR for speaker adaptation,” in
ISCA ASR2000 Workshop , 2000.[73] T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice-based unsuper-vised acoustic model training,” in
IEEE ICASSP , 2011.[74] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-supervisedtraining of acoustic models using lattice-free MMI,” in
IEEE ICASSP ,2018.[75] O. Klejch, J. Fainberg, P. Bell, and S. Renals, “Lattice-based un-supervised test-time adaptation of neural network acoustic models,” arXiv:1906.11521 , 2019.[76] H. Suzuki, H. Kasuya, and K. Kido, “The acoustic parameters forvowel recognition without distinction of speakers,” in
Proc. 1967 Conf.Speech Comm. and Process , 1967, pp. 92–96.[77] L. Gerstman, “Classification of self-normalized vowels,”
IEEE Trans-actions on Audio and Electroacoustics , vol. 16, no. 1, pp. 78–80, 1968.[78] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linearregression for speaker adaptation of continuous density hidden Markovmodels,”
Computer Speech & Language , vol. 9, no. 2, pp. 171–185,1995.[79] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation formultivariate Gaussian mixture observations of Markov chains,”
IEEETransactions on Audio, Speech, and Language Processing , vol. 2, no. 2,pp. 291–298, 1994.[80] M. Gales and S. Young, “The application of hidden Markov modelsin speech recognition,”
Foundations and Trends in Signal Processing ,vol. 1, no. 3, pp. 195–304, 2008.[81] K. Johnson, “Speaker normalization in speech perception,” in
TheHandbook of Speech Perception . Wiley Online Library, 2005, pp.363–389.[82] K. Paliwal and W. Ainsworth, “Dynamic frequency warping for speakeradaptation in automatic speech recognition,”
Journal of Phonetics ,vol. 13, no. 2, pp. 123 – 134, 1985.[83] Y. Grenier, “Speaker adaptation through canonical correlation analysis,”in
IEEE ICASSP , vol. 5, 1980, pp. 888–891.[84] K. Choukri and G. Chollet, “Adaptation of automatic speech recogniz-ers to new speakers using canonical correlation analysis techniques,”
Computer Speech & Language , vol. 1, no. 2, pp. 95–107, 1986.[85] H. Wakita, “Normalization of vowels by vocal-tract length and itsapplication to vowel identification,”
IEEE Transactions on Acoustics,Speech, and Signal Processing , vol. 25, no. 2, pp. 183–192, 1977.[86] A. Andreou, “Experiments in vocal tract normalization,” in
CAIPWorkshop: Frontiers in Speech Recognition II , 1994.[87] E. Eide and H. Gish, “A parametric approach to vocal tract lengthnormalization,” in
Interspeech , 1996.[88] L. Lee and R. C. Rose, “Speaker normalization using efficient fre-quency warping procedures,” in
IEEE ICASSP , 1996.[89] D. Kim, S. Umesh, M. Gales, T. Hain, and P. Woodland, “Using VTLNfor broadcast news transcription,” in
ICSLP , 2004.[90] G. Garau, S. Renals, and T. Hain, “Applying vocal tract lengthnormalization to meeting recordings,” in
Interspeech , 2005.[91] S. Furui, “A training procedure for isolated word recognition sys-tems,”
IEEE Transactions on Acoustics, Speech, and Signal Processing ,vol. 28, no. 2, pp. 129–136, 1980.[92] K. Shikano, S. Nakamura, and M. Abe, “Speaker adaptation and voiceconversion by codebook mapping,” in
IEEE International Symposiumon Circuits and Systems , 1991, pp. 594–597.[93] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conver-sion through vector quantization,”
Journal of the Acoustical Society ofJapan (E) , vol. 11, no. 2, pp. 71–76, 1990.[94] M. Feng, F. Kubala, R. Schwartz, and J. Makhoul, “Improved speakeradaption using text dependent spectral mappings,” in
IEEE ICASSP ,vol. 1, 1988, pp. 131–134.[95] G. Rigoll, “Speaker adaptation for large vocabulary speech recognitionsystems using speaker markov models,” in
IEEE ICASSP , 1989, pp.5–8.[96] M. J. Hunt, “Speaker adaptation for word-based speech recognitionsystems,”
The Journal of the Acoustical Society of America , vol. 69,no. S1, pp. S41–S42, 1981.[97] S. J. Cox and J. S. Bridle, “Unsupervised speaker adaptation byprobabilistic spectrum fitting,” in
IEEE ICASSP , vol. 1, 1989, pp. 294–297.[98] M. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,”
Computer speech & language , vol. 12, no. 2,pp. 75–98, 1998.[99] L. Neumeyer, A. Sankar, and V. Digalakis, “A comparative study ofspeaker adaptation techniques,” in
Eurospeech , 1995.
[100] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact model for speaker-adaptive training,” in
ICSLP , 1996.[101] D. Povey, H.-K. J. Kuo, and H. Soltau, “Fast speaker adaptive trainingfor speech recognition,” in
Interspeech , 2008.[102] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, “Rapid speakeradaptation in eigenvoice space,”
IEEE Transactions on Speech andAudio Processing , vol. 8, no. 6, pp. 695–707, 2000.[103] K. Yu and M. J. Gales, “Discriminative cluster adaptive training,”
IEEETransactions on Audio, Speech, and Language Processing , vol. 14,no. 5, pp. 1694–1703, 2006.[104] K. C. Sim, Y. Qian, G. Mantena, L. Samarakoon, S. Kundu, andT. Tan, “Adaptation of deep neural network acoustic models forrobust automatic speech recognition,” in
New Era for Robust SpeechRecognition: Exploiting Deep Learning , S. Watanabe, M. Delcroix,F. Metze, and J. R. Hershey, Eds. Springer, 2017, pp. 219–243.[105] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,”
IEEE Transactions onAudio, Speech, and Language Processing , vol. 19, no. 4, pp. 788–798,2011.[106] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur,“X-vectors: Robust DNN embeddings for speaker recognition,” in
IEEEICASSP , 2018.[107] P. Swietojanski and S. Renals, “Learning hidden unit contributions forunsupervised speaker adaptation of neural network acoustic models,”in
IEEE SLT , 2014.[108] H. Liao, “Speaker adaptation of context dependent deep neural net-works,” in
IEEE ICASSP , 2013.[109] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularizeddeep neural network adaptation for improved large vocabulary speechrecognition,” in
IEEE ICASSP , 2013.[110] Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H.Lee, “Rapid adaptation for deep neural networks through multi-tasklearning,” in
Interspeech , 2015.[111] Y. Huang, L. He, W. Wei, W. Gale, J. Li, and Y. Gong, “Usingpersonalized speech synthesis and neural language generator for rapidspeaker adaptation,” in
IEEE ICASSP , 2020.[112] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation forend-to-end CTC models,” in
IEEE SLT , 2018, pp. 542–549.[113] Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Speaker adaptation for attention-based end-to-end speech recognition,” in
Interspeech , 2019.[114] K. C. Sim, P. Zadrazil, and F. Beaufays, “An investigation into on-device personalization of end-to-end automatic speech recognitionmodels,” in
Interspeech , 2019.[115] Y. Huang, J. Li, L. He, W. Wei, W. Gale, and Y. Gong, “Rapid RNN-T adaptation using personalised speech synthesis and neural languagegenerator,” in
Interspeech , 2020.[116] Z. Fan, J. Li, S. Zhou, and B. Xu, “Speaker-aware speech-transformer,”in
IEEE ASRU , 2019, pp. 222–229.[117] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao,“Deep context: end-to-end contextual speech recognition,” in
IEEESLT , 2018, pp. 418–425.[118] Z. Chen, M. Jain, Y. Wang, M. L. Seltzer, and C. Fuegen, “End-to-endcontextual speech recognition using class language models and a tokenpassing decoder,” in
IEEE ICASSP , 2019, pp. 6186–6190.[119] M. Jain, G. Keren, J. Mahadeokar, and Y. Saraf, “Contextual RNN-Tfor open domain ASR,” arXiv:2006.03411 , 2020.[120] K. C. Sim, F. Beaufays, A. Benard et al. , “Personalization of end-to-end speech recognition on mobile devices for named entities,” arXiv:1912.09251 , 2019.[121] J. Li, R. Zhao, Z. Meng, Y. Liu, W. Wei, P. Parthasarathy, V. Mazalov,Z. Wang, L. He, S. Zhao, and Y. Gong, “Developing RNN-T modelssurpassing high-performance hybrid models with customization capa-bility,” in
Interspeech , 2020.[122] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani,“Auxiliary feature based adaptation of end-to-end ASR systems,” in
Interspeech , 2018.[123] M. Delcroix, K. Kinoshita, A. Ogawa, C. Huemmer, and T. Nakatani,“Context adaptive neural network based acoustic models for rapidadaptation,”
IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing , vol. 26, no. 5, pp. 895–908, 2018.[124] J. Rownicka, P. Bell, and S. Renals, “Embeddings for DNN speakeradaptive training,”
IEEE ASRU , 2019.[125] S. Garimella, A. Mandal, N. Strom, B. Hoffmeister, S. Matsoukas,and S. H. K. Parthasarathi, “Robust i-vector based adaptation of DNNacoustic model for speech recognition,” in
Interspeech , 2015. [126] T. Tan, Y. Qian, D. Yu, S. Kundu, L. Lu, K. C. Sim, X. Xiao,and Y. Zhang, “Speaker-aware training of LSTM-RNNs for acousticmodelling,” in
IEEE ICASSP , 2016, pp. 5280–5284.[127] S. H. K. Parthasarathi, B. Hoffmeister, S. Matsoukas, A. Mandal,N. Strom, and S. Garimella, “fMLLR based feature-space speakeradaptation of DNN acoustic models,” in
Interspeech , 2015.[128] M. Karafi´at, L. Burget, P. Matˇejka, O. Glembek, and J. ˇCernock`y,“iVector-based discriminative adaptation for automatic speech recogni-tion,” in
IEEE ASRU , 2011, pp. 152–157.[129] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speakerand session variability in GMM-based speaker verification,”
IEEETransactions on Audio, Speech, and Language Processing , vol. 15,no. 4, pp. 1448–1460, 2007.[130] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme forspeaker recognition using a phonetically-aware deep neural network,”in
IEEE ICASSP , 2014, pp. 1695–1699.[131] P. Kenny, T. Stafylakis, P. Ouellet, V. Gupta, and M. J. Alam,“Deep neural networks for extracting Baum-Welch statistics for speakerrecognition.” in
Speaker Odyssey , vol. 2014, 2014, pp. 293–298.[132] Y. Miao, H. Zhang, and F. Metze, “Towards speaker adaptive trainingof deep neural network acoustic models,” in
Interspeech , 2014.[133] Y. Miao, L. Jiang, H. Zhang, and F. Metze, “Improvements to speakeradaptive training of deep neural networks,” in
IEEE SLT , 2014, pp.165–170.[134] J. Rownicka, P. Bell, and S. Renals, “Analyzing deep CNN-basedutterance embeddings for acoustic model adaptation,” in
IEEE SLT ,2018, pp. 235–241.[135] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependentspeaker verification,” in
IEEE ICASSP , 2014.[136] X. Li and X. Wu, “Modeling speaker variability using long short-termmemory networks for speech recognition,” in
Interspeech , 2015.[137] Y. Khokhlov, A. Zatvornitskiy, I. Medennikov, I. Sorokin, T. Prisyach,A. Romanenko, A. Mitrofanov, V. Bataev, A. Andrusenko, M. Ko-renevskaya et al. , “R-vectors: New technique for adaptation to roomacoustics,”
Interspeech , 2019.[138] Z. Meng, H. Hu, J. Li, C. Liu, Y. Huang, Y. Gong, and C.-H. Lee,“L-vector: Neural label embedding for domain adaptation,” in
IEEEICASSP , 2020, pp. 7389–7393.[139] Y. Shi, Q. Huang, and T. Hain, “H-vectors: Utterance-level speakerembedding using a hierarchical attention model,” in
IEEE ICASSP ,2020, pp. 7579–7583.[140] X. Xie, X. Liu, T. Lee, and L. Wang, “Fast DNN acoustic modelspeaker adaptation by learning hidden unit contribution features,” in
Interspeech , 2019, pp. 759–763.[141] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcrip-tion,” in
IEEE ASRU , 2011, pp. 24–29.[142] S. P. Rath, D. Povey, K. Vesel`y, and J. Cernock`y, “Improved featureprocessing for deep neural networks,” in
Interspeech , 2013, pp. 109–113.[143] N. M. Joy, M. K. Baskar, S. Umesh, and B. Abraham, “DNNs forunsupervised extraction of pseudo FMLLR features without explicitadaptation data,” in
Interspeech , 2016, pp. 3479–3483.[144] N. Tomashenko and Y. Khokhlov, “GMM-derived features for effectiveunsupervised adaptation of deep neural network acoustic models,” in
Interspeech , 2015.[145] L. F. Uebel and P. C. Woodland, “An investigation into vocal tractlength normalisation,” in
Eurospeech , 1999.[146] D. Povey, G. Zweig, and A. Acero, “Speaker adaptation with anexponential transform,” in
IEEE ASRU , 2011, pp. 158–163.[147] J. Fainberg, O. Klejch, E. Loweimi, P. Bell, and S. Renals, “Acousticmodel adaptation from raw waveforms with SincNet,” in
IEEE ASRU ,2019.[148] Y. Huang and Y. Gong, “Regularized sequence-level deep neuralnetwork model adaptation,” in
Interspeech , 2015.[149] J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, andT. Robinson, “Speaker-adaptation for hybrid HMM-ANN continuousspeech recognition system,”
Eurospeech , 1995.[150] R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, “Linearhidden transformations for adaptation of hybrid ANN/HMM models,”
Speech Communication , vol. 49, no. 10-11, pp. 827–835, 2007.[151] B. Li and K. C. Sim, “Comparison of discriminative input and outputtransformations for speaker adaptation in the hybrid nn/hmm systems,”in
Interspeech , 2010.
[152] J. S. Bridle and S. J. Cox, “RecNorm: Simultaneous normalisation and classification applied to speech recognition,” in
Advances in NeuralInformation Processing Systems , 1991, pp. 234–240.[153] O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybridNN/HMM model for speech recognition based on discriminative learn-ing of speaker code,” in
IEEE ICASSP , 2013.[154] Z. Q. Wang and D. Wang, “Unsupervised speaker adaptation of batchnormalized acoustic models for robust ASR,” in
IEEE ICASSP , 2017.[155] F. Mana, F. Weninger, R. Gemello, and P. Zhan, “Online batchnormalization adaptation for automatic speech recognition,” in
ASRU ,2019.[156] C. Zhang and P. C. Woodland, “Parameterised sigmoid and ReLU hid-den activation functions for DNN acoustic modelling,” in
Interspeech ,2015.[157] Y. Zhao, J. Li, and Y. Gong, “Low-rank plus diagonal adaptation fordeep neural networks,” in
IEEE ICASSP , 2016.[158] Y. Zhao, J. Li, K. Kumar, and Y. Gong, “Extended low-rank plusdiagonal adaptation for deep and recurrent neural networks,” in
IEEEICASSP , 2017.[159] V. Abrash, H. Franco, A. Sankar, and M. Cohen, “Connectionistspeaker normalization and adaptation,” in
Eurospeech , 1995.[160] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, “Adaptation ofcontext-dependent deep neural networks for automatic speech recogni-tion,” in
IEEE SLT , 2012.[161] L. Samarakoon and K. C. Sim, “Learning factorized feature transformsfor speaker normalization,” in
IEEE ASRU , 2015.[162] ——, “Factorized hidden layer adaptation for deep neural networkbased acoustic modeling,”
IEEE/ACM Transactions on Audio, Speech,and Language Processing , vol. 24, no. 12, pp. 2241–2250, 2016.[163] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptivemixtures of local experts,”
Neural Computation , vol. 3, no. 1, pp. 79–87, 1991.[164] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM:Visual reasoning with a general conditioning layer,” in
AAAI , 2018.[165] L. Samarakoon and K. C. Sim, “Subspace LHUC for fast adaptationof deep neural network acoustic models.” in
Interspeech , 2016.[166] X. Cui, V. Goel, and G. Saon, “Embedding-based speaker adaptivetraining of deep neural networks,” in
Interspeech , 2017.[167] T. Kim, I. Song, and Y. Bengio, “Dynamic layer normalization foradaptive neural acoustic modeling in speech recognition,” in
Inter-speech , 2017.[168] L. Sarı, S. Thomas, M. Hasegawa-Johnson, and M. Picheny, “Speakeradaptation of neural networks with learning speaker aware offsets,” in
Interspeech , 2019.[169] X. Xie, X. Liu, T. Lee, and L. Wang, “Fast DNN acoustic modelspeaker adaptation by learning hidden unit contribution features,”
Interspeech , 2019.[170] J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural networkacoustic models with singular value decomposition.” in
Interspeech ,2013, pp. 2365–2369.[171] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decom-position based low-footprint speaker adaptation and personalization fordeep neural network,” in
IEEE ICASSP , 2014.[172] O. Klejch, J. Fainberg, and P. Bell, “Learning to adapt: a meta-learningapproach for speaker adaptation,” in
Interspeech , 2018.[173] O. Klejch, J. Fainberg, P. Bell, and S. Renals, “Speaker adaptivetraining using model agnostic meta-learning,” in
IEEE ASRU , 2019.[174] F. Weninger, J. Andr´es-Ferrer, X. Li, and P. Zhan, “Listen, attend,spell and adapt: Speaker adapted sequence-to-sequence ASR,” in
Interspeech , 2019.[175] Z. Meng, J. Li, and Y. Gong, “Adversarial speaker adaptation,” in
IEEEICASSP , 2019.[176] Z. Huang, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. Lee,“Maximum a posteriori adaptation of network parameters in deepmodels,” arXiv:1503.02108 , 2015.[177] X. Xie, X. Liu, T. Lee, S. Hu, and L. Wang, “BLHUC: Bayesianlearning of hidden unit contributions for deep neural network speakeradaptation,” in
IEEE ICASSP , 2019.[178] D. Povey, “Discriminative training for large vocabulary speech recog-nition,” Ph.D. dissertation, University of Cambridge, 2005.[179] R. Price, K.-i. Iso, and K. Shinoda, “Speaker adaptation of deep neuralnetworks using a hierarchy of output layers,” in
IEEE SLT , 2014.[180] R. Caruana, “Multitask learning,”
Machine learning , vol. 28, no. 1, pp.41–75, 1997.[181] P. Swietojanski, P. Bell, and S. Renals, “Structured output layerwith auxiliary targets for context-dependent acoustic modelling,” in
Interspeech , 2015. [182] J. B. Allen and D. A. Berkley, “Image method for efficiently simu-lating small-room acoustics,”
The Journal of the Acoustical Society ofAmerica , vol. 65, no. 4, pp. 943–950, 1979.[183] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “Astudy on data augmentation of reverberant speech for robust speechrecognition,” in
IEEE ICASSP , 2017, pp. 5220–5224.[184] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. Sainath, andM. Bacchiani, “Generation of large-scale simulated utterances in virtualrooms to train deep-neural networks for far-field speech recognition inGoogle Home,” in
Interspeech , 2017.[185] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview ofnoise-robust automatic speech recognition,”
IEEE/ACM Transactionson Audio, Speech, and Language Processing , vol. 22, pp. 745–777,2014.[186] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP)improves speech recognition,” in
ICML Workshop on Deep Learningfor Audio, Speech and Language , 2013.[187] X. Cui, V. Goel, and B. Kingsbury, “Data augmentation for deepneural network acoustic modeling,”
IEEE/ACM Transactions on Audio,Speech, and Language Processing , vol. 23, no. 9, pp. 1469–1477, 2015.[188] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentationfor speech recognition,” in
Interspeech , 2015.[189] Y. Huang and Y. Gong, “Acoustic model adaptation for presenta-tion transcription and intelligent meeting assistant systems,” in
IEEEICASSP , 2020.[190] Y. Stylianou, O. Capp´e, and E. Moulines, “Continuous probabilistictransform for voice conversion,”
IEEE Transactions on speech andaudio processing , vol. 6, no. 2, pp. 131–142, 1998.[191] X. Cui, V. Goel, and B. Kingsbury, “Data augmentation for deepconvolutional neural network acoustic modeling,” in
IEEE ICASSP .IEEE, 2015, pp. 4545–4549.[192] J. Fainberg, P. Bell, M. Lincoln, and S. Renals, “Improving children’sspeech recognition through out-of-domain data augmentation.” in
In-terspeech , 2016, pp. 1598–1602.[193] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, z. Chen,P. Nguyen, R. Pang, I. Lopez Moreno, and Y. Wu, “Transfer learningfrom speaker verification to multispeaker text-to-speech synthesis,” in
Advances in Neural Information Processing Systems , 2018, pp. 4480–4490.[194] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “Augmented cyclicadversarial learning for low resource domain adaptation,” in
ICLR ,2019.[195] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-imagetranslation using cycle-consistent adversarial networks,” in
IEEE ICCV ,2017.[196] A. Ali, N. Dehak, P. Cardinal, S. Khurana, S. H. Yella, J. Glass,P. Bell, and S. Renals, “Automatic dialect detection in Arabic broadcastspeech,” in
Interspeech , 2016.[197] A. Ali, S. Vogel, and S. Renals, “Speech recognition challenge in thewild: Arabic MGB-3,” in
IEEE ASRU , 2017.[198] A. Ali, S. Shon, Y. Samih, H. Mubarak, A. Abdelali, J. Glass, S. Renals,and K. Choukri, “The MGB-5 challenge: Recognition and dialectidentification of dialectal Arabic speech,” in
IEEE ASRU , 2019.[199] P. Smit, S. R. Gangireddy, S. Enarvi, S. Virpioja, and M. Kurimo,“Aalto system for the 2017 Arabic multi-genre broadcast challenge,”in
IEEE ASRU , 2017.[200] S. Khurana, A. Ali, and J. Glass, “Darts: Dialectal arabic transcriptionsystem,” arXiv:1909.12163 , 2019.[201] D. Vergyri, L. Lamel, and J.-L. Gauvain, “Automatic speech recognitionof multiple accented English data,” in
Interspeech , 2010.[202] Y. Zheng, R. Sproat, L. Gu, I. Shafran, H. Zhou, Y. Su, D. Jurafsky,R. Starr, and S.-Y. Yoon, “Accent detection and speech recognition forShanghai-accented Mandarin,” in
Interspeech , 2005.[203] L. W. Kat and P. Fung, “Fast accent identification and accented speechrecognition,” in
IEEE ICASSP , vol. 1, 1999, pp. 221–224.[204] M. Liu, B. Xu, T. Hunng, Y. Deng, and C. Li, “Mandarin accent adap-tation based on context-independent/context-dependent pronunciationmodeling,” in
IEEE ICASSP , vol. 2, 2000, pp. II1025–II1028.[205] U. Nallasamy, F. Metze, and T. Schultz, “Active learning for accentadaptation in automatic speech recognition,” in
IEEE SLT , 2012.[206] Y. Huang, D. Yu, C. Liu, and Y. Gong, “Multi-accent deep neuralnetwork acoustic model with accent-specific top layer using the KLD-regularized model adaptation,” in
Interspeech , 2014.[207] M. Chen, Z. Yang, J. Liang, Y. Li, and W. Liu, “Improving deepneural networks based multi-accent Mandarin speech recognition usingi-vectors and accent-specific top layer,” in
Interspeech , 2015.
[208] A. Ghoshal, P. Swietojanski, and S. Renals, “Multilingual training of deep neural networks,” in
IEEE ICASSP , 2013, pp. 7319–7323.[209] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato,M. Devin, and J. Dean, “Multilingual acoustic models using distributeddeep neural networks,” in
IEEE ICASSP , 2013, pp. 8619–8623.[210] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-languageknowledge transfer using multilingual deep neural network with sharedhidden layers,” in
IEEE ICASSP , 2013, pp. 7304–7308.[211] J. Yi, H. Ni, Z. Wen, and J. Tao, “Improving BLSTM RNN based Man-darin speech recognition using accent dependent bottleneck features,”in
IEEE APSIPA , 2016, pp. 1–5.[212] M. T. Turan, E. Vincent, and D. Jouvet, “Achieving multi-accent ASRvia unsupervised acoustic model adaptation,” in
Interspeech , 2020.[213] M. Elfeky, M. Bastani, X. Velez, P. Moreno, and A. Waters, “Towardsacoustic model unification across dialects,” in
IEEE SLT , 2016.[214] X. Yang, K. Audhkhasi, A. Rosenberg, S. Thomas, B. Ramabhadran,and M. Hasegawa-Johnson, “Joint modeling of accents and acousticsfor multi-accent speech recognition,” in
IEEE ICASSP , 2018.[215] A. Jain, M. Upreti, and P. Jyothi, “Improved accented speech recogni-tion using accent embeddings and multi-task learning,” in
Interspeech ,2018.[216] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation forend-to-end CTC models,” in
IEEE SLT , 2018.[217] T. Viglino, P. Motlicek, and M. Cernak, “End-to-end accented speechrecognition,” in
Interspeech , 2019.[218] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, “Domainadversarial training for accented speech recognition,” in
IEEE ICASSP ,2018, pp. 4854–4858.[219] B. Li, T. N. Sainath, K. C. Sim, M. Bacchiani, E. Weinstein, P. Nguyen,Z. Chen, Y. Wu, and K. Rao, “Multi-dialect speech recognition with asingle sequence-to-sequence model,” in
IEEE ICASSP , 2018.[220] M. Grace, M. Bastani, and E. Weinstein, “Occams adaptation: Acomparison of interpolation of bases adaptation methods for multi-dialect acoustic modeling with LSTMs,” in
IEEE SLT , 2018.[221] S. Yoo, I. Song, and Y. Bengio, “A highly adaptive acoustic model foraccurate multi-dialect speech recognition,” in
IEEE ICASSP , 2019, pp.5716–5720.[222] S. Ghorbani, A. E. Bulut, and J. H. Hansen, “Advancing multi-accentedLSTM-CTC speech recognition using a domain specific student-teacherlearning paradigm,” in
IEEE SLT , 2018, pp. 29–35.[223] A. Jain, V. P. Singh, and S. P. Rath, “A multi-accent acoustic modelusing mixture of experts for speech recognition,” in
Interspeech , 2019.[224] H. Zhu, L. Wang, P. Zhang, and Y. Yan, “Multi-accent adaptation basedon gate mechanism,” in
Interspeech , 2019.[225] G. I. Winata, S. Cahyawijaya, Z. Liu, Z. Lin, A. Madotto, P. Xu,and P. Fung, “Learning fast adaptation on cross-accented speechrecognition,” arXiv:2003.01901 , 2020.[226] Y. Long, Y. Li, H. Ye, and H. Mao, “Domain adaptation of lattice-free MMI based TDNN models for speech recognition,”
InternationalJournal of Speech Technology , 2017.[227] J. Fainberg, S. Renals, and P. Bell, “Factorised representations forneural network adaptation to diverse acoustic environments.” in
In-terspeech , 2017.[228] K. C. Sim, A. Narayanan, A. Misra, A. Tripathi, G. Pundak, T. N.Sainath, P. Haghani, B. Li, and M. Bacchiani, “Domain adaptationusing factorized hidden layer for robust automatic speech recognition.”in
Interspeech , 2018, pp. 892–896.[229] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, “Learning small-size DNNwith output-distribution-based criteria.” in
Interspeech , 2014.[230] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in aneural network,” arXiv:1503.02531 , 2015.[231] J. Li, R. Zhao, Z. Chen et al. , “Developing far-field speaker systemvia teacher-student learning,” in
IEEE ICASSP , 2018.[232] L. Moˇsner, M. Wu et al. , “Improving noise robustness of automaticspeech recognition via parallel data and teacher-student learning,” in
IEEE ICASSP , 2019, pp. 6475–6479.[233] T. Asami, R. Masumura, Y. Yamaguchi, H. Masataki, and Y. Aono,“Domain adaptation of DNN acoustic models using knowledge distil-lation,” in
IEEE ICASSP , 2017, pp. 5185–5189.[234] Z. Meng, J. Li, Y. Zhao, and Y. Gong, “Conditional teacher-studentlearning,” in
IEEE ICASSP , 2019, pp. 6445–6449.[235] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation bybackpropagation,” in
ICML , 2015, pp. 1180–1189.[236] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domainadaptation approach for robust speech recognition,”
Neurocomputing ,vol. 257, pp. 79 – 87, 2017. [237] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervisedadaptation with domain separation networks for robust speech recog-nition,” in
IEEE ASRU , 2017, pp. 214–221.[238] P. Denisov, N. T. Vu, and M. F. Font, “Unsupervised domain adaptationby adversarial learning for robust speech recognition,” in
SpeechCommunication; 13th ITG-Symposium , 2018, pp. 1–5.[239] M. Mimura, S. Sakai, and T. Kawahara, “Cross-domain speech recog-nition using nonparallel corpora with cycle-consistent adversarial net-works,” in
IEEE ASRU , 2017, pp. 134–140.[240] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “A multi-discriminator CycleGAN for unsupervised non-parallel speech domainadaptation,” in
Interspeech , 2018.[241] Z. Meng, J. Li, Y. Gong, and B.-H. F. Juang, “Cycle-consistent speechenhancement,” in
Interspeech , 2018.[242] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, “Adversarial teacher-studentlearning for unsupervised domain adaptation,” in
IEEE ICASSP , 2018,pp. 5949–5953.[243] Z. Meng, J. Li, Y. Gaur, and Y. Gong, “Domain adaptation via teacher-student learning for end-to-end speech recognition,” in
IEEE ASRU ,2019, pp. 268–275.[244] J. R. Bellegarda, “Statistical language model adaptation: Review andperspectives,”
Speech Communication , vol. 42, no. 1, pp. 93–108, 2004.[245] P. R. Clarkson and A. J. Robinson, “Language model adaptation usingmixtures and an exponentially decaying cache,” in
IEEE ICASSP ,vol. 2, 1997, pp. 799–802.[246] G. Tur and A. Stolcke, “Unsupervised language model adaptation formeeting recognition,” in
IEEE ICASSP , 2007.[247] X. Liu, M. J. Gales, and P. C. Woodland, “Context dependent languagemodel adaptation,” in
Interspeech , 2008.[248] R. Kuhn and R. De Mori, “A cache-based natural language modelfor speech recognition,”
IEEE Transactions on Pattern Analysis andMachine Intelligence , vol. 12, no. 6, pp. 570–583, 1990.[249] ——, “Corrections to “a cache-based language model for speechrecognition”,”
IEEE Transactions on Pattern Analysis and MachineIntelligence , vol. 14, no. 6, pp. 691–692, 1992.[250] M. Federico, “Bayesian estimation methods for n-gram language modeladaptation,” in
ICSLP , vol. 1, 1996, pp. 240–243.[251] M. Bacchiani and B. Roark, “Unsupervised language model adapta-tion,” in
IEEE ICASSP , 2003.[252] K. Seymore, S. Chen, and R. Rosenfeld, “Nonlinear interpolation oftopic models for language model adaptation,” in
ICSLP , 1998.[253] L. Chen, J.-L. Gauvain, L. Lamel, G. Adda, and M. Adda, “Usinginformation retrieval methods for language model adaptation,” in
Interspeech , 2001.[254] B.-J. P. Hsu and J. Glass, “Style & topic language model adaptationusing HMM-LDA,” in
EMNLP , 2006, pp. 373–381.[255] S. Huang and S. Renals, “Unsupervised language model adaptationbased on topic and role information in multiparty meetings,” in
Interspeech , 2008.[256] R. Kneser, J. Peters, and D. Klakow, “Language model adaptationusing dynamic marginals,” in
Fifth European Conference on SpeechCommunication and Technology , 1997.[257] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural prob-abilistic language model,”
Journal of Machine Learning Research ,vol. 3, pp. 1137–1155, 2003.[258] T. Mikolov, S. Kombrink, L. Burget, J. ˇCernock`y, and S. Khudanpur,“Extensions of recurrent neural network language model,” in
IEEEICASSP , 2011, pp. 5528–5531.[259] X. Chen, T. Tan, X. Liu, P. Lanchantin, M. Wan, M. J. Gales, and P. C.Woodland, “Recurrent neural network language model adaptation formulti-genre broadcast speech recognition,” in
Interspeech , 2015.[260] S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Recurrentneural network language model adaptation for multi-genre broadcastspeech recognition and alignment,”
IEEE/ACM Transactions on Audio,Speech, and Language Processing , vol. 27, no. 3, pp. 572–582, 2018.[261] K. Li, H. Xu, Y. Wang, D. Povey, and S. Khudanpur, “Recurrentneural network language model adaptation for conversational speechrecognition,” in
Interspeech , 2018, pp. 3373–3377.[262] T. Moriokal, N. Tawara, T. Ogawa, A. Ogawa, T. Iwata, andT. Kobayashi, “Language model domain adaptation via recurrent neuralnetworks with domain-shared and domain-specific representations,” in
IEEE ICASSP , 2018, pp. 6084–6088.[263] Y. Shi, M. Larson, and C. M. Jonker, “Recurrent neural network lan-guage model adaptation with curriculum learning,”
Computer Speech& Language , vol. 33, no. 1, pp. 136–154, 2015.
[264] S. R. Gangireddy, P. Swietojanski, P. Bell, and S. Renals, “Unsupervised adaptation of recurrent neural network language models,” in
Interspeech , 2016.[265] E. Grave, A. Joulin, and N. Usunier, “Improving neural languagemodels with a continuous cache,” in
ICLR , 2017.[266] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinelmixture models,” in
ICLR , 2016.[267] B. Krause, E. Kahembwe, I. Murray, and S. Renals, “Dynamic evalu-ation of neural sequence models,” in
ICML , 2018, pp. 2766–2775.[268] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin,F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingualcorpora in neural machine translation,” arXiv:1503.03535 , 2015.[269] E. McDermott, H. Sak, and E. Variani, “A density ratio approach tolanguage model fusion in end-to-end automatic speech recognition,” in
IEEE ASRU , 2019, pp. 434–441.[270] E. Variani, D. Rybach, C. Allauzen, and M. Riley, “Hybrid autoregres-sive transducer (HAT),” in
IEEE ICASSP , 2020, pp. 6139–6143.[271] S. Ghorbani and J. H. Hansen, “Leveraging native language informationfor improved accented speech recognition,” in
Interspeech , 2018.[272] V. Gupta, P. Kenny, P. Ouellet, and T. Stafylakis, “I-vector-basedspeaker adaptation of deep neural networks for French broadcast audiotranscription,” in
IEEE ICASSP , 2014, pp. 6334–6338.[273] M. Kim, Y. Kim, J. Yoo, J. Wang, and H. Kim, “Regularized speakeradaptation of kl-hmm for dysarthric speech recognition,”
IEEE Trans-actions on Neural Systems and Rehabilitation Engineering , vol. 25,no. 9, pp. 1581–1591, 2017.[274] M. Kitza, R. Schlter, and H. Ney, “Comparison of BLSTM-layer-specific affine transformations for speaker adaptation,” in
Interspeech ,2018, pp. 877–881.[275] C. Liu, Y. Wang, K. Kumar, and Y. Gong, “Investigations on speakeradaptation of LSTM RNN models for speech recognition,” in
IEEEICASSP , 2016, pp. 5020–5024.[276] H. Seki, K. Yamamoto, T. Akiba, and S. Nakagawa, “Rapid speakeradaptation of neural network based filterbank layer for automaticspeech recognition,” in
IEEE SLT , 2018, pp. 574–580.[277] R. Serizel and D. Giuliani, “Deep neural network adaptation forchildren’s and adults’ speech recognition,” in
Italian Conference onComputational Linguistics , 2014.[278] P. Swietojanski and S. Renals, “Differentiable pooling for unsupervisedacoustic model adaptation,”
IEEE/ACM Transactions on Audio, Speech,and Language Processing , vol. 24, no. 10, pp. 1773–1784, 2016.[279] P. C. Woodland, X. Liu, Y. Qian, C. Zhang, M. J. Gales, P. Karanasou,P. Lanchantin, and L. Wang, “Cambridge University transcriptionsystems for the multi-genre broadcast challenge,” in
IEEE ASRU , 2015,pp. 639–646.[280] C. Zhang and P. C. Woodland, “DNN speaker adaptation using pa-rameterised sigmoid and ReLU hidden activation functions,” in
IEEEICASSP , 2016, pp. 5300–5304.[281] J. Du, X. Na, X. Liu, and H. Bu, “AISHELL–2: Transforming Man-darin ASR Research Into Industrial Scale,” arXiv:1808.10583 , 2018.[282] J. Carletta, “Unleashing the killer corpus: experiences in creatingthe multi-everything AMI meeting corpus,”
Language Resources andEvaluation , vol. 41, no. 2, pp. 181–190, 2007.[283] B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, andM. Omologo, “Speaker independent continuous speech recognitionusing an acoustic-phonetic Italian corpus,” in
ICSLP , 1994.[284] N. Parihar, J. Picone, D. Pearce, and H. G. Hirsch, “Performanceanalysis of the Aurora large vocabulary baseline system,” in
EUSIPCO
IEEE ICASSP , 2003.[287] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “Ananalysis of environment, microphone and data simulation mismatchesin robust speech recognition,”
Computer Speech & Language , vol. 46,pp. 535–557, 2017.[288] K. Maekawa, “Corpus of Spontaneous Japanese: Its design and evalu-ation,” in
ISCA & IEEE Workshop on Spontaneous Speech Processingand Recognition , 2003.[289] G. Gravier, G. Adda, N. Paulsson, M. Carr, A. Giraudel, and O. Galib-ert, “The ETAPE corpus for the evaluation of speech-based TV contentprocessing in the French language,” in
LREC , 2012.[290] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff,“HKUST/MTS: A very large scale Mandarin telephone speech corpus,”in
Chinese Spoken Language Processing , Q. Huo, B. Ma, E.-S. Chng,and H. Li, Eds., 2006, pp. 724–735. [291] P. Bell, M. J. F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu,A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, “TheMGB challenge: Evaluating multi-genre broadcast media recognition,”in
IEEE ASRU
IEEE ICASSP, 1992, pp. 517–520. [294] M. Cettolo, C. Girardi, and M. Federico, “WIT3: Web inventory of transcribed and translated talks,” in
EAMT , 2012, pp. 261–268.[295] A. Rousseau, P. Delglise, and Y. Estve, “TED-LIUM: an automaticspeech recognition dedicated corpus,” in
LREC , 2012.[296] ——, “Enhancing the TED-LIUM corpus with selected data for lan-guage modeling and more TED talks,” in
LREC , 2014.[297] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett,N. L. Dahlgren, and V. Zue, “TIMIT acoustic phonetic continuousspeech corpus,”
Linguistic Data Consortium, LDC93S1 , 1993.[298] D. B. Paul and J. Baker, “The design for the Wall Street Journal-basedCSR corpus,” in
Speech and Natural Language Workshop , 1992.[299] A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani,M. Gerosa, C. Hacker, M. Russell, S. Steidl, and M. Wong, “ThePF STAR children’s speech corpus,” in
Interspeech , 2005.[300] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: anASR corpus based on public domain audio books,” in
IEEE ICASSP ,2015, pp. 5206–5210.[301] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Advances in Neural Information Processing Systems , 2017.[302] P. Karanasou, C. Wu, M. Gales, and P. C. Woodland, “I-vectors andstructured neural networks for rapid adaptation of acoustic models,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing ,vol. 25, no. 4, pp. 818–828, 2017.[303] A.-r. Mohamed, T. N. Sainath, G. Dahl, B. Ramabhadran, G. E. Hinton,and M. A. Picheny, “Deep belief networks using discriminative featuresfor phone recognition,” in
IEEE ICASSP . IEEE, 2011, pp. 5060–5063.[304] S. M. Siniscalchi, J. Li, and C.-H. Lee, “Hermitian polynomial forspeaker adaptation of connectionist speech recognition systems,”
IEEETransactions on Audio, Speech, and Language Processing , vol. 21,no. 10, pp. 2152–2161, 2013.[305] O. Abdel-hamid and H. Jiang, “Rapid and effective speaker adaptationof convolutional neural network based models for speech recognition,”in
Interspeech , 2013.[306] P. Swietojanski and S. Renals, “Differentiable pooling for unsupervisedspeaker adaptation,” in
IEEE ICASSP , 2015.[307] P. Swietojanski and S. Renals, “SAT-LHUC: Speaker adaptive trainingfor learning hidden unit contributions,” in
IEEE ICASSP , 2016.[308] N. Tomashenko, Y. Khokhlov, A. Larcher, and Y. Est`eve, “ExploringGMM-derived features for unsupervised adaptation of deep neuralnetwork acoustic models,” in
International Conference on Speech andComputer , 2016.[309] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein,and K. Rao, “Multilingual speech recognition with a single end-to-endmodel,” in
IEEE ICASSP , 2018, pp. 4904–4908.[310] A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, B. Ramabhadran,Y. Wu, A. Bapna, Z. Chen, and S. Lee, “Large-scale multilingualspeech recognition with a streaming end-to-end model,” in
Interspeech ,2019, pp. 2130–2134.[311] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, “Massivelymultilingual adversarial speech recognition,” in
NAACL/HLT , 2019.[312] V. Pratap, A. Sriram, P. Tomasello, A. Hannun, V. Liptchinsky, G. Syn-naeve, and R. Collobert, “Massively multilingual ASR: 50 languages,1 model, 1 billion parameters,” arXiv:2007.03001 , 2020.[313] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli,“Unsupervised cross-lingual representation learning for speech recog-nition,” arXiv:2006.13979 , 2020.[314] K. Kawakami, L. Wang, C. Dyer, P. Blunsom, and A. v. d.Oord, “Learning robust and multilingual speech representations,” arXiv:2001.11128 , 2020.[315] S. Pascual, M. Ravanelli, J. Serr`a, A. Bonafonte, and Y. Bengio,“Learning problem-agnostic speech representations from multiple self-supervised tasks,” in
Interspeech , 2019.[316] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro,J. Trmal, and Y. Bengio, “Multi-task self-supervised learning for robustspeech recognition,” in