Realizing Petabyte Scale Acoustic Modeling
Sree Hari Krishnan Parthasarathi, Nitin Sivakrishnan, Pranav Ladkat, Nikko Strom
Amazon.com, Inc., {sparta | nitins | ladkat | nikko}@amazon.com
Abstract—Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from the vast firehose of untranscribed audio data. Learning an AM from 1 Million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource efficient SSL system for AM. Employing the student/teacher learning paradigm, we focus on the student learning subsystem: a scalable and robust data pipeline that generates features and targets from raw audio, and an efficient model pipeline, including the distributed trainer, that builds a student model. Our evaluations show that, even without extensive hyper-parameter tuning, we obtain relative accuracy improvements in the 10 to 20% range, with higher gains in noisier conditions. The end-to-end processing time of this SSL system was 12 days, and several components in this system can trivially scale linearly with more compute resources.

Index Terms—Speech recognition, acoustic models, large scale semi-supervised learning, machine learning.
I. INTRODUCTION

Modern machine learning (ML) relies on large-scale data; for hard problems more data consistently lead to better models. In the field of automatic speech recognition, there is a well known maxim: there is no data like more data [1]. This has naturally led to the use of ever increasing amounts of speech data: from tens of hours of speech (TIDIGITS [2], TIMIT [3], WSJ [4]), to hundreds (Switchboard [5]), and thousands (Fisher corpus [6]). Recently, training data sizes on the order of ten thousand hours of speech are not unusual ([7], [8]), and while building an AM from a hundred thousand hours has still been uncommon, [9] showed that increasing from several thousand hours to a hundred thousand hours of lightly supervised speech data can improve speech recognition accuracy significantly.

This increase in the amount of data in ASR and other ML fields requires ever more efficient data and ML processing pipelines. A rich infrastructure has emerged, commercially and in Open Source communities, to serve both hardware and software requirements of large-scale ML. There have been extensive developments in powerful ML toolkits (Spark MLLib, mlpack, Scikit-Learn), ML cloud services (AzureML, Amazon ML, Google Cloud Machine Learning), distributed storage (S3, HDFS [10], GoogleFileSystem), frameworks for distributed compute (Hadoop [10], Spark [11]) and Deep Learning (PyTorch [12], MxNet [13], TensorFlow [14]). These tools make it easy to train ML models. However, for the largest,
most data intensive ML systems it is critical to use generic tools efficiently, and sometimes develop custom solutions, to avoid memory, bandwidth or processing bottlenecks. In this paper we present our end-to-end ML system for building an acoustic model based on 1 Million hours of speech in 12 days. This is one order of magnitude more speech data than has been reported in any previously published work in ASR [9].

While many of the techniques and tradeoffs are applicable to other fields, we demonstrate the large-scale challenge here by tackling acoustic model (AM) training for ASR. The AM in the ASR system takes a sequence of acoustic feature vectors as input and produces a sequence of phonetic probability densities as output [15]. At each time step, it produces a posterior probability for each phonetic class. Because of the large variability in speaker characteristics, dialects, accents, and acoustic environments, producing a highly accurate AM requires large amounts of speech data, and the model is typically trained as a phonetic classifier with supervision from annotated speech data. Thus, for each training speech utterance, a human annotator listens and assigns the correct spoken words. However, at the scale of 1 Million hours of speech, human annotation becomes impractical. Both the cost and the logistics required to manage such a large undertaking in a reasonable amount of time are prohibitive. Therefore, here we use semi-supervised training, where only a fraction of the speech data is annotated.

To characterize the scale of the task, 1 million hours is equivalent to a constant stream of speech, 24/7, for 114 years, far more than any human will hear in a lifetime. The number of utterances is on the order of one billion, and since we extract hundreds of feature values for every 30 milliseconds of each utterance, our models process over 1 trillion attributes during training. In terms of raw size for storage, for the commonly used 256 kbps wav formats for audio, 1 million hours of uncompressed audio requires 107 TB of disk space. Building a robust data transformation and training pipeline that handles such data sizes has required close attention to, among other considerations, the failure modes of underlying software, machines, disks and networking channels; limitations of cluster computing frameworks; theoretical algorithmic scaling bottlenecks; and the monetary cost of resources utilized.

For the SSL system discussed in the paper, we employed the student/teacher learning paradigm, specifically focusing on the student learning subsystem. We factored the student learning subsystem into two pipelines: data preparation and model training. Each has unique scaling challenges. Aside from managing a cluster of compute nodes and distributed data storage, the data preparation step performs shuffling and normalization that is non-trivial in a large scale distributed system. The model training on the other hand must address the challenge of large-scale data-parallel neural network training. This is a fast evolving field that has received significant attention in recent years ([16], [17], [18], [19], [20], [21]).

The remainder of the paper is organized as follows: In the next section we give an overview of the system, followed by Section III about the computing infrastructure.
In Sections IV, V, and VI we give more details about the most prominent components of the system; finally, Section VIII reports our results and Section IX concludes.

II. OVERALL SYSTEM
A. Background on the ASR System

1) Signal processing system:
The Alexa family of devices [22] use an array of microphones arranged in different geometries. The signal processing system takes the raw channels and the playback channel as input and returns a single signal for downstream processing by the wake word detector ([23], [24]) and the ASR system ([25], [26]). This system uses beamforming (BF) algorithms to emphasize speech from a desired direction while suppressing audio interference from other directions, and an acoustic echo canceler (AEC) to remove the playback channel from each of the beams.
2) Acoustic and Language Models:
The input to the ASR system used in this work is an audio signal, from which we derive a sequence of fixed-size acoustic vectors ($X_T = x_1, \ldots, x_T$). The ASR system solves the problem of finding the most likely sequence of words ($W_{M_W} = w_1, \ldots, w_{M_W}$) given the sequence of acoustic vectors:

$$\hat{W} \propto \arg\max_{W} \underbrace{p(X \mid W; \Theta_{AM})^{\kappa}}_{\text{AM}} \; \underbrace{P(W; \Theta_{LM})}_{\text{LM}}, \qquad (1)$$

where $\Theta_{AM}$ and $\Theta_{LM}$ are the free parameters of the acoustic and language model, and $\kappa$ balances the impact of the acoustic model against the language model. The AM is based on the standard HMM/deep learning hybrid, and we summarize details relevant to this paper in Section II-B. Other aspects of this system have been described elsewhere ([25], [26], [27], [28]). The LM [29] estimates the a priori probability that the speaker will utter a sequence of words. The decoder is an FST based decoder using an optimized dynamic composition approach. In this work, we use a large n-gram model.
B. Fully Supervised Acoustic Model
We use an HMM-LSTM hybrid. The HMM models low frame rate single state triphone units [30]. States are clustered down to 3,183 senones using phonetic decision trees. The acoustic features consist of 64-dimensional log mel-warped energies computed on audio signals every 10 ms with a 25 ms analysis window. These are stacked three at a time and sub-sampled to a 30 ms advance. A causal mean estimate is computed and subtracted, and finally global mean and variance normalization is applied. To compensate for sub-sampling, features are created at three different offsets for each utterance. The LSTM model is a stack of five layers, each consisting of 768 units, resulting in about 24M parameters. The model has a three-frame look-ahead. The training data is 7,000 hours of labeled US English data. The models are trained first with the cross-entropy criterion (CE), using alignments computed on the labeled data. First, we follow an exponential learning rate decay for ten epochs, with chunked BPTT for greater parallelization efficiency [31]. In this technique, utterances are split into smaller sub-sequence chunks (here, 32 frames) and the sub-sequences are randomized. For each epoch we cycle through a different feature offset. Then the models are fine-tuned using full sequence CE BPTT for two more epochs. Finally, three epochs of the sequence discriminative criterion state-level minimum Bayes risk (sMBR) are applied [32].
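To make the chunked BPTT step concrete, the following is a minimal sketch of splitting utterances into fixed-length sub-sequence chunks and randomizing them; the 32-frame chunk size matches the recipe above, while the function and data layout are our illustration, not the paper's implementation:

```python
import random

def chunk_and_shuffle(utterances, chunk_size=32, seed=0):
    """Split each utterance (a list of frames) into fixed-size chunks for
    chunked BPTT, then shuffle all chunks across utterances."""
    chunks = []
    for frames in utterances:
        for start in range(0, len(frames), chunk_size):
            chunk = frames[start:start + chunk_size]
            if len(chunk) == chunk_size:  # the tail could be padded; we drop it here
                chunks.append(chunk)
    random.Random(seed).shuffle(chunks)
    return chunks
```

Shorter chunks break long temporal dependencies but allow many more independent sequences per mini-batch, which is what yields the parallelization efficiency noted above.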
C. Semi-Supervised Learning
SSL has a long history in ASR ([33], [34], [35]). Self-training is the most commonly used approach: typically there is a smaller labeled dataset and a much larger unlabeled dataset. The labeled data is used to train a seed model from a powerful model family, which is used to decode the unlabeled data at the second stage (often large beam sizes are used). The most reliable hypotheses are selected based on confidence measures [36], and the speech data with the selected hypotheses are used for re-training the AM.
Fig. 1: SSL Teacher Subsystem: Training the teacher model on transcribed data. We also show the distributed trainer "GTC", of which we will discuss more in Section VI.

1) Overview of the SSL System:
Our approach to SSL was to employ the student/teacher learning paradigm, which avoids explicitly modeling confidence scores, thus taking the ASR decoder out of the SSL recipe. The full teacher and student subsystems are shown in Figures 1 and 2. The teacher and student models output probability distributions over senones; the learning objective optimizes the CE loss between these two distributions. The teacher model is not bound by the same constraints as the runtime system where the student will be deployed. Therefore, we can use a more powerful model for the teacher. In our case we used a bidirectional LSTM model which has access to both the past and future audio. Note that this teacher model is more accurate than a production system where live audio is streamed and we cannot use information from the future. We can also use a larger teacher model without incurring compute cost or latency in the live production system. Apart from the difference in model family, the training of the teacher on the labeled data follows the same recipe as the regular LSTMs, discussed in Section II-B.
Fig. 2: SSL Student Subsystem: Training the student model on 1M hours of untranscribed data using the student/teacher methodology. This figure also illustrates how the data and model pipeline subsystems fit in the overall system. Also shown in the figure is the distributed trainer "BMUF", of which more will be described in Section VI.
We experimented with many different ways to utilize both the annotated and un-annotated data. The final training recipe periodically inserts the annotated data into the student training, interspersed with the un-annotated data. We refer to this method as scheduled learning. However, we use only the annotated data in the last steps of the training recipe. Those steps use sequence training, which is more sensitive to label errors. A fuller discussion of these modeling challenges and the solutions is presented in [37], and we summarize this in Section V-D.

The focus of this paper is a description of the system level challenges in realizing this large scale system. Over the next few sections, we summarize these challenges, outline the resources at our disposal and elaborate the design tradeoffs.
2) SSL System Challenges:
In Section II-C1 we discussed the factorization of the SSL system into teacher and student pipelines. The actual realization of the system is more involved; for simplicity of discussion we factor it into two main components: data pipeline and model pipeline. Each has its own scaling and efficiency challenges.
SSL Data Processing Challenges
The major data processing challenges for the SSL pipeline are scalability and robustness. Some computing steps in AM data preparation and training are inherently sequential, and others require cluster inter-worker data transfer; these introduce delays in the processing system. Another challenge is the granularity of a computing step: smaller steps increase the degree of parallelism, but fail to take advantage of joint optimization between steps. In addition, for large scale systems, with source data over 100 TB (alternatively, a billion audio files), and with tasks distributed over several thousand computing cores, the system must be resilient against temporary host and process failures, data corruption and network timeout issues. Finally, to make the system viable in practice, it is necessary to do so within tight time and infrastructure budgets.
SSL Modeling Challenges
For the modeling subsystem, accuracy and scalability are the most important challenges. From an accuracy standpoint, data selection and filtering for SSL is a question to consider; several sampling strategies have been previously suggested [38]. In addition, SSL with self-training requires good confidence measures, with several previous proposals ([36], [39]). Another challenge with models which have high memorization capability, such as LSTM AMs, is that label quality becomes even more important [8]. A further challenge is applying sequence discriminative training, where label errors have a larger detrimental effect ([40], [41]). From a scalability standpoint, for the scale of data we consider in this paper, an efficient inference mechanism to generate training targets is an important constraint. Also, the model training must address the challenge of large-scale data-parallel neural network training.

III. ML TRAINING INFRASTRUCTURE
In this section we describe our training infrastructure in terms of compute, storage, and the ML software.
A. Elastic Computing
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. Our pipelines were deployed on EC2 instances. EC2 provides a wide selection of instance types optimized to serve different types of workloads. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity and give the flexibility to choose the appropriate mix of resources for our jobs. We used three different instance types across our pipelines depending on the characteristics of the task:
• P3: p3.16xlarge hosts are general purpose GPU instances. The instance type hosts 8 NVIDIA Tesla V100 GPU devices per worker. We used P3s for steps that could take advantage of them, such as model training and inference to generate targets.
• X1: x1.32xlarge machines are dense compute instance types, with 128 cores in their Intel Xeon E7-8880 v3 processors and up to 1,952 GB of DRAM-based memory. We used these hosts for computing steps that have significant data transfer between workers in a cluster. Using X1s reduces the amount of data being exchanged across the network between the workers.
• C4: c4.4xlarge instance types have 16 Intel Xeon Platinum processors and were used for highly parallel and independent steps in the pipeline, where computing speed is the main concern.
B. Distributed Storage
AWS S3 is a highly available, durable, and scalable key-value data store that does not pose practical limitations on object sizes. S3 was used as the distributed store for audio, features and targets produced by the pipeline, and for the teacher and student models during training. We use three APIs/operations to access objects in S3 throughout our pipelines: PUT Object, GET Object, and HEAD Object. Our data layout and cluster configuration were driven by the following considerations (a sketch of the three operations follows the list):
• S3 has an overhead of around 200 milliseconds on reused connections for the three operations.
• Provided that we are under the network throughput limit of an instance type, and S3 buckets are appropriately partitioned, the average bandwidths for PUT Object and GET Object operations are 50 MB/s.
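These numbers explain why the pipeline aggregates many utterances into large objects: the roughly 200 ms per-request overhead is then amortized over megabytes of payload. As a minimal sketch (assuming the standard boto3 client, which the paper does not specify, and a hypothetical bucket and key layout):

```python
import boto3

s3 = boto3.client("s3")       # reuse one client so connections are kept warm
BUCKET = "ssl-pipeline-data"  # hypothetical bucket name

# PUT Object: write one large aggregated shard rather than many small files,
# amortizing the ~200 ms per-request overhead over the payload.
with open("shard-00042.ark", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key="features/shard-00042.ark", Body=f.read())

# HEAD Object: cheap metadata check (existence, size) without a download.
size = s3.head_object(Bucket=BUCKET,
                      Key="features/shard-00042.ark")["ContentLength"]

# GET Object: stream the shard back; at ~50 MB/s a 500 MB shard takes ~10 s,
# so transfer time, not the fixed request overhead, dominates.
body = s3.get_object(Bucket=BUCKET,
                     Key="features/shard-00042.ark")["Body"].read()
```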
C. Cluster Computing
Apache Spark [11] is an open source cluster computing framework that provides a concise programming model for processing large datasets with implicit parallelism and fault tolerance. Central to the Spark programming model is the concept of a Resilient Distributed Dataset (RDD), a distributed collection of records spread over many partitions. Spark provides abstractions to operate on RDDs, and the unit of parallelism is an RDD partition. A common pattern when using Spark is to integrate data loading and staging from distributed file system implementations. However, we used Spark for its fault tolerant model in our iterative algorithms, but relied directly on S3 to persist data for jobs. In our Spark jobs, an RDD starts with a list of objects in S3 that serve as the unit of parallelization. An alternative would have been to use an S3 based implementation of Hadoop's distributed file system interface, which would internally call different S3 operations to load records from S3 files into the RDD. However, to scale effectively, relying on Spark and S3 directly was an early design decision in this work. There are a few fundamental transformations permitted on RDDs such as map, filter, join, and reduce. A key observation here is that operations that perform a shuffle of the elements or involve a repartition of the RDD involve data movement across workers and cause bottlenecks when working with large datasets.

For scheduling Spark jobs we use the stand-alone Spark default FIFO scheduler: by design our files were of roughly equal size and we have a far greater number of files than CPU cores, so the default scheduler works well. However, the default scheduler introduced complexity in target generation, where we need to schedule tasks on GPUs instead of on CPUs. For such tasks we implemented a custom scheduler.
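The pattern of parallelizing over S3 object keys rather than over file-system records can be sketched as follows (a minimal PySpark illustration; the bucket name and the per-shard transform are hypothetical, not the paper's code):

```python
import boto3
from pyspark import SparkContext

BUCKET = "ssl-pipeline-data"  # hypothetical

def process_keys(keys):
    """Runs once per RDD partition; each task fetches its own shards from S3."""
    s3 = boto3.client("s3")   # create the client inside the task, not the driver
    for key in keys:
        shard = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        yield key, len(shard)  # stand-in for real feature/target processing

sc = SparkContext(appName="ssl-data-pipeline")
shard_keys = ["features/shard-%05d.ark" % i for i in range(200000)]

# The RDD holds only the S3 key list; bulk data never passes through the driver.
results = (sc.parallelize(shard_keys, numSlices=4096)
             .mapPartitions(process_keys)
             .collect())
```

Because the RDD contains only keys, Spark's fault tolerance re-runs a failed task by re-fetching its shards from S3, with no lineage over the bulk data itself.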
D. Machine Learning Software
We made use of the open source ASR toolkit Kaldi [42] for extracting features from audio, computing and applying normalization techniques, and serializing data derived from utterances. We use an in-house distributed deep-learning training toolkit [16] for training the acoustic models that has been optimized for the task. We investigated two types of distributed training: Gradient Threshold Compression (GTC) [16] and Blockwise Model Update Filtering (BMUF) [21]. We discuss this in more detail in Section VI.

IV. THE DATA PIPELINE
For building acoustic models in the large SSL framework, features computed from the audio and their corresponding machine generated targets need to be generated in a scalable and robust fashion. This involves composing a sequence of several highly scalable steps, which we refer to as the data pipeline. Spark provides the resiliency and fault tolerance needed to run some of these steps, whose execution times span days. Further, we use S3 as the backing store for some of the intermediate artifacts. We built the pipeline iteratively to aid rapid profiling, fine-tuning, and experimental turnaround. The pipeline is designed to be elastic so that it can scale to even larger datasets.
A. Design Principles
For a system operating at this scale, the design was a crucial element. The design of the pipeline follows these principles.
1) Optimize input files for parallelism early in the pipeline:
Our first step determines how many files the pipeline will consume, and how data will be organized within the files to optimize processing in the future steps. For example, by aggregating and sharding audio based on speaker early in the pipeline, we were able to avoid costly data transfer in later steps, such as in the feature normalization by speaker algorithm that occurs during feature extraction. The trade-off involved in fixing the file partitions early is that we limit the flexibility of reusing intermediate data artifacts for other experiments and pipelines.
2) Avoid distributed algorithms that have high inter-node I/O:
Spark group-by, join, and shuffle operations are I/O intensive, requiring heavy data transfer between nodes in the cluster, and we avoid them where possible. Consequently, we do not perform utterance level shuffle operations to provide feature randomization, as we had run into its scaling limitations even when working with smaller data sets. Instead we shuffle elements hierarchically during feature file generation. This results in features being less uniformly randomized across all our files. However, we did not observe a degradation in accuracy improvements when increasing the sizes of our datasets in our experiments using the hierarchical shuffle.
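A minimal sketch of such a two-level shuffle is shown below; the loader and writer callbacks are hypothetical placeholders for the pipeline's file I/O:

```python
import random

def hierarchical_shuffle(file_keys, load_utterances, write_file, seed=0):
    """Two-level shuffle: randomize the global order of feature files, then
    shuffle utterances within each file. No cluster-wide shuffle is needed,
    since each file is processed independently by whichever worker owns it."""
    rng = random.Random(seed)
    rng.shuffle(file_keys)                      # level 1: global file order
    for out_index, key in enumerate(file_keys):
        utterances = load_utterances(key)       # level 2: within-file order
        rng.shuffle(utterances)
        write_file(out_index, utterances)
```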
3) Aggregate to nearly equal sized files:
Working with larger aggregated files reduces the total number of files; this reduces the number of S3 interactions, each of which has an unavoidable latency overhead. The nearly equal file size property ensures that Spark's default FIFO job scheduler provides a near optimal processing time. We note that, files being our unit of parallelization, reducing the total number of files lowers the ceiling of parallelization, but in practice we are not close to this limit.
4) Perform mini-batch operations locally on a CPU/GPU:
Most pipeline steps rely on invoking local, external processes per CPU/GPU on audio data to perform a transformation function. By operating on mini-batches instead of individual data elements, we reduce resource contention and amortize the process start time over the mini-batch. In addition, mini-batches on GPUs take advantage of low level GPU parallel primitives, giving a further performance boost. The trade-off with batches is the increased software complexity in implementing batch interfaces and handling failures during processing of individual elements in a batch.
5) Consolidate steps:
At this scale, the aggregate S3 operations for a pipeline step can take substantial time. By reducing the total number of steps, we eliminate entire file set transfers, and in turn reduce end-to-end time. However, this involves a cost trade-off in merging steps. If steps that can leverage cheaper compute are merged with a step that requires more expensive compute, the entire set of merged steps may now need to execute on the more expensive compute nodes. Still, we collapsed several steps in feature extraction to a single step with significant savings.
B. Steps in the Data Pipeline
Following the above principles, we implemented a pipeline to produce the data set for the 1 Million hour training. It consists of the following steps.
1) Data selection:
Prior to any job being run, all potential audio files are stored as individual files in S3. Metadata associated with these individual utterances, such as an anonymized speaker id, duration, creation timestamp, locale, and the S3 location of the utterance files, are stored as JSON records in aggregated files in S3. The aggregated files are partitioned vertically on time interval and horizontally on speaker id. Since our storage systems could have utterances whose duration in aggregate would be more than a million hours, our first step was to select utterances for training. We followed a simple strategy for this across all data: over a period spanning several months, we uniformly selected 1 million hours of audio. The random selection is performed by first loading the entire utterance metadata set for the duration into an RDD, followed by sampling records in the RDD using a mild oversampling strategy based on average utterance length statistics. Once selected, we stored relevant metadata in S3. The output files from this step were stored as 200,000 nearly equal sized files in S3 (about five hours of audio each). Data from a speaker is grouped into the same file; we retain this partitioning scheme for the rest of the steps in the pipeline. Although this step does need inter-node data transfer for the grouping, the total data being handled is much smaller (200 GB uncompressed); it fits into a single X1 instance and poses no immediate scaling bottleneck.
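The oversampling step can be sketched as follows (our illustration of the idea; the helper name, the oversampling factor, and the exact trimming strategy are assumptions, not the paper's code):

```python
def select_training_hours(metadata_rdd, target_hours, avg_utt_hours,
                          oversample=1.05, seed=42):
    """Uniformly sample utterance metadata so the expected total duration
    mildly overshoots the target; the surplus can be trimmed exactly later."""
    total_utts = metadata_rdd.count()
    wanted_utts = (target_hours / avg_utt_hours) * oversample
    fraction = min(1.0, wanted_utts / total_utts)
    return metadata_rdd.sample(withReplacement=False,
                               fraction=fraction, seed=seed)

# e.g.: selected = select_training_hours(meta_rdd, target_hours=1_000_000,
#                                        avg_utt_hours=0.002)
```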
2) Feature extraction:
For each utterance stored across the 200,000 files from the previous step, we fetch audio staged in S3 and group them based on speaker. We then sort this audio by its creation timestamp to provide an audio stream per speaker; we extract features for every frame, one speaker at a time. The algorithm also performs a causal mean normalization over all the audio for a speaker. The total size of the features at this stage is around 135 TB. For training deep learning models, features need to be shuffled across the entire file set. We developed a hierarchical shuffling strategy that shuffles both the global ordering of the files and the features corresponding to the utterances within each of the 200,000 files. We then normalize features using the global statistics: zeroth, first and second order statistics. Statistics for each shuffled file are computed in parallel, and then aggregated across files. Finally, we subsample the features to a lower frame rate (33 Hz), while ensuring no information loss by splicing subsampled frames.
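The global normalization statistics are a natural map-reduce: per-file accumulators of the zeroth, first and second order sums are computed in parallel and then merged. A minimal NumPy sketch (our illustration, assuming a hypothetical RDD of per-file feature matrices):

```python
import numpy as np

def file_statistics(features):
    """Per-file accumulators: frame count, per-dimension sum, sum of squares."""
    x = np.asarray(features, dtype=np.float64)      # shape: (frames, 64)
    return x.shape[0], x.sum(axis=0), (x * x).sum(axis=0)

def merge(a, b):
    """Associative merge, so files can be reduced in any order."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

# n, s1, s2 = feature_files_rdd.map(file_statistics).reduce(merge)
# mean = s1 / n
# std  = np.sqrt(s2 / n - mean ** 2)   # applied as (x - mean) / std
```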
3) Target generation:
We use a trained teacher model to generate training targets for the student model. Target generation is more efficient on GPUs with batching. As the standalone Spark scheduler we used in our data pipeline could not efficiently distribute tasks on multi-GPU machines, we implemented a custom local scheduler for this step to distribute the load across all available GPUs in the cluster.
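One simple way to realize such a local scheduler is to pin one worker process per GPU and hand out batches of shards round-robin; the sketch below is our illustration of that idea (the inference call is a placeholder), not the paper's scheduler:

```python
import multiprocessing as mp
import os

def run_batch(args):
    gpu_id, shard_keys = args
    # Pin this fresh process to one device before any CUDA initialization.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # ... load the teacher model once, run batched inference on shard_keys ...
    return len(shard_keys)

def schedule_on_gpus(shards, num_gpus=8):
    """Round-robin shards over the host's GPUs, one process per task."""
    work = [(i % num_gpus, shard) for i, shard in enumerate(shards)]
    # maxtasksperchild=1 gives each task a fresh process, so the device
    # pinning above always happens before the CUDA context is created.
    with mp.Pool(processes=num_gpus, maxtasksperchild=1) as pool:
        return pool.map(run_batch, work)
```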
4) Repartition targets and features:
The final step of the data pipeline is to repartition the files into nearly 500,000 partitions (each with about 2 hours of audio) to improve the efficiency of distributed model training.
C. Implementation Considerations
The pipeline consists of four steps that, starting from audio data in S3, execute sequentially to produce inputs for training. Each step is a cluster job that runs on a fixed pool of identical EC2 instances and takes as input the output from one or more of the previous steps and a cluster configuration. Inputs and outputs are lists of objects in S3. The steps are modular and can be invoked independently given the inputs required for that step. Apart from the advantages of modularity and logical separation of concerns, another reason for this design is the large variance in compute resources for individual steps; separating the steps based on the computing needs per step allowed us to be more efficient. Cluster configuration for each step includes:
• EC2 instance types and the number of instances of that type to be used.
• S3 resource permissions for read and write operations on S3 objects.
• Software dependencies for the job, each of which gets deployed on the entire Spark cluster including the master node and the slave nodes.
• Spark configuration for the job, such as the number of executors per node, the number of cores to allocate per executor and the Java Virtual Machine (JVM) settings for each executor.

V. THE MODEL PIPELINE
A few design decisions were critical not just for performance, but also for relatively fast experiment turnaround time, as well as to be able to build a model on 1 Million hours efficiently. In this section we report the key ML design choices in our system.
A. Student-Teacher Learning
We used the student/teacher learning methodology [43], [44], [45], [37], thus simplifying the SSL modeling recipe and eliminating the need for a full ASR decoder. For each feature vector, the teacher and the student networks compute posterior probability distributions over senones. While the teacher parameters are frozen, the student network's parameters are estimated by minimizing the cross-entropy loss between the two posterior distributions. The student network structure is identical to the LSTM AM described in Section II, but the teacher networks have five bi-directional LSTM layers, each with 768 units (totaling 78M model parameters); this is nearly 3 times the size of the student network. The training of the teacher network on labeled data uses the same recipe as the regular LSTMs.
B. Confidence Modeling
It has been reported previously that modeling even with unfiltered data can lead to significant WER improvements in the context of SSL ([34], [46]). Furthermore, as neural network technology has improved, the estimated probabilities have become better calibrated [47]. Our hypothesis is that the teacher's posteriors are calibrated well enough to act as the confidence measure for student network training. However, in a traditional self-training system, the language model also provides additional information during decoding; in our SSL system the LM is not present. We speculate that this is partially mitigated by using bi-directional teacher LSTM models, which observe more context than the student to make a frame level decision.
C. Target Generation
Since the senone output distribution is large (a 3,183 dimensional vector), generating targets using the teacher model on-the-fly slows down training. As we parallelize training across multiple GPUs, to reduce network bandwidth and to minimize storage, we store only the k highest valued logits. During the student network training, full posteriors are reconstructed; the missing logits are filled with large negative values. While the reconstruction procedure is lossy, our experiments showed that the probability mass is dominated by a few top posteriors. The hyper-parameter k is selected empirically based on the teacher model family, its model structure, and the amount of data it has been trained on.
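A minimal NumPy sketch of this compression and reconstruction follows (the dtypes and the fill value are our assumptions; the paper only specifies keeping the k highest logits and filling the rest with large negative values):

```python
import numpy as np

def compress_top_k(logits, k=20):
    """Keep only the k largest logits per frame: (indices, values)."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    vals = np.take_along_axis(logits, idx, axis=-1)
    return idx.astype(np.int16), vals.astype(np.float16)  # 3,183 fits in int16

def reconstruct(idx, vals, num_senones=3183, fill=-20.0):
    """Rebuild dense logits; missing entries get a large negative value,
    so they carry negligible probability mass after the softmax."""
    dense = np.full(idx.shape[:-1] + (num_senones,), fill, dtype=np.float32)
    np.put_along_axis(dense, idx.astype(np.int64),
                      vals.astype(np.float32), axis=-1)
    return dense
```

With k = 20 of 3,183 senones, the stored target is well under 1% of the dense posterior size, which is what makes staging the targets in S3 tractable.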
D. Scheduled Learning

Although most of the data used for training is unlabeled, we found that using the limited labeled data can be useful in obtaining performance improvements. Our learning algorithm interleaves parameter estimation on unlabeled and labeled data, with slightly higher learning rates on the labeled data. Given the size of the unlabeled data, our design was to perform just one epoch through it while visiting the labeled data multiple times. We divided the unlabeled data into a fixed number of sub-epochs, with a sub-epoch defined as 55,000 hours. We decayed the learning rate as we ingested unlabeled data through the sub-epochs, following an exponential learning rate decay. After each sub-epoch through the unlabeled data, we perform CE training on the labeled data, with a rotation through the feature offsets (please refer to Section II). As described in Section II we employ sequence chunked BPTT for training speed; we apply chunked training for the first 15 sub-epochs, and then perform fine-tuning during the last three sub-epochs.
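The overall loop can be sketched as follows; the trainer interface, the initial learning rate, the decay factor and the labeled-data boost are hypothetical placeholders around the schedule described above:

```python
def scheduled_learning(trainer, unlabeled_subepochs, labeled_data,
                       lr0=1e-3, decay=0.8, labeled_boost=1.25):
    """One pass over the unlabeled sub-epochs (~55,000 h each), revisiting
    the labeled set after every sub-epoch with a slightly higher rate."""
    lr = lr0
    for i, sub_epoch in enumerate(unlabeled_subepochs):
        chunked = i < 15                     # chunked BPTT first, then full-sequence
        trainer.train(sub_epoch, lr=lr, chunked=chunked)  # CE vs. teacher targets
        offset = i % 3                       # rotate through the feature offsets
        trainer.train(labeled_data.with_offset(offset),
                      lr=lr * labeled_boost, chunked=chunked)
        lr *= decay                          # exponential learning rate decay
```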
E. Sequence Training for SSL
Sequence discriminative training of a deep learning AM often yields large WER improvements (commonly, around 10% relative [48]). However, discriminative training is a difficult problem for SSL ([49], [40]), since the discriminative loss function can be sensitive to noisy references during training. Our decision was to perform sMBR training of the student model only on labeled data. Previous work [50] indicates that the accuracy gains may be relatively small in such a setup. However, we hypothesized that this result was likely due to a relatively small labeled dataset, and that using the full 7,000 hour labeled data in this study could still yield large gains from sequence training.

F. Distributed Training
Identifying a good approach to performing distributed training of the student model was a key element of our design, exploring the tradeoff between scalability and accuracy. We studied Gradient Threshold Compression (GTC) [16] and Blockwise Model Update Filtering (BMUF) [21], of which we discuss more in Section VI.

VI. DISTRIBUTED TRAINING
For large-scale neural network training, distributing the workload across many GPUs is required to produce a trained model in a reasonable time. Here we will discuss data-parallelism, where different worker nodes process different input data, but share the model that is trained. The widely used stochastic gradient descent (SGD) optimization technique (e.g. [15], [51]) has a serial aspect to it that makes it challenging to scale SGD to a large number of workers. Let us assume that the standard technique of using a "mini-batch" is used. A mini-batch is a small batch of data-points that are processed efficiently on a single GPU. Basic data-parallelism then involves computing mini-batches on all workers, aggregating their gradients and updating the shared model. However, by increasing the number of compute nodes, the effective (aggregated) mini-batch size is increased linearly, which has been shown to produce lower accuracy on the validation and test datasets, and generally reduces the model convergence rate [52], [53]. Several techniques can be employed, such as adjusting the learning rate [54], [52], using a warm-up phase (e.g., [16]), etc., which can increase the upper bound on the workable effective mini-batch size, but fundamentally there still remains an obstacle to scaling to large GPU clusters.

A specific scaling challenge is communicating gradients between workers, which requires high bandwidth; as the size of the model or the number of workers is increased, this can lead to a severely communication-bounded algorithm. This also depends on whether workers are on the same host or on different hosts, since the cost of communication between different hosts is typically much higher. In the case of a larger number of workers, the cost of communication can thus dominate the total training time. In particular, for cases where the ratio of compute time to communication is low, having a high bandwidth interconnect is necessary to reduce the overall time spent in communication. Empirically we find that with high-end compute nodes such as AWS p3.16xlarge, which contains 8 NVIDIA V100 GPUs and provides data transfers up to 300 Gbps across peer GPUs on the same host, data parallel SGD can scale almost linearly within a single host. Algorithms such as ring-allreduce [55] and hierarchical ring-allreduce [56] are used, which aim to utilize the available bandwidth optimally among compute devices. However, due to the limited bandwidth of 25 Gbps across hosts on these instances, the scalability of the training beyond a single host is severely affected.

Several techniques have been developed to reduce the bandwidth required in gradient communication. The early works of [17] and [16], which introduced quantization and compression by gradient thresholding, have later been refined and used in many contexts, e.g. [18], [19], [20]. These techniques reduce the amount of data which needs to be communicated between workers to reduce overall communication time. This increases the limit on the number of GPU worker nodes that can be used in parallel without rendering the algorithm communication-bound.

Another well-known general technique to scale distributed training is based on the concept of Model Averaging (MA). In this technique each worker updates a local model based on many mini-batches from the dataset without communicating with other workers. The model is only synchronized across workers after some interval of time or a specified number of mini-batches. At that time, all model weights are averaged across workers and synchronized. Because of the infrequent synchronization, this approach can scale training data throughput almost linearly. However, in its basic form, it suffers from reduced model training convergence because of non-linear divergence between local models that is not well matched by averaging. A variant, Blockwise Model Update Filtering (BMUF) [21], mitigates this issue significantly.

In this paper we compare and contrast one method from each of the above mentioned data parallel approaches, namely gradient averaging and model averaging. We selected synchronous SGD with gradient threshold compression (GTC) [16] and BMUF [21], which both try to address the issue of scaling data parallel training.
A. Gradient Threshold Compression
In GTC, instead of sending the entire gradient tensor for each trainable weight, only gradient elements whose absolute magnitude is greater than a constant, here referred to as the gradient-threshold (τ), are sent to other workers. This results in a very sparse gradient update, typically reducing the gradient size by several orders of magnitude. Each worker communicates the sparse update to all other workers and conversely receives all sparse updates from other workers. The received sparse gradient updates are aggregated and weights are updated based on the aggregate. The residual gradients which are not sent to other workers are aggregated locally for later iterations. In a naive implementation, a sparse update can be represented by two numbers, an integer element index and a floating point number. However, this can be compressed further by quantizing the gradient and packing the quantized gradient and integer index into a single 32-bit integer field. In this work, we use 1-bit quantization [17]. Thus, each worker simply sends gradient deltas of ±τ. This simple coding scheme further compresses the update by 2x. The pseudo-code for this algorithm and a more comprehensive discussion can be found in [16]. The technique can be applied to synchronous as well as asynchronous variants of SGD; however, we select the synchronous variant to ensure reproducibility.
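The per-worker compression step can be sketched in NumPy as follows (our illustration of the scheme in [16], [17]; the exact packing layout is an assumption, as the paper only states that the 1-bit sign and the index share one 32-bit field, and the learning-rate handling is simplified):

```python
import numpy as np

def compress_gradient(grad, residual, tau):
    """Threshold the residual-corrected gradient; return a packed sparse
    update of +/-tau deltas and the new local residual."""
    acc = grad + residual                       # fold in unsent gradient mass
    mask = np.abs(acc) >= tau
    sent = np.where(mask, np.sign(acc) * tau, 0.0)
    new_residual = acc - sent                   # remainder stays local
    indices = np.nonzero(mask.ravel())[0].astype(np.uint32)
    signs = (acc.ravel()[indices] > 0).astype(np.uint32)
    packed = (indices << 1) | signs             # 1-bit sign packed with index
    return packed, new_residual

def apply_updates(weights, packed_updates, tau, lr):
    """Aggregate the sparse +/-tau deltas from all workers and apply them.
    Assumes a contiguous weight array, so ravel() returns a view."""
    flat = weights.ravel()
    for packed in packed_updates:               # one packed array per worker
        indices = packed >> 1
        deltas = np.where(packed & 1, tau, -tau)
        np.subtract.at(flat, indices, lr * deltas)  # handles repeated indices
    return weights
```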
B. Blockwise Model Update Filtering

The BMUF algorithm [21] is a variant of model averaging, augmented by considering the model from the previous step. First the initial global model ($W_g$) is broadcast to all workers. The algorithm then iterates two main steps. In the first step, each worker updates its local model ($W_i$) in parallel with its portion of data for a specified number of mini-batches, here referred to as the block-size. This step is called intra-block parallel optimization and requires no synchronisation between workers. In our implementation, each worker simply updates its local model using mini-batch SGD independently. In the second step, which is referred to as the BMUF step, the global model is updated using the following procedure:

$$\overline{W}(t) = \frac{1}{N} \sum_{i=1}^{N} W_i(t) \qquad (2)$$
$$G(t) = \overline{W}(t) - W_g(t-1) \qquad (3)$$
$$\Delta(t) = \eta_t \Delta(t-1) + \zeta_t G(t) \qquad (4)$$
$$W_g(t) = W_g(t-1) + \Delta(t) + \eta_{t+1} \Delta(t) \qquad (5)$$

where the hyper-parameters $\eta$ and $\zeta$ are called block-momentum and block-learning-rate respectively. We used the following formula to set the $\eta$ and $\zeta$ hyper-parameters:

$$\frac{\zeta}{N(1-\eta)} = C \qquad (6)$$

where $C$ is a constant and $N$ is the number of workers. We use the Nesterov block momentum (NBM) scheme proposed in [21]. The evaluation of these two training methods for frame level accuracy and speedup is described in Section VIII-A.
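A NumPy sketch of one BMUF global update with Nesterov block momentum follows, mirroring Eqs. (2)-(5) as reconstructed above; treating each model as a flat parameter vector is our simplification:

```python
import numpy as np

def bmuf_step(local_models, w_global, delta_prev, eta, zeta):
    """One BMUF global update (Eqs. 2-5); each model is a flat vector."""
    w_avg = np.mean(local_models, axis=0)      # Eq. (2): average worker models
    g = w_avg - w_global                       # Eq. (3): block-level "gradient"
    delta = eta * delta_prev + zeta * g        # Eq. (4): filtered update
    w_global = w_global + delta + eta * delta  # Eq. (5): update with NBM look-ahead
    return w_global, delta                     # w_global is re-broadcast to workers
```

Because this step runs only once per block (here, every 100 mini-batches), the communication cost per mini-batch is far lower than in per-step gradient exchange.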
C. Accuracy and Scalability Trade-offs

Both of the above mentioned algorithms provide the flexibility to scale to a large number of workers through hyper-parameters; however, this may come at the cost of reduced accuracy when the number of workers is large. The GTC algorithm can be scaled by controlling the gradient-threshold parameter, which directly affects the sparsity of gradient values. This results in a lower update size and reduces overall communication time. The trade-off of different gradient-threshold values on accuracy and speed is described in [16] for the asynchronous variant of the algorithm. In our studies we found that a gradient-threshold of 8 achieved the best trade-off between accuracy and scalability. For the BMUF algorithm, the block-size hyper-parameter can be used for controlling how often the global model is updated. Setting the block-size to a large number can enable almost linear scaling; however, this results in a considerable drop in accuracy for a large number of workers. This is further analyzed in [21]. In our studies, we found that a block-size of 100 achieved the best trade-off between accuracy and speed.
VII. EXPERIMENTAL SETUP
We discussed system level details in Section II-B. In this section we provide details on our experimental setup, including the training and test data sets. We also discuss the various models, and briefly describe the decoding setup.
A. Training Datasets
For our experiments we used three far-field training datasets drawn from the production data of the Alexa family of devices from the US English locale: (a) a 1,000 hour fully labeled dataset for distributed training experiments, (b) a 7,000 hour fully labeled dataset used for training the teacher model, and (c) a 1 Million hour unlabeled dataset for the SSL model build. The 1,000 hour dataset is a subset of the 7,000 hour dataset.
B. Test Datasets
We used two test sets in this work. The first is a validation test set (referred to as VAL), which consisted of about 1 hour of data to evaluate the distributed trainers. The accuracy on this test set is evaluated using frame classification accuracy, but more importantly, we use it to measure training speed. The second test set (TST) consists of audio data collected in a real room with about 5,000 utterances roughly equally spread among five device placements. The first device placement (DP1) in the center of the room led to the lowest error rate, while the other conditions (DP2 to DP5) were more challenging. On this test set we use word error rate reduction (WERR) to evaluate the model performance. We have also reported on other test sets in [37].
C. Models
In this section we summarize the models relevant to the experimental setup. All the acoustic models (recognition and alignment models) employed in this paper use the hybrid HMM-LSTM approach.
1) Acoustic front-end:
The sampling rate of the speech signal in all the datasets used in this work is 16 kHz. The features for the deep learning models come from an acoustic front-end that outputs 64 dimensional log filter bank features at a frame rate of 33 Hz; Section II-B describes it in greater detail. The phonetic decision tree, however, was built using 40 dimensional features from a different front-end: application of LDA followed by MLLT transforms on 39 dimensional PLP features, including delta and delta-delta.
2) Frame level hard targets:
The triphone HMM states were clustered down to 3,183 senones using a phonetic decision tree built on the 7,000 hour dataset. Alignments from the alignment model were mapped, and several rounds of realignment followed by parameter re-estimation were performed. Using these alignments, an LSTM trained using the CE criterion, discussed in Section II-B, is used to generate frame-level targets for training the supervised models in this work. All the alignment models used the GTC trainer in conjunction with 16 V100 GPU cards.
3) Sequence training:
Sequence discriminative training in this paper used the sMBR [32] criterion with lattice based methods. sMBR training, for all models including the teacher and the student models, was performed on the fully labeled 7,000 hour dataset. It used the GTC trainer with 16 V100 GPU cards. The lattices themselves were shallow, with an average density of around 10, and were stored as compressed files. The space required for storing the lattices for a system was around 6 GB.
4) Teacher model:
The teacher is a bidirectional LSTM (BLSTM) model built on the 7,000 hour fully labeled dataset using the features and frame-level targets discussed in this section. This model has 5 bidirectional LSTM layers, with 768 units per direction in each layer. The model parameters were first estimated by minimizing the frame level cross-entropy criterion. The training strategy discussed in Section II-B was followed: 10 epochs of chunked training, followed by 2 epochs of fine-tuning. Finally, 3 epochs of sequence training were performed.
5) Features and targets for the student model:
Features for the 1 Million hour dataset were generated using the system described in Section IV-B. As in Section VII-C1, this results in 64 dimensional log filter bank features at a frame rate of 33 Hz. Using the trained BLSTM teacher model, frame level soft targets are generated using these features and stored as compressed k-best logits (with k being 20 for the experiments) using the techniques discussed in Sections IV-B3 and V-C.
6) Student model:
The student network is identical to the LSTM architecture described in Section II-B. The student model is trained on the features and targets discussed in the previous subsection with scheduled learning, as discussed in Section V-D. We used the BMUF trainer with 8 p3.16xlarge hosts. Lastly, 3 epochs of sMBR training restricted to 7,000 hours with the GTC trainer were performed.
7) Baseline fully supervised model:
The baseline fully supervised system is an LSTM, identical to the network discussed in Section II-B. This network was trained as discussed in Section II-B, on the 7,000 hour dataset, using the same set of features and targets used for training the teacher model. Lastly, 3 epochs of sequence training were performed on the same labeled dataset.
D. Decoding Setup
All decoding on the TST test set uses a 4-gram statistical language model (LM). The acoustic model scale factor was tuned on this test set. We compare the SSL model against a strong, fully supervised baseline system [37].

VIII. SYSTEM EVALUATION
We evaluate the system along several dimensions. A key metric is the accuracy of the trained models. For this dataset, accuracy is reported as relative word error rate reduction (WERR) (cf. [37], [26], [27], [25], [28]). Here we show a subset of accuracy results, with a more complete picture available in [37]. Other important metrics are the processing time and cost of infrastructure.
TABLE I: Relative frame-level classification accuracy improvements (in %) and training speedup (as a factor) of GTC and BMUF-NBM compared to a 1-GPU SGD trainer. This table illustrates the trade-offs for the two trainers, as a function of the amount of compute (number of GPUs).

Training Method      Number of   Relative Frame-level       Training
                     Workers     Accuracy Improvement (%)   Speedup
GTC                  8            0.41                       7.01
                     16           0.41                      12.53
                     32           0.54                      21.75
                     64           0.27                      16.76
BMUF-NBM             8           -0.27                       7.18
(block-size: 100)    16          -0.06                      13.34
                     32          -0.10                      25.46
                     64          -0.13                      50.91
                     128         -2.46                      97.59
A. Distributed Training
We trained models using GTC and BMUF on the 1,000 hour labeled data, which were then evaluated on the VAL test set. (We used AWS EC2 p2.16xlarge instances for the experiments in Table I, where each instance consists of 16 NVIDIA K80 GPUs.) The training speedup and relative frame-level classification accuracy improvements of the two trainers are tabulated in Table I. Both metrics are relative to a 1-GPU SGD trainer which does not perform any gradient thresholding or quantization. It can be seen that in terms of accuracy, both trainers are within 1% relative compared to the 1-GPU SGD baseline. The BMUF trainer shows a higher degradation of about 3% relative when run on 128 GPUs. Synchronous GTC training scales well up to 32 GPU cards, but its throughput tapers off at higher scale. This is due to the increased cost of communication needed to synchronize gradients per mini-batch as the number of workers is increased. On the other hand, BMUF scales almost linearly with the number of workers, at least in terms of throughput, since model synchronization is much more infrequent. However, it comes at the cost of training convergence rate and reduced accuracy at a higher number of workers. The Nesterov-like momentum updates at the block level recover some of these losses, but empirically we still see some degradation.

B. End-to-End Processing Time
The end-to-end time from data to a fully trained model yields an assessment of our system design. A breakdown of the processing times also gives the constraints that limit our ability to scale the model training to even larger data regimes.
1) Proposed system design:
With the design decisions taken in this work we obtain an end-to-end turnaround time of the student pipeline of 12 days. Figure 3 breaks down the processing time in the different parts of the student training pipeline. As a side note, the initial training of the teacher model and the storing of utterances corresponding to the 1 Million hours of speech in S3 take an additional 4 days (the training of the teacher model itself takes 2 days).
Fig. 3: This figure presents a breakdown of the end-to-end processing times to train a full SSL student subsystem (data preparation, 7.6 days: stream selection, feature extraction, feature normalization, feature subsampling, target generation, data reorganization; training, 4.5 days: cross entropy training, sequence training). The data pipeline scales linearly with more compute and can have even quicker turnarounds; using more compute, the model training can scale further in terms of training times, but it comes at the cost of accuracy.
The SSL data pipeline for generating features and targets for the 1 Million hours of speech took 7.6 days. This pipeline is relatively straight-forward to parallelize. For most steps, since each data partition is independent, adding more hardware can parallelize the step further. We perform distributed computation on Spark, and all our data is staged in S3. Since both systems are known to scale well, the data pipeline can scale nearly linearly by increasing the cluster size. Further algorithmic improvements are possible with more caching, pre-computations and aggregations, though at the cost of more storage. Model training contributes a smaller part of the total time (4.5 days). In this project, we increased the number of GPUs to 64, for a very significant speed-up. Adding further compute by using more GPUs can speed up training linearly, but has diminishing returns in terms of accuracy (as presented in Table I).
2) Comparison to fully supervised system:
We close the discussion on processing times by making a note on the fully supervised AM described in Section II-B; the model was built as presented in Section VII-C7. This involved 2 steps:
1) Feature extraction and target generation: This was the bottleneck: this system was impractical for feature extraction and target generation at the scale of 1 Million hours. For the 7,000 hour data, the feature and target generation took nearly 2 weeks.
2) Model building including the CE and sMBR training stages: Model building was done with the GTC trainer using 16 GPU cards for both the CE and sMBR stages. This took about 21 hours.
(It might be surprising that the training of the teacher model on 7,000 hours takes 2 days, while the training of the student model takes "only" 4.5 days; but the teacher is a BLSTM, using the GTC trainer employing 16 GPUs, and we perform 12 epochs of training on the 7,000 hours for the teacher. The student model is an LSTM (trains 2x faster), using the BMUF trainer employing 64 GPUs (about 4x faster), doing 1 epoch through the data.)
C. System Accuracy
The final results, including sequence discriminative training with the sMBR loss function, are reported in Table II on the TST test set. We compare against the baseline fully supervised model trained on the 7,000 hour data (please refer to Section VII-C7 for more details on this model). The results are reported as relative WER improvements.

TABLE II: On the TST test set (in DP1 to DP5), relative WER reduction (%) of the final 1 Million hour model against a baseline LSTM AM that is sMBR trained on the fully labeled 7,000 hour training data.

Test Conditions    WERR (%)
DP1                .
DP2                .
DP3                .
DP4                .
DP5                .

Except for device position one (DP1), relative WER reductions are all greater than 10%, indicating that the improvement is greater for harder conditions. We take this as validation that large scale SSL can not only significantly improve accuracy overall, but also yield significant improvement in more challenging conditions.

IX. CONCLUSION
In this paper we presented an in-depth discussion of the design of an efficient end-to-end SSL system starting from 1 Million hours of raw audio and its metadata. Following the student/teacher paradigm for SSL, we focused on the student subsystem, factoring it into two main pipelines: data preparation and model training. To address the challenges of scalability and robustness, our discussion of the data pipeline laid out the key bottlenecks and proposed corresponding design principles. These principles were then used in decomposing the pipeline into smaller steps to efficiently address the challenges. The model pipeline, including the distributed trainer, addressed the twin challenges in ML design for this problem: accuracy improvement and scalability. Scaling posterior generation with k-best selection, using scheduled learning to leverage transcribed data, and restricting sequence training to transcribed data were among the methods we presented. Our system evaluations showed that even without extensive hyper-parameter tuning, we can obtain relative WER improvements in the 10 to 20% range, with much higher gains in more difficult conditions. The end-to-end processing time of this SSL system was 12 days, and several components in this system can trivially scale linearly with more compute resources for further speed-up.

ACKNOWLEDGMENT
We would like to thank Xing Fan for providing the setup for baseline models; Harish Mallidi for help with the decoding infrastructure; Oleg Rybakov and Tianjun Ye for help with early debugging with our deep learning toolkit.

REFERENCES

[1] F. Jelinek, "Some of my best friends are linguists," in LREC, 2004.
[2] R. G. Leonard and G. Doddington, "TIDIGITS speech corpus," Texas Instruments, Inc., 1993.
[3] J. S. Garofolo, L. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus," NASA STI/Recon technical report, vol. 93, 1993.
[4] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. of Workshop on Speech and Natural Language, 1992.
[5] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in Proc. of ICASSP.
[6] C. Cieri, D. Miller, and K. Walker, "The Fisher Corpus: a resource for the next generations of speech-to-text," in LREC, vol. 4, 2004, pp. 69–71.
[7] D. Amodei, S. Ananthanarayanan et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proc. of ICML, 2016.
[8] Y. Huang, Y. Wang, and Y. Gong, "Semi-supervised training in deep learning acoustic model," in Proc. of Interspeech, 2016.
[9] H. Soltau, H. Liao, and H. Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," arXiv preprint arXiv:1610.09975, 2016.
[10] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1–10.
[11] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin et al., "Apache Spark: a unified engine for big data processing," Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
[12] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.
[13] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," CoRR, vol. abs/1512.01274, 2015. [Online]. Available: http://arxiv.org/abs/1512.01274
[14] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[16] N. Strom, "Scalable distributed DNN training using commodity GPU cloud computing," in Proc. of Interspeech, 2015.
[17] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs," in Proc. of Interspeech, 2014.
[18] D. Alistarh, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: randomized quantization for communication-optimal stochastic gradient descent," CoRR, vol. abs/1610.02132, 2016. [Online]. Available: http://arxiv.org/abs/1610.02132
[19] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," CoRR, vol. abs/1712.01887, 2017. [Online]. Available: http://arxiv.org/abs/1712.01887
[20] C. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan, "AdaComp: Adaptive residual gradient compression for data-parallel distributed training," CoRR, vol. abs/1712.02679, 2017. [Online]. Available: http://arxiv.org/abs/1712.02679
[21] K. Chen and Q. Huo, "Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering," in Proc. of ICASSP, 2016.
[22] R. Prasad, "Spoken language understanding for Amazon Echo," 2015, keynote in Speech and Audio in the Northeast (SANE).
[23] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni, "Multi-task learning and weighted cross-entropy for DNN-based keyword spotting," in Proc. of Interspeech, 2016, pp. 760–764.
[24] R. Maas, S. H. K. Parthasarathi, B. King, R. Huang, and B. Hoffmeister, "Anchored speech detection," in Proc. of Interspeech, 2016.
[25] S. H. K. Parthasarathi, B. Hoffmeister, S. Matsoukas, A. Mandal, N. Strom, and S. Garimella, "fMLLR based feature-space speaker adaptation of DNN acoustic models," in Proc. of Interspeech, 2015.
[26] S. Garimella, A. Mandal, N. Strom, B. Hoffmeister, S. Matsoukas, and S. H. K. Parthasarathi, "Robust i-vector based adaptation of DNN acoustic model for speech recognition," in Proc. of Interspeech, 2015.
[27] B. King, I.-F. Chen, Y. Vaizman, Y. Liu, R. Maas, S. H. K. Parthasarathi, and B. Hoffmeister, "Robust speech recognition via anchor word representations," in Proc. of Interspeech, 2017.
[28] L. Mosner, M. Wu, S. H. K. Parthasarathi, R. Maas, A. Raju, K. Kumatani, S. Sundaram, and B. Hoffmeister, "Improve noise robustness of automatic speech recognition via teacher-student learning," in
Proc. of ICASSP ,2019.[29] A. Gandhe, A. R. Rastrow, and B. Hoffmeister, “Scalable language modeladaptation for spoken dialogue systems.” in
Accepted for Proc. of SpokenLanguage Technologies , 2018.[30] G. Pundak and T. N. Sainath, “Lower frame rate neural network acousticmodels.” in
Proc. of Interspeech , 2016.[31] P. Doetsch, M. Kozielski, and H. Ney, “Fast and robust training ofrecurrent neural networks for offline handwriting recognition,” in
Proc.of ICFHR , 2014.[32] B. Kingsbury, “Lattice-based optimization of sequence classification cri-teria for neural-network acoustic modeling,” in . IEEE, 2009,pp. 3761–3764.[33] T. Kemp and A. Waibel, “Unsupervised training of a speech recognizer:recent experiments.” in
Proc. of Eurospeech , 1999.[34] L. Lamel, J.-L. Gauvain, and G. Adda, “Lightly supervised and unsuper-vised acoustic model training,”
Computer Speech & Language , vol. 16,no. 1, pp. 115–129, 2002.[35] J. Ma, S. Matsoukas, O. Kimball, and R. Schwartz, “Unsupervisedtraining on large amounts of broadcast news data,” in
Proc. of ICASSP ,2006.[36] M.-h. Siu, H. Gish, and F. Richardson, “Improved estimation, evaluationand applications of confidence measures for speech recognition,” in
Proc.of Fifth European Conference on Speech Communication and Technology ,1997.[37] S. H. K. Parthasarathi and N. Strom, “Lessons from building acousticmodels with a million hours of speech,” in
Proc. of ICASSP , 2019.[38] J. Pylkk¨onen, T. Drugman, and M. Bisani, “Optimizing speech recogni-tion evaluation using stratified sampling.” in
Proc. of Interspeech , 2016,pp. 3106–3110.[39] Y. Huang, D. Yu, Y. Gong, and C. Liu, “Semi-supervised GMM and DNNacoustic model training with multi-system combination and confidencere-calibration.” in
Proc. of Interspeech , 2013.[40] V. Manohar, D. Povey, and S. Khudanpur, “Semi-supervised maximummutual information training of deep neural network acoustic models.” in
Proc. of Interspeech , 2015.[41] J.-T. Huang and M. Hasegawa-Johnson, “Maximum mutual informationestimation with unlabeled data for phonetic classification.” in
Proc. ofInterspeech , 2008.[42] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel,M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer,and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in
IEEE 2011Workshop on Automatic Speech Recognition and Understanding . IEEESignal Processing Society, Dec. 2011, iEEE Catalog No.: CFP11SRW-USB.[43] J. Li, R. Zhao, and J.-T. a. Huang, “Learning Small-Size DNN withOutput-Distribution-Based Criteria,” in
Proc. of Interspeech , 2014.[44] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in
Proc.of Advances in NIPS , 2014.[45] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neuralnetwork,” arXiv preprint arXiv:1503.02531 , 2015.[46] L. Lamel, J.-L. Gauvain, and G. Adda, “Unsupervised acoustic modeltraining,” in
Proc. of ICASSP , 2002.[47] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration ofmodern neural networks,” arXiv preprint arXiv:1706.04599 , 2017.[48] K. Vesel`y, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminativetraining of deep neural networks.” in
Interspeech , 2013, pp. 2345–2349.[49] J.-T. Huang and M. Hasegawa-Johnson, “Semi-supervised training ofgaussian mixture models by conditional entropy minimization,”
Opti-mization , vol. 4, p. 5, 2010. [50] N. Kanda, S. Harada, X. Lu, and H. Kawai, “Investigation of semi-supervised acoustic model training based on the committee of heteroge-neous neural networks.” in
Proc. of Interspeech , 2016.[51] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke,D. Yu, and G. Zweig, “The Microsoft 2016 Conversational SpeechRecognition System,”
CoRR , vol. abs/1609.03528, 2016. [Online].Available: http://arxiv.org/abs/1609.03528[52] P. Goyal, P. Doll´ar, R. B. Girshick, P. Noordhuis, L. Wesolowski,A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatchSGD: training imagenet in 1 hour,”
CoRR , vol. abs/1706.02677, 2017.[Online]. Available: http://arxiv.org/abs/1706.02677[53] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deeplearning: An in-depth concurrency analysis,”
CoRR
Journal of Parallel and Distributed Computing ,vol. 69, pp. 117–124, 02 2009.[56] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo,Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu, “Highly scalabledeep learning training system with mixed-precision: Training imagenetin four minutes,”