Learned Transferable Architectures Can Surpass Hand-Designed Architectures for Large Scale Speech Recognition
Liqiang He, Dan Su, Dong Yu
Tencent AI Lab, Shenzhen, China; Tencent AI Lab, Bellevue WA, USA
{andylqhe, dansu, dyu}@tencent.com

Abstract
In this paper, we explore neural architecture search (NAS) for automatic speech recognition (ASR) systems. With reference to previous works in the computer vision field, the transferability of the searched architecture is the main focus of our work. The architecture search is conducted on a small proxy dataset, and then the network, constructed from the searched architecture, is evaluated on a large dataset. In particular, we propose a revised search space for speech recognition tasks which theoretically facilitates the search algorithm to explore architectures with low complexity. Extensive experiments show that: (i) the architecture searched on the small proxy dataset can be transferred to the large dataset for speech recognition tasks; (ii) the architecture learned in the revised search space can greatly reduce the computational overhead and GPU memory usage with mild performance degradation; (iii) the searched architecture can achieve more than 20% and 15% (average on the four test sets) relative improvements respectively on the AISHELL-2 dataset and the large (10k hours) dataset, compared with our best hand-designed DFSMN-SAN architecture. To the best of our knowledge, this is the first report of NAS results with a large scale dataset (up to 10K hours), indicating the promising application of NAS to industrial ASR systems.
Index Terms: neural architecture search, speech recognition, transferable architecture
1. Introduction
The performance of ASR systems has been largely boosted by deep learning [1]. The core part of deep learning is to design and optimize deep neural networks. Various types of neural network architectures have been employed in ASR systems, such as convolutional neural networks (CNNs) [2], long short-term memory (LSTM) [3], gated recurrent units [4], time-delayed neural networks [5], feedforward sequential memory networks (FSMN) [6], etc. Some combinations of different architectures have also been proposed to take advantage of their complementary properties, such as CLDNN [7]. Recently, the transformer architecture, which has achieved success in natural language processing (NLP) tasks, has also been widely used in ASR systems [8, 9], demonstrating superior performance compared with state-of-the-art models. Our previous work also proposed a variant model architecture which combined DFSMN with self-attention networks (SAN), and further applied a memory augmenting method on the self-attention layer [10]. In summary, the performance improvement of ASR systems owes much to dedicated hand-designed model architectures.

Figure 1: The convolutional architecture (left) for the computer vision task, and the convolutional architecture (right) for the speech recognition task, as proposed by our work. Abbreviation: M refers to the number of normal cells.

However, designing state-of-the-art neural network architectures requires a lot of expert knowledge and takes ample time. Therefore, there has been a growing interest in developing algorithmic solutions to discover powerful network architectures automatically. The network architectures automatically searched in [11, 12, 13] have achieved highly competitive performance in computer vision tasks, such as image classification and object detection. However, the heuristic search methods with evolution and reinforcement learning techniques require massive computational overheads (3150 GPU days of evolution [13] and 2000 GPU days of reinforcement learning [12]). Several approaches focusing on efficient architecture search have been proposed. Among them, DARTS [14] introduced a differentiable NAS framework to relax the discrete search space into a continuous one by weighting candidate operations with architectural parameters, which achieved comparable performance and a remarkable efficiency improvement compared to previous approaches. As a progressive version of DARTS, P-DARTS [15] was proposed to bridge the depth gap between the network depth of architecture search and architecture evaluation.

Despite the rapid advance of NAS techniques in the computer vision community, there has been very limited research on the application of NAS to ASR systems. Compared with computer vision tasks such as image classification, a major hindrance is that speech recognition is a more complex task in terms of the dimension of the input and output. What's more, for the big data era of speech recognition, a typical amount of training data can be over 10K hours, which amounts to more than 10 million samples.

In this work, we explore the feasibility of NAS for speech recognition tasks on a large dataset. Considering the search cost, we conduct the architecture search on a small proxy dataset, and then the evaluation network, constructed from the searched architecture, is evaluated on the large dataset. We propose a revised search space for speech recognition tasks which theoretically facilitates the search algorithm to explore architectures with low complexity, compared with the DARTS-based search space. Experimental results show that the architecture, discovered in the revised search space on the AISHELL-1 dataset, can achieve more than 20% and 15% (average on the four test sets) relative improvements respectively on the AISHELL-2 dataset and the large (10k hours) dataset, compared with our best hand-designed DFSMN-SAN architecture.

Our contributions can be summarized as follows: (i) We show that the architecture searched on the small proxy dataset has good transferability to the large (10k hours) dataset for speech recognition tasks. (ii) We propose a revised search space, from which the searched architecture achieves a better balance between model complexity and recognition performance. (iii) We show that the searched architecture achieves significant performance improvements on the large dataset, compared with our best hand-designed model architecture.
2. Neural Architecture Search
Different from conventional methods applying evolution or reinforcement learning over a discrete search space, a differentiable network architecture search based on bilevel optimization is introduced in DARTS, which achieves a remarkable efficiency improvement of several orders of magnitude. The categorical choice of one operation is relaxed to learning a set of continuous variables $\alpha = \{\alpha^{(i,j)}\}$, normalized with the softmax function:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha^{(i,j)}_{o})}{\sum_{o' \in \mathcal{O}} \exp(\alpha^{(i,j)}_{o'})}\, o(x) \quad (1)$$

A bilevel optimization is proposed which jointly optimizes the architecture $\alpha$ as the upper-level variable and the network weights $\omega$ as the lower-level variable:

$$\min_{\alpha} \; \mathcal{L}_{val}(\omega^{*}(\alpha), \alpha) \quad (2)$$
$$\text{s.t.} \quad \omega^{*}(\alpha) = \operatorname*{argmin}_{\omega} \; \mathcal{L}_{train}(\omega, \alpha) \quad (3)$$

where $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the training and the validation loss, respectively. Both losses are determined not only by the architecture $\alpha$ but also by the network weights $\omega$.

Although good transferability of the searched architecture has been observed in DARTS, more attention has been paid to the discrepancies between the super-network (the continuous architecture encoding) and the evaluation network constructed from the optimal sub-network (the derived discrete architecture) for the specific task. One of the discrepancies is the depth gap between the depth of the super-network and the evaluation network, which has been proven to cause performance deterioration. As a progressive version of DARTS, search space approximation is proposed in P-DARTS to alleviate the problem of the depth gap by dividing the search process into multiple stages. With each stage forward, the depth of the super-network becomes deeper, while at the same time the number of operations in the search space becomes smaller, which makes the search process with a deeper super-network possible within a limited computation and memory budget. Additionally, search space regularization is introduced to address the "over-fitting" problem brought by the skip-connect operation.
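To make the relaxation in Eq. (1) concrete, the sketch below shows a minimal mixed-operation module in PyTorch-style Python. It illustrates the general DARTS mechanism rather than our exact implementation; the class and argument names are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one edge (i, j): the edge output is a
    softmax-weighted sum of all candidate operations, as in Eq. (1)."""

    def __init__(self, candidate_ops):
        super().__init__()
        # candidate_ops: one nn.Module per operation o in the candidate set O
        self.ops = nn.ModuleList(candidate_ops)

    def forward(self, x, alpha_edge):
        # alpha_edge: architecture parameters alpha^{(i,j)}, shape (len(O),)
        weights = F.softmax(alpha_edge, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

In the bilevel setting of Eqs. (2)-(3), the network weights and the architecture parameters are then optimized alternately on two disjoint data splits, which is also how we split the AISHELL-1 training set in Section 4.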
3. Search Space Revision
In this section, we summarize the characteristics and the improvements of the search algorithm when applying NAS to ASR systems. Based on the architecture in DARTS, there are two modifications specifically made for the speech recognition tasks. For large scale speech recognition, we propose a revised search space that theoretically facilitates the search algorithm to explore architectures with low computational and memory overhead. Besides, the regularization method for the architecture search is discussed.
Following DARTS, we search for convolutional cells as the building blocks and then stack the learned cells to form the final network architecture. As shown in Figure 1 (left), each cell connects to the previous two cells or to the stem convolutions located at the beginning of the network. Cells located at the 1/3 and 2/3 depth of the network are reduction cells (two in total), which are different from the other normal cells. Two modifications for speech recognition are made, as shown in the red dotted boxes of Figure 1 (right). In the modification at the beginning of the network, the acoustic features and their first-order and second-order derivatives are separately assigned to independent channels. In the modification at the end of the network, the average pooling operation is replaced by several fully connected layers, followed by a softmax layer to compute the posteriors. Moreover, there are two reduction cells in the network, each of which reduces the resolution of the feature maps from the previous cells by half; this architecture is also adopted for the speech recognition task, as the lower frame rate technique proposed in [16] has shown its benefit.
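The sketch below illustrates these two modifications in PyTorch-style Python, under the assumption that the static features and their derivatives arrive as separate (batch, time, frequency) tensors; the layer sizes and module names are illustrative placeholders rather than the exact configuration of our networks.

```python
import torch
import torch.nn as nn

class AcousticStem(nn.Module):
    """Static features and their first/second-order derivatives enter as
    three independent input channels before the stem convolution."""
    def __init__(self, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)

    def forward(self, feats, delta, delta2):
        # each input: (batch, time, freq) -> stack along the channel axis
        x = torch.stack([feats, delta, delta2], dim=1)
        return self.conv(x)

class ClassifierHead(nn.Module):
    """Average pooling is replaced by fully connected layers followed by softmax."""
    def __init__(self, in_dim, hidden_dim, num_targets):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_targets),
        )

    def forward(self, x):
        # log-probabilities over the output targets, e.g. for a CTC criterion
        return torch.log_softmax(self.fc(x), dim=-1)
```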
The search space is represented in the form of the convolutional cells, and each cell is denoted as a directed acyclic graph consisting of N nodes and their corresponding edges. Each directed edge between two nodes is associated with the candidate operations. In DARTS, the candidate set in the convolutional cells includes the following operations: [zero, identity, 3x3 max pooling, 3x3 average pooling, 3x3 separable convolution, 5x5 separable convolution, 3x3 dilated separable convolution, 5x5 dilated separable convolution].

Considering the depth gap between the network depth applied in the search and in the evaluation for the speech recognition tasks, the search process of our work adopts the search space approximation method proposed by P-DARTS. Based on the DARTS-based operation space, preliminary architecture searches are carried out on the proxy dataset. One of the searched architectures is shown in Figure 2 (left).
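As a sketch of how a searched cell is evaluated as a directed acyclic graph, the function below composes the mixed-operation edges from the MixedOp sketch above; the node count and edge indexing follow the usual DARTS convention and are assumptions rather than details specified in this paper.

```python
import torch

def cell_forward(s0, s1, mixed_ops, alphas, num_nodes=4):
    """One cell as a DAG: intermediate node k sums the mixed-operation
    outputs from the two cell inputs and all earlier intermediate nodes."""
    states = [s0, s1]
    for k in range(num_nodes):
        # edges feeding node k are stored consecutively in mixed_ops / alphas
        offset = sum(2 + i for i in range(k))
        node = sum(mixed_ops[offset + j](states[j], alphas[offset + j])
                   for j in range(len(states)))
        states.append(node)
    # the cell output concatenates the intermediate nodes along the channel axis
    return torch.cat(states[2:], dim=1)
```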
Figure 2: (a) Normal cell and (c) reduction cell learned in the original search space (denoted as ASRNET-A); (b) normal cell and (d) reduction cell learned in the revised search space proposed by our work (denoted as ASRNET-B).

We explore three issues related to the search space when applying P-DARTS to speech recognition tasks.

First, the search process for computer vision tasks tends to generate architectures with many skip-connect operations, especially on the small proxy dataset, and the derived architectures for evaluation often suffer from performance degradation. However, after searching on a small proxy dataset, we find that the skip-connect operation rarely appears in the final searched architecture for the speech recognition tasks. This is arguably because speech recognition is a more complex task in terms of both the dimension of the input and output, compared with computer vision tasks such as image classification.
Second, the cell architectures learned in the DARTS-based search space that achieve better performance are prone to contain many separable convolution operations. Each separable convolution applies its module list twice, where the module list consists of sequential modules in a ReLU-Conv-BN order. However, such learned architectures, when applied to large scale speech recognition, consume too much computation and GPU memory, which can be prohibitive due to the limitation of GPU hardware.
Last, to eliminate the influence of randomness, the search process in DARTS and P-DARTS should be repeated several times with different seeds to obtain the final architecture with better performance, and this practice is still applicable to the search process for speech recognition. Notably, the search process for speech recognition tends to generate architectures with many average pooling and dilated convolution operations, and obvious performance fluctuations have been observed in the evaluation networks derived by stacking the learned normal cells more times.

Concerning the first two issues, we revise the operation space by replacing the skip-connect operation with the 1x7 then 7x1 convolution [17]. The skip-connect operation has a lower priority in the relaxed search space because of its absence in the final architectures most of the time, so the stability of the search process can be improved by removing this operation. The newly added convolution operation has the following advantages. First, the convolution with the larger convolution kernel increases the receptive field to capture the latent representation of acoustic features, while keeping the number of model parameters as small as possible. Second, the convolution with fewer sequential modules improves computational efficiency and memory overhead. With the revised operation space, the architecture searches for the speech recognition tasks are carried out on the small proxy dataset, and one of the searched architectures is shown in Figure 2 (right). The evaluation network, constructed from the searched architecture, achieves a better trade-off between model complexity and recognition performance. The revised operation space in the convolutional cell includes the following operations: [zero, 3x3 max pooling, 3x3 average pooling, 3x3 separable convolution, 5x5 separable convolution, 3x3 dilated separable convolution, 5x5 dilated separable convolution, 1x7 then 7x1 convolution].

As for the last issue mentioned above, we adopt the search space regularization proposed by [15] to alleviate the problem of obvious performance fluctuations caused by the randomness of the search process. First, operation-level dropout is inserted after each dilated separable convolution and average pooling operation to encourage the algorithm to explore other operations. Second, the regularization rule of architecture refinement restricts the number of preserved average pooling operations in the final architecture to be a constant.
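For reference, the sketch below writes out the two candidate operation sets as Python lists and gives one plausible PyTorch-style definition of the added 1x7 then 7x1 convolution; the exact kernel padding, stride handling, and normalization placement are assumptions, not specifications from this paper.

```python
import torch.nn as nn

# Candidate operation names in the DARTS-based space and in our revised space
# (identity/skip-connect removed, 1x7-then-7x1 convolution added).
ORIGINAL_OPS = [
    "zero", "identity", "max_pool_3x3", "avg_pool_3x3",
    "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
]
REVISED_OPS = [
    "zero", "max_pool_3x3", "avg_pool_3x3",
    "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
    "conv_1x7_7x1",
]

def conv_1x7_7x1(channels):
    """Factorized 1x7 then 7x1 convolution: a large receptive field with a
    single ReLU-Conv-BN pass, i.e. fewer sequential modules than a separable
    convolution, which applies its module list twice."""
    return nn.Sequential(
        nn.ReLU(inplace=False),
        nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3), bias=False),
        nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0), bias=False),
        nn.BatchNorm2d(channels),
    )
```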
4. Experiments
We use AISHELL-1 [18] as the small proxy dataset for the architecture search. The dataset contains 178 hours of Mandarin Chinese speech from 400 speakers, and the 10 hours test set is used for the architecture evaluation. Two bigger corpora are used to verify the transferability of the searched architecture. The first is the AISHELL-2 [19] dataset, which contains 1000 hours of speech data from 1991 speakers. The second is a 10K hours multi-domain dataset [10]. We also augment the AISHELL-1 and AISHELL-2 training data with 2-fold speed perturbation [20] in the experiments. To evaluate the performance of the searched architecture, we report performance on 3 types of test sets which consist of hand-transcribed anonymized utterances extracted from reading speech (1001 utterances), conversation speech (1538 utterances), and spontaneous speech (2952 utterances). We refer to them as Read, Chat, and Spon, respectively. Besides, to provide a public benchmark, we also use the AISHELL-2 development set (2500 utterances, short for DEV), recorded by a high fidelity microphone, as a test set.
Table 1: Token accuracies (Acc) and evaluation costs (Cost) of the small evaluation networks on the AISHELL-1 dataset. Abbreviations: L is the number of cells, C is the initial number of channels.

Small                    Params (M)   Acc (%)   Cost (hours)
ASRNET-A (L=17, C=32)    6.6          93.00     35.2
ASRNET-B (L=17, C=24)    6.8          92.24     17.8

Table 2: Comparison with the DFSMN-SAN architecture on the AISHELL-2 dataset.

Medium                   Params (M)   CER (%)   Rel Imp (%)
DFSMN-SAN                14.4         7.36      -
ASRNET-A (L=32, C=38)    12.1         5.65      23.2
ASRNET-B (L=32, C=30)    14.7         5.79      21.3

We use 40-dimensional log Mel-filterbank features with the first-order and the second-order derivatives. Training utterances are filtered by a maximum frame length of 1024, and the length of each utterance is padded to be 4-frames-aligned to fit the two reduction layers. All the experiments are based on the CTC learning framework and trained with multiple GPUs using BMUF optimization. We use CI-syllable-based acoustic modeling units which include 1394 Mandarin syllables, 39 English phones, and a blank. First-pass decoding with a pruned 5-gram language model is performed with a beam search algorithm using weighted finite-state transducers (WFSTs). Character error rate (CER) results are measured on the test sets. Rel Imp refers to Relative Improvement in Table 2 and Table 3.
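As an illustration of the CTC training criterion only, a minimal loss setup in PyTorch-style Python is sketched below; the blank index, tensor shapes, and reduction settings are assumptions rather than details reported in this paper.

```python
import torch.nn as nn

# CI-syllable modeling units: 1394 Mandarin syllables + 39 English phones + 1 blank
NUM_TARGETS = 1394 + 39 + 1

# assumption: the blank symbol is mapped to index 0
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def compute_loss(log_probs, targets, input_lengths, target_lengths):
    """log_probs: (T, batch, NUM_TARGETS) after log_softmax;
    targets: padded integer label ids; lengths: per-utterance lengths."""
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```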
The training set of the AISHELL-1 dataset is randomly split into two equal subsets, one for learning the network parameters and the other for tuning the architectural parameters. The search process, following [15], is divided into three stages. For each stage, the super-network is trained for 15 epochs, with batch size 4 (for both the training and validation sets) and an initial number of channels of 16. Only network parameters are trained in the first 6 epochs, and both network and architecture parameters are alternately optimized in the remaining 9 epochs. The momentum SGD optimizer with initial learning rate 0.01 (annealed down to zero following a cosine schedule without restart), momentum 0.9, and weight decay 0.0003 is adopted to optimize the network parameters. The dropout probability on the dilated separable convolution and average pooling operations is decayed exponentially, and the initial values are set to 0.1, 0.1, and 0.1 for stages 1, 2, and 3, respectively. The final discovered normal cells are restricted to keep at most 2 average pooling operations. We run the search processes separately in the original (DARTS-based) search space and in the revised search space proposed by our work. Concerning the influence of randomness, the search process is repeated 3 times with different seeds for both the DARTS-based search space and the revised search space. The search process takes around 89 hours on 8 Tesla P40 GPUs.
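The progressive search configuration described above can be summarized as a small configuration sketch; the numeric values are taken from the text, while the dictionary layout and key names are only illustrative.

```python
# Hyper-parameters of the three-stage progressive search on AISHELL-1.
SEARCH_STAGES = [
    # each stage: 15 epochs in total, the first 6 train network weights only
    {"stage": 1, "epochs": 15, "weights_only_epochs": 6, "dropout_init": 0.1},
    {"stage": 2, "epochs": 15, "weights_only_epochs": 6, "dropout_init": 0.1},
    {"stage": 3, "epochs": 15, "weights_only_epochs": 6, "dropout_init": 0.1},
]
COMMON = {
    "batch_size": 4,
    "init_channels": 16,
    "optimizer": "momentum SGD",
    "lr_init": 0.01,            # cosine-annealed to zero, no restart
    "momentum": 0.9,
    "weight_decay": 3e-4,
    "max_avg_pool_in_normal_cell": 2,   # architecture refinement constraint
}
```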
On the AISHELL-1 dataset, the small evaluation networks, stacked [12] with 17 cells, are trained from scratch for 20 epochs with batch size 4. Other hyper-parameters remain the same as the ones used for the architecture search. Based on the recognition performance on the test set of the AISHELL-1 dataset, the final architecture (denoted as ASRNET-A) discovered in the original search space is shown in Figure 2 (left), and the final one (denoted as ASRNET-B) discovered in the revised search space is shown in Figure 2 (right). The initial numbers of channels are 32 and 24 for ASRNET-A and ASRNET-B, respectively. The token accuracies are computed on the test set of the AISHELL-1 dataset. As seen in Table 1, the performance of the evaluation network constructed from ASRNET-A is slightly better than that of the one constructed from ASRNET-B, but the training of the former takes almost twice as long as that of the latter. The training tasks are performed on 8 Tesla P40 GPUs.

To test the transferability of the searched architectures, the medium-sized evaluation networks stacked with 32 cells are trained from scratch for 15 epochs with batch size 4 on the AISHELL-2 dataset. The initial numbers of channels are 38 and 30 for ASRNET-A and ASRNET-B, respectively. Other training configurations are the same as the ones used for the small evaluation networks. Character error rate results are computed on the test set of the AISHELL-2 dataset. As shown in Table 2, the medium-sized networks achieve more than 20% relative improvements, compared with DFSMN-SAN [10] consisting of 10 DFSMN components and 2 multi-head self-attention sub-layers. Concerning computational overhead and GPU memory usage, the evaluation network constructed from ASRNET-A costs almost twice as much as the one constructed from ASRNET-B, so the latter achieves a better trade-off between model complexity and recognition performance. The training processes are accelerated by using 24 Tesla P40 GPUs.

To further validate the transferability of the searched architecture ASRNET-B, the large evaluation network stacked with 32 cells is trained from scratch for 6 epochs with batch size 4 and an initial number of channels of 50, on the 10K hours large dataset. The momentum SGD optimizer with initial learning rate 0.0002, momentum 0.9, and weight decay 0.0003 is adopted to optimize the network parameters. As shown in Table 3, compared with DFSMN-SAN consisting of 30 DFSMN components and 3 multi-head self-attention sub-layers, the large network achieves more than 15% (average on the four test sets) relative improvement. The training process takes around 7 days on 24 Tesla V100 GPUs.

Table 3: Comparison with the DFSMN-SAN architecture on the 10K hours dataset. CERs are measured on the four test sets.

Large                    Params (M)   Read (%)   Chat (%)   Spon (%)   DEV (%)
DFSMN-SAN                36.1         1.95       22.92      25.41      4.42
ASRNET-B (L=32, C=50)    36.7         1.61       19.99      20.83      3.86
Rel Imp                  -            17.4       12.8       18.0       12.7
5. Conclusions
In this paper, we empirically show that not only is the application of NAS to large scale acoustic modeling in speech recognition possible, but it also allows for very strong performance. Specifically, we perform the architecture search on a small (150 hours) dataset and then transfer the searched architecture to a large dataset for evaluation. On the 1000 hours AISHELL-2 and 10K hours multi-domain datasets, the searched architecture achieves more than 20% and 15% (average on the four test sets) relative improvements respectively, compared with our best hand-designed model architecture. The study of this work may unleash the potential of NAS applications for ASR systems. Future work includes adding latency control constraints into NAS to perform the architecture search for streaming ASR scenarios.

6. References

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in ICASSP. IEEE, 2013, pp. 8614–8618.
[3] A. Graves, N. Jaitly, and A.-r. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in ASRU. IEEE, 2013, pp. 273–278.
[4] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Light gated recurrent units for speech recognition," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92–102, 2018.
[5] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[6] S. Zhang, M. Lei, Z. Yan, and L. Dai, "Deep-FSMN for large vocabulary continuous speech recognition," in ICASSP. IEEE, 2018, pp. 5869–5873.
[7] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in ICASSP. IEEE, 2015, pp. 4580–4584.
[8] L. Dong, S. Xu, and B. Xu, "Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition," in ICASSP. IEEE, 2018, pp. 5884–5888.
[9] N.-Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, S. Stüker, and A. Waibel, "Very deep self-attention networks for end-to-end speech recognition," arXiv preprint arXiv:1904.13377, 2019.
[10] Z. You, D. Su, J. Chen, C. Weng, and D. Yu, "DFSMN-SAN with persistent memory model for automatic speech recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7704–7708.
[11] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[12] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[13] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
[14] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," arXiv preprint arXiv:1806.09055, 2018.
[15] X. Chen, L. Xie, J. Wu, and Q. Tian, "Progressive differentiable architecture search: Bridging the depth gap between search and evaluation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1294–1303.
[16] G. Pundak and T. Sainath, "Lower frame rate neural network acoustic models," in Interspeech, 2016.
[17] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," CoRR, vol. abs/1802.01548, 2018. [Online]. Available: http://arxiv.org/abs/1802.01548
[18] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in Oriental COCOSDA. IEEE, 2017, pp. 1–5.
[19] J. Du, X. Na, X. Liu, and H. Bu, "AISHELL-2: transforming Mandarin ASR research into industrial scale," arXiv preprint arXiv:1808.10583, 2018.
[20] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Interspeech, 2015.