WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit
Binbin Zhang, Di Wu, Chao Yang, Xiaoyu Chen, Zhendong Peng, Xiangming Wang, Zhuoyuan Yao, Xiong Wang, Fan Yu, Lei Xie, Xin Lei
Mobvoi Inc., Beijing, China
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
[email protected]
Abstract
In this paper, we present a new open source, production first and production ready end-to-end (E2E) speech recognition toolkit named WeNet. The main motivation of WeNet is to close the gap between research and production of E2E speech recognition models. WeNet provides an efficient way to ship ASR applications in several real-world scenarios, which is the main difference from and advantage over other open source E2E speech recognition toolkits. This paper introduces WeNet from three aspects: model architecture, framework design, and performance metrics. Our experiments on AISHELL-1 using WeNet not only give a promising character error rate (CER) on a unified streaming and non-streaming two-pass (U2) E2E model but also show reasonable RTF and latency, both of which are favored for production adoption. The toolkit is publicly available at https://github.com/mobvoi/wenet.
Index Terms: WeNet, Production Ready, U2
1. Introduction
E2E models, including CTC [1, 2], the recurrent neural network transducer (RNN-T) [3, 4], and attention based encoder-decoder (AED) models [5, 6, 7], have gained more and more attention in speech recognition over the last few years. Compared with the hybrid ASR framework, the most attractive merit of E2E models is the greatly simplified training procedure. Recent work [8, 9, 10] also shows that E2E systems have surpassed conventional hybrid ASR systems in terms of word error rate (WER). Considering the aforementioned advantages of E2E models, deploying this emerging ASR framework in real-world production becomes necessary. However, deploying an E2E system is non-trivial and there are a lot of problems to be solved.

First, the streaming problem. Streaming inference is essential for many scenarios that require the ASR system to respond quickly with low latency. However, it is difficult to make some E2E models, such as AED, work in a streaming fashion; either great effort is required or a large accuracy loss is introduced [11, 12, 13].

Second, unifying streaming and non-streaming modes. A unified streaming and non-streaming model can reduce the development effort, training cost, and deployment cost, especially for small companies, and is therefore preferred for production adoption [14, 15].

Third, the production problem, which is also the most important problem we care about in the design of WeNet. Great effort is required to bring an E2E model into a real production application, even if we already have a unified model with reasonable performance in both streaming and non-streaming applications. Firstly, we have to convert the research model to a production model, which is painful for a dynamic graph based deep learning toolkit such as PyTorch [16]. Secondly, we have to carefully design the inference workflow in terms of the model architecture, applications, and runtime platforms. For the model architecture, most E2E models first run an encoder forward computation and then an autoregressive beam search. This workflow is more complicated than a simple neural network forward pass, and the problems become even more complicated if streaming is required. For both cloud and on-device applications, computation and memory cost should be seriously considered. Especially for on-device models, inference optimization and model quantization play very important roles. As for runtime platforms, although various platforms can be used for neural network inference, such as ONNX (Open Neural Network Exchange), LibTorch in PyTorch, TensorRT [17], OpenVINO, MNN [18], and NCNN, it still requires both speech-specific and advanced deep learning optimization knowledge to select the best one for your application.

In this work, we present WeNet to address the above problems. "We" in WeNet is inspired by "WeChat", which means connection and sharing, while "Net" is from Espnet [19], since we have referred to a lot of its excellent design. The main motivation of WeNet is to close the gap between research and production of E2E speech recognition models, to reduce the effort of productionizing E2E models, and to explore better E2E models for production. Therefore, WeNet is designed for production by nature, which makes it distinctly different from other toolkits.

Following the production first and production ready principles, WeNet adopts the following implementations. First, WeNet adopts the unified two-pass (U2) [20] framework to solve the streaming and unification problems.
Second, from model training to deployment, WeNet only depends on PyTorch and its ecosystem. Furthermore, WeNet provides an off-the-shelf pipeline for both cloud server and on-device (Android) deployment. The key advantages of WeNet are:
1. Production first and production ready: The Python code of WeNet meets the requirements of TorchScript, so the model trained by WeNet can be directly exported by Torch JIT and used with LibTorch for inference. There is no gap between the research model and the production model. Neither model conversion nor additional code is required for model inference.
2. Unified solution for streaming and non-streaming ASR: WeNet adopts the U2 framework to achieve an accurate, fast and unified E2E model, which is favorable for industry adoption.
3. Portable runtime: Several runtimes are provided to show how to host WeNet trained models on different platforms, including server (x86) and embedded (ARM on Android).
4. Light weight: WeNet is designed specifically for E2E speech recognition, with clean and simple code. It is all based on PyTorch and its corresponding ecosystem. It has no dependency on Kaldi, which simplifies installation and usage.

As our experiments show, WeNet is a simple, accurate speech recognition toolkit with an end-to-end solution from research to production.
2. Related Works
Espnet is the most popular open source platform for end-to-end speech research. It mainly focuses on end-to-end automatic speech recognition (ASR) and adopts the widely-used dynamic neural network toolkits Chainer and PyTorch as its main deep learning engines. It provides E2E implementations including CTC, AED, RNN-T, and RNN language model rescoring. While Espnet is useful and widely used for research, it is hard to directly use a model trained by Espnet in production, and there is no production consideration or support in the Espnet design. Currently, there is no other toolkit that focuses on production-level E2E speech recognition.
3. WeNet
3.1. Model Architecture

As we aim to address the streaming problem, the unification problem, and the production problem, the solution should be simple, easy to implement and train, with good performance, and easy to productize at runtime. U2, a unified two-pass joint CTC/AED model, gives a great solution to these problems, as shown in Figure 1. It consists of three parts: a Shared Encoder, a CTC Decoder, and an Attention Decoder. The Shared Encoder consists of multiple Transformer [21] or Conformer [22] encoder layers. The CTC Decoder consists of a linear layer, which transforms the Shared Encoder output to the CTC activation. The Attention Decoder consists of multiple Transformer decoder layers. The Shared Encoder only sees limited right context, the CTC Decoder runs in a streaming mode in the first pass, and the Attention Decoder is used in the second pass to give a more accurate result.
Figure 1: Two pass CTC and AED joint architecture

3.1.1. Training
A combined loss of the CTC loss and the AED loss is adopted in training, as shown in Equation 1, where x is the acoustic feature, y is the corresponding annotation, L_CTC(x, y) and L_AED(x, y) are the CTC and AED losses respectively, and λ is a hyperparameter which balances the importance of the CTC and AED losses:

L_combined(x, y) = λ L_CTC(x, y) + (1 − λ) L_AED(x, y)    (1)

A dynamic chunk training technique is applied during training to unify the non-streaming and streaming models and to enable latency control. First, the input is split into several chunks by a fixed chunk size C, where each chunk contains inputs [t+1, t+2, ..., t+C]; every chunk attends to itself and all previous chunks, so the whole latency of the CTC Decoder in the first pass only depends on the chunk size. If the chunk size is limited, it works in a streaming way; otherwise, it works in a non-streaming way. Second, the chunk size varies dynamically from 1 to the maximum length of the current training utterance during training, so the trained model learns to predict with an arbitrary chunk size. Empirically, a larger chunk size gives better results with higher latency, so we can easily balance accuracy and latency by tuning the chunk size at runtime.
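To make the idea concrete, the sketch below builds a chunk-based self-attention mask and combines the CTC and AED losses as in Equation 1. The function and variable names (make_chunk_mask, combined_loss, etc.) and the default λ are illustrative, not WeNet's actual API; treat this as a minimal sketch of the technique rather than the toolkit's implementation.

```python
import random
import torch

def make_chunk_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Boolean self-attention mask where each frame may attend to every frame
    in its own chunk and in all previous chunks (True = attention allowed)."""
    # Chunk index of each frame, e.g. chunk_size=4 -> [0, 0, 0, 0, 1, 1, 1, 1, ...]
    chunk_idx = torch.arange(num_frames) // chunk_size
    # Frame i can see frame j iff chunk(j) <= chunk(i).
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)

def sample_dynamic_chunk_size(num_frames: int) -> int:
    """Draw a chunk size from 1 to the utterance length, so the model learns
    to predict with arbitrary chunk sizes (including full attention)."""
    return random.randint(1, num_frames)

def combined_loss(ctc_loss: torch.Tensor,
                  aed_loss: torch.Tensor,
                  lam: float = 0.3) -> torch.Tensor:
    """Equation 1: L = lambda * L_CTC + (1 - lambda) * L_AED."""
    return lam * ctc_loss + (1.0 - lam) * aed_loss

# Example: a 10-frame utterance trained with a randomly sampled chunk size.
T = 10
mask = make_chunk_mask(T, sample_dynamic_chunk_size(T))
print(mask.shape)  # torch.Size([10, 10])
```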
3.1.2. Decoding

For Python decoding in the research stage, to compare and evaluate different parts of the joint CTC/AED model, WeNet supports the following four decoding modes:
1. attention: apply standard autoregressive beam search on the AED part of the model.
2. ctc greedy search: apply CTC greedy search on the CTC part of the model; CTC greedy search is much faster than the other modes.
3. ctc prefix beam search: apply CTC prefix beam search on the CTC part of the model, which can give the n-best candidates.
4. attention rescoring: first apply CTC prefix beam search on the CTC part of the model to generate n-best candidates, then rescore the n-best candidates with the AED decoder part and the corresponding encoder output.
For decoding in the runtime stage, only attention rescoring is supported, since it is our ultimate solution for production.
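As an illustration of the attention rescoring mode, the following sketch rescores CTC n-best hypotheses with the attention decoder in a second pass. The helper names (ctc_nbest, attention_decoder) and the score interpolation weight are hypothetical placeholders; WeNet's actual implementation differs in detail.

```python
import torch

def attention_rescoring(encoder_out: torch.Tensor,
                        ctc_nbest: list,          # [(hyp_tokens, ctc_score), ...]
                        attention_decoder,        # callable scoring a hypothesis
                        ctc_weight: float = 0.5):
    """Second pass: rescore the CTC n-best hypotheses with the attention
    decoder and return the hypothesis with the best combined score."""
    best_hyp, best_score = None, float("-inf")
    for hyp, ctc_score in ctc_nbest:
        # attention_decoder is assumed to return the log-probability of the
        # hypothesis given the shared encoder output (teacher forcing).
        att_score = attention_decoder(encoder_out, hyp)
        score = ctc_weight * ctc_score + (1.0 - ctc_weight) * att_score
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp, best_score
```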
3.2. System Design

The overall design stack of WeNet is shown in Figure 2. The whole framework is based on PyTorch and its ecosystem as the bottom stack. As described in the following sections, TorchScript is used for developing the model, Torchaudio is used for on-the-fly feature extraction, DistributedDataParallel is used for distributed training, torch JIT (Just In Time) is used for model exportation, PyTorch quantization is used for the quantized model, and LibTorch is used for the production runtime. The second stack consists of two parts. Python (TorchScript) Research is for developing the research model; TorchScript is used to ensure the model can be correctly exported as a production model. LibTorch Production is for hosting the production model, which is designed to support various hardware and platforms such as CPU, GPU (CUDA), Linux, Android, and iOS.
Figure 2: WeNet system design
The third stack shows a typical research-to-production pipeline in WeNet; the following sections go through the detailed design of these modules.
Data preparation in WeNet is pretty simple: a Kaldi style label file, a wave list file, and the model unit dictionary file which maps each model unit to its corresponding integer id are all you need. There is no need for any feature extraction in the data preparation stage, since we use on-the-fly feature extraction in training.
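For illustration, the three input files might look like the following. The utterance id, transcript, path, and unit ids are made-up examples in the spirit of a Mandarin character-based recipe, not actual WeNet recipe content.

```
# label file (Kaldi style text): <utterance id> <transcript>
UTT0001 今天天气怎么样

# wave list file (Kaldi style wav.scp): <utterance id> <wave path>
UTT0001 /path/to/UTT0001.wav

# model unit dictionary: <model unit> <integer id>
<blank> 0
<unk> 1
今 2
天 3
```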
The training stage in WeNet has the following key features:
On-the-fly feature extraction: this is based on Torchaudio, which can generate the same FBANK features as Kaldi. Since features are extracted on-the-fly from the raw PCM data, we can do data augmentation on the raw PCM at the time level, the frequency level, and the final feature level at the same time, which enriches the diversity of the training data (see the sketch after this list).
Joint CTC/AED training: joint training speeds up the convergence of training, improves training stability, and gives better recognition results.
Distributed training: WeNet supports multi-GPU training with DistributedDataParallel in PyTorch.
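As a sketch of the on-the-fly feature extraction, the snippet below loads a raw waveform and computes Kaldi-compatible FBANK features with Torchaudio. The file path and parameter values are illustrative rather than WeNet's exact configuration, although they match the 80-dimensional, 25 ms / 10 ms setup used in the experiments of Section 4.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load raw PCM; the path is a made-up example.
waveform, sample_rate = torchaudio.load("example.wav")

# Kaldi-compatible 80-dim FBANK with a 25 ms window and 10 ms shift,
# computed on the fly so augmentation can be applied to the raw PCM first.
fbank = kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=sample_rate,
)
print(fbank.shape)  # (num_frames, 80)
```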
A set of Python tools is provided to recognize wave files and compute the accuracy. These tools help users validate and debug the model before deploying it in production. All the decoding algorithms described in Section 3.1.2 are supported.
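The accuracy metric reported in Section 4 is the character error rate, i.e. the edit distance between hypothesis and reference divided by the reference length. A minimal sketch of such a computation (not WeNet's actual scoring tool) is:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance (sub/ins/del) over reference length."""
    # Dynamic-programming edit distance between the two character sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("今天天气", "今天气"))  # 0.25
```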
Since the WeNet model is implemented in TorchScript, it can be exported by torch JIT to the production model directly and safely. The exported model can then be hosted with the LibTorch library at runtime. Both the float model and the quantized int8 model are supported. Using a quantized model can double the inference speed, or even more, when hosted on Android devices.
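A minimal sketch of this export path is shown below. The stand-in model, the file names, and the use of dynamic quantization of linear layers are assumptions for illustration, not WeNet's exact export script.

```python
import torch

# A stand-in module; in WeNet this would be the trained U2 model whose
# Python code meets the TorchScript requirements. Dimensions are placeholders.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 4000))

# Export the float32 production model for LibTorch.
scripted = torch.jit.script(model)
scripted.save("final.zip")

# Dynamic int8 quantization of the linear layers, then export again.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.jit.script(quantized).save("final_quant.zip")
```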
Currently, we support hosting WeNet production models on two mainstream platforms, namely x86 as the server runtime and Android as the on-device runtime. A C++ API library and runnable demos for both platforms are provided. Users can also implement their own customized systems with the C++ library. We carefully evaluated three key metrics of the ASR system, namely accuracy, the real-time factor (RTF), and latency. The performance is suitable for many ASR applications such as service APIs and on-device voice assistants. The results are reported in Section 4.2.
4. Experiments
We carry out our experiments on the open-source Chinese Mandarin speech corpus AISHELL-1 [23], which contains a 150-hour training set, a 10-hour development set, and a 5-hour test set. The test set contains 7176 utterances in total. We use 80-dimensional log Mel-filter bank (FBANK) features computed on-the-fly by Torchaudio with a 25 ms window and a 10 ms shift. SpecAugment [24] is applied with 2 frequency masks (maximum frequency mask F = 10) and 2 time masks (maximum time mask T = 50). Two convolution sub-sampling layers with kernel size 3*3 and stride 2 are used at the front of the encoder, giving 4 times sub-sampling in total. We use 12 transformer layers for the encoder and 6 transformer layers for the decoder. The Adam optimizer and a learning rate schedule with 25000 warm-up steps are used in training. Moreover, we obtain our final model by averaging the top-K best models with the lowest losses on the dev set during training.

We first trained a non-streaming model (M1) as our baseline; this model is trained and inferenced with full attention. Then we trained a unified model (M2) with the dynamic chunk strategy. M2 is inferenced with different chunk sizes (full/16/8/4) at decoding time, where full is the full-attention non-streaming case and 16/8/4 are the streaming cases.
Table 1: Unified model evaluation (CER %)

decoding method          M1 (full)   M2 (full)   M2 (16)   M2 (8)   M2 (4)
attention                5.76        6.13        6.43      6.59     6.80
ctc greedy search        6.21        6.75        7.85      8.41     9.44
ctc prefix beam search   6.21        6.74        7.85      8.41     9.43
attention rescoring      5.47        5.79        6.50      6.89     7.49

First, as shown in Table 1, the unified model not only shows comparable results to the non-streaming model in the full attention case, but also gives promising results in the streaming case with limited chunk sizes of 16/8/4, which shows the effectiveness of the dynamic chunk training strategy.

Second, by comparing the four decoding modes, we can see that the attention rescoring mode always improves on the CTC results for both the non-streaming model and the unified model. The ctc greedy search and ctc prefix beam search modes have almost the same performance, and they degrade significantly as the chunk size decreases, while the attention mode degrades only slightly; the attention rescoring mode alleviates the degradation of the ctc prefix beam search results. As the U2 paper shows, the attention rescoring mode is faster and has a better RTF than the attention mode, since the attention mode is an autoregressive procedure while attention rescoring is not. Overall, attention rescoring not only shows promising results but also has a lower RTF. Therefore, the dynamic chunk based unified model with attention rescoring decoding is our choice for production, which is why only the attention rescoring mode is supported in our runtime.
This section shows the quantization, RTF, and latency benchmarks on the unified model M2 described above. We run our benchmarks on a server x86 platform and an on-device ARM Android platform respectively.

For the cloud x86 platform, the CPU is a 4-core Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 16 GB of memory in total. Only one thread is used for CPU threading and TorchScript inference for each utterance (see https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html), since the cloud service requires parallel processing and a single thread avoids performance degradation under parallel processing.

For the on-device Android platform, the CPU is a 4-core Qualcomm Snapdragon 865 with 8 GB of memory. A single thread is used for on-device inference.

Here we compare the CER difference before and after quantization; the RTF of the quantized model is shown in the following section. As shown in Table 2, the CER is comparable when quantization is applied. The CER of the float model is slightly different from what is listed in Table 1, since Table 1 is produced by the Python research tools while the results here are produced by the runtime tools.
Table 2: CER before and after quantization

quantization / decoding chunk   full   16     8      4
NO (float32)                    5.87   6.49   6.88   7.46
YES (int8)                      5.89   6.54   6.89   7.51
Table 3: RTF benchmark

model / decoding chunk           full    16      8       4
server (x86), float32           0.079   0.095   0.128   0.186
server (x86), int8              0.072   0.081   0.098   0.134
on-device (Android), float32    0.164   0.251   0.350   0.505
on-device (Android), int8       0.082   0.114   0.130   0.201

As shown in Table 3, the RTF increases as the chunk size decreases, since a smaller chunk requires more iterations of the forward computation. Furthermore, quantization brings about a 2x speedup on-device (Android) and a slight improvement on the server (x86).

For the latency benchmark, we create a WebSocket server/client to simulate a real streaming application. This benchmark is only carried out on the server x86 platform. The average latencies we evaluate are defined as follows:
1. model latency (L1): the waiting time introduced by the model structure. For our chunk based decoding, the average waiting time is half of a chunk, so the total model latency of our model is (chunk/2 * 4 + 6) * 10 (ms), where 4 is the subsampling rate, 6 is the lookahead introduced by the first two CNN layers in the encoder, and 10 is the frame shift in ms.
2. rescoring cost (L2): the time cost of the second-pass attention rescoring.
3. final latency (L3): the user (client) perceived latency, i.e. the time difference between the user stopping speaking and receiving the recognition result. When our ASR server receives the speech-stopping signal, it first forwards the remaining speech for CTC searching and then does the second-pass attention rescoring, so the rescoring cost is part of the final latency. Network latency should also be taken into account in a real production system, but since we tested the server and client on the same machine, the network latency is negligible here.

Table 4: Latency benchmark

decoding chunk   L1 (ms)   L2 (ms)   L3 (ms)
16               380       115       142
8                220       115       135
4                140       114       130

As we see in Table 4, first, the rescoring costs are almost the same for different chunk sizes, which is reasonable since the rescoring computation is invariant to the chunk size. Second, the final latency is dominated by the rescoring cost, which means we can further reduce the final latency by reducing the rescoring cost. Third, the final latency increases only slightly as the decoding chunk grows from 4 to 8 to 16.
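As a quick check of the model latency formula, the snippet below evaluates (chunk/2 * 4 + 6) * 10 ms for the three chunk sizes and reproduces the L1 column of Table 4; the function name and defaults are illustrative only.

```python
def model_latency_ms(chunk_size: int,
                     subsampling: int = 4,
                     lookahead_frames: int = 6,
                     frame_shift_ms: int = 10) -> int:
    """Average model latency: half a chunk (in subsampled frames) plus the
    CNN lookahead, converted to ms via the frame shift."""
    return (chunk_size // 2 * subsampling + lookahead_frames) * frame_shift_ms

for chunk in (16, 8, 4):
    print(chunk, model_latency_ms(chunk))  # 16 -> 380, 8 -> 220, 4 -> 140
```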
5. Conclusions
We present a new open source E2E speech recognition toolkit which is production first and production ready, provides a unified solution for streaming and non-streaming applications, and benchmarks accuracy, RTF, and latency. The whole toolkit is well designed and lightweight, and it shows great performance.

6. References