Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR
Ruizhi Li, Gregory Sell, Hynek Hermansky
Center for Language and Speech Processing, The Johns Hopkins University, USA
Human Language Technology Center of Excellence, The Johns Hopkins University, USA
ABSTRACT
Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when the test acoustic condition differs from training. Hence, it is essential to make ASR systems robust against various environmental distortions, such as background noises and reverberations. In a multi-stream paradigm, improving robustness requires handling a variety of unseen single-stream conditions as well as inter-stream dynamics. Previously, a practical two-stage training strategy was proposed within multi-stream end-to-end ASR, where Stage-2 formulates the multi-stream model with features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation aims to address single-stream input varieties with data augmentation techniques; Stage-2 Time Masking applies temporal masks on UFE features of randomly selected streams to simulate diverse stream combinations. During inference, we also present adaptive Connectionist Temporal Classification (CTC) fusion with the help of hierarchical attention mechanisms. Experiments have been conducted on two datasets, DIRHA and AMI, as a multi-stream scenario. Compared with the previous training strategy, substantial improvements are reported, with relative word error rate reductions of . − . across several unseen stream combinations.

Index Terms — Multi-Stream, Robustness, Two-Stage Augmentation, Adaptive CTC Fusion
1. INTRODUCTION
The multi-stream paradigm of speech processing has been an active research area, in which parallel information sources are simultaneously considered for knowledge fusion. A robust fusion strategy is crucial to reliably address a variety of scenarios with different dynamics across streams. As one inspiration, the idea of parallel processing in human auditory systems has successfully motivated the development of various multi-stream frameworks in hybrid ASR [1, 2, 3, 4]. For instance, multi-band acoustic modeling [3, 4] was proposed to improve the noise robustness of a speech recognizer. Performance measures were introduced to select the most informative source in spatial acoustic scenes for hearing aids [5] or to determine the quality of model outputs [6]. Multi-modal approaches combined visual [7] or symbolic [8] inputs together with speech signals to improve speech recognition. This work concentrates on the setting of multiple far-field microphone arrays, e.g., meeting rooms or domestic scenarios. Common methods of combining multiple arrays in conventional ASR are posterior combination [9, 10], ROVER [11], distributed beamforming [12], and selection based on the Signal-to-Noise/Interference Ratio (SNR/SIR) [13].

The multi-stream end-to-end framework was presented in previous studies [14, 15], in which the MEM-Array model was introduced for multi-array applications. It is a single neural network that takes multiple inputs and directly outputs word/letter sequences. This framework was proposed based on a joint CTC/Attention E2E scheme [16, 17, 18], where each stream is characterized by a separate encoder and CTC network. A Hierarchical Attention Network (HAN) [15, 19] acts as a fusion component to dynamically guide the system towards streams carrying more discriminative information. A practical two-stage training strategy was introduced later in [20]: in Stage-1, a Universal Feature Extractor (UFE) is optimized without requiring parallel data; Stage-2 formulates a multi-stream model directly on the UFE features, with focus solely on training the HAN component.

The previous two-stage training strategy [20] offers a promising direction to further improve the robustness of multi-stream systems. It invites augmentation of the training data, with an emphasis on single-stream variations in Stage-1 and inter-stream dynamics in Stage-2. Moreover, in [20], pre-defined equal CTC contributions during inference can potentially confuse the decoding procedure, especially when acoustic conditions among streams are dramatically different.

In this paper, we present a two-stage augmentation scheme and adaptive CTC fusion targeting the aforementioned situations. The proposed techniques have the following highlights:

1. Stage-1 Augmentation aims to train a well-generalized encoder so that the resulting UFE features are robust against different unseen stream conditions. Both online augmentation (SpecAugment [21]) and offline augmentation approaches are explored. Stage-2 Time Masking applies temporal masks on the UFE features, providing a simple online augmentation technique to create inter-stream dynamics.

2. Adaptive CTC fusion applies the stream fusion vector to the CTC networks in the decoding step. CTC contributions then change dynamically depending on the HAN component, instead of the previous approach of pre-fixed weights.
2. MULTI-STREAM END-TO-END FRAMEWORK
In this section, we review the MEM-Array model, a representative framework of the multi-stream approach with a focus on far-field microphone arrays. An efficient two-stage training strategy is also discussed.
An end-to-end ASR model addressing general multi-stream scenarios was proposed in [14] within the joint CTC/Attention architecture. As one realization of the multi-stream approach, the MEM-Array model can take as input parallel streams from several distant microphone arrays. We denote a T^{(i)}-length sequence of D-dimensional speech vectors as X^{(i)} = \{x_t^{(i)} \in \mathbb{R}^D \mid t = 1, ..., T^{(i)}\}, where the superscript i \in \{1, ..., N\} is the index of the i-th stream. The MEM-Array model directly maps N information sources, X = \{X^{(1)}, X^{(2)}, ..., X^{(N)}\}, into an L-length label sequence, C = \{c_l \in \mathcal{U} \mid l = 1, ..., L\}, where \mathcal{U} is a set of distinct labels.

In the MEM-Array model, multiple microphone arrays are processed by separate encoders with identical architectures to capture diverse information. Encoder^{(i)} operates on the acoustic sequence X^{(i)} to extract a set of higher-level feature representations H^{(i)} = \{h_1^{(i)}, ..., h_{\lfloor T^{(i)}/s \rfloor}^{(i)}\}, where s is the subsampling factor defined by the encoder architecture.

Two levels of attention mechanisms are designated to combine the different views. A frame-level attention mechanism is assigned to each encoder to obtain the stream-specific speech-label alignment. A location-based attention network [22] is applied to compute the letter-wise context vector r_l^{(i)} for stream i:

    r_l^{(i)} = \sum_{t=1}^{\lfloor T^{(i)}/s \rfloor} a_{lt}^{(i)} h_t^{(i)},    (1)

where a_{lt}^{(i)} is the attention weight, a soft alignment of h_t^{(i)} for output c_l. A hierarchical stream-level attention mechanism then handles the different dynamics across the streams. The fusion context vector r_l is computed in a content-based attention network [22]:

    r_l = \sum_{i=1}^{N} \beta_l^{(i)} r_l^{(i)},    (2)

    \beta_l^{(i)} = HierarchicalAttention(q_{l-1}, r_l^{(i)}), \quad i \in \{1, ..., N\},    (3)

where the softmax output \beta_l^{(i)} represents a stream-level attention weight for stream i at letter prediction c_l.

Moreover, a separate CTC network is designated for each encoder. Per-encoder CTC modules have pre-defined equal contributions during joint training and decoding. In the beam search, the CTC prefix score [18, 23] \alpha_{ctc}(h) of a hypothesized sequence h is as follows:

    \alpha_{ctc}(h) = \frac{1}{N} \sum_{i=1}^{N} \alpha_{ctc^{(i)}}(h),    (4)

where equal weight is assigned to each CTC network.

With an increasing number of streams (encoders) involved, jointly training a massive network requires substantial memory and vast amounts of parallel data. A two-stage training strategy was presented in [20] to tackle the aforementioned issues, depicted in Fig. 1. This strategy resulted in performance improvements while efficiently scaling the training procedure. Firstly, Stage-1 focuses on training a single-stream ASR model using various data with no presumption of parallel streams. The well-optimized encoder, referred to as the Universal Feature Extractor (UFE), is used to further process acoustic frames from individual streams to generate UFE features, \{H^{(1)}, H^{(2)}, ..., H^{(N)}\}. In addition, byproducts of Stage-1, such as the decoder, CTC, and frame-level attention, are used for initialization in Stage-2. Secondly, Stage-2 formulates a multi-stream architecture operating directly on the UFE features as inputs, with no highly-parameterized parallel encoders involved. We define the streams used for Stage-2 training as target streams. With pre-trained components frozen during optimization, the model concentrates solely on the stream-level attention.
Fig. 1. Two-Stage Training Strategy [20]. Green indicates trainable components; blue indicates components whose parameters are frozen.
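To make the stream-level fusion of Eqs. (2)-(3) concrete, the following PyTorch sketch implements a content-based hierarchical attention module. It is a minimal illustration rather than the ESPnet implementation; the class and argument names (HierarchicalAttention, q_prev, stream_ctx) are ours.

    import torch
    import torch.nn as nn

    class HierarchicalAttention(nn.Module):
        """Content-based stream-level attention, a minimal sketch of Eqs. (2)-(3)."""
        def __init__(self, dec_dim: int, ctx_dim: int, att_dim: int = 320):
            super().__init__()
            self.w_q = nn.Linear(dec_dim, att_dim)   # projects decoder state q_{l-1}
            self.w_r = nn.Linear(ctx_dim, att_dim)   # projects per-stream contexts r_l^{(i)}
            self.score = nn.Linear(att_dim, 1)

        def forward(self, q_prev: torch.Tensor, stream_ctx: torch.Tensor):
            # q_prev:     (batch, dec_dim)      -- decoder state q_{l-1}
            # stream_ctx: (batch, N, ctx_dim)   -- letter-wise contexts r_l^{(i)}
            e = self.score(torch.tanh(
                self.w_q(q_prev).unsqueeze(1) + self.w_r(stream_ctx)))  # (batch, N, 1)
            beta = torch.softmax(e.squeeze(-1), dim=-1)                 # Eq. (3)
            r_l = torch.einsum("bn,bnd->bd", beta, stream_ctx)          # Eq. (2)
            return r_l, beta  # beta is reused for adaptive CTC fusion (Sec. 3.2)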
3. PROPOSED APPROACHES FOR ROBUSTNESS
As an extension of the two-stage training strategy, in this section we present a two-stage augmentation scheme and an adaptive CTC fusion to improve the robustness of the MEM-Array model.
3.1. Two-Stage Augmentation

Following the framework of the two-stage training strategy described in Fig. 1, the proposed two-stage augmentation scheme defines individual steps to simulate single-stream variations and inter-stream dynamics, respectively.
3.1.1. Stage-1 Augmentation

The goal of Stage-1 training is to obtain a set of UFE features with more discriminative power for Stage-2 prediction. With a limited amount of data for target streams in Stage-2, data augmentation in Stage-1 is a strategy to create more data with a diverse set of conditions and also to involve audio from non-target arrays. In this work, we explore two approaches to improve training with data augmentation in the multi-array scenario (a sketch of the first follows this list):

• SpecAugment [21] is an online augmentation technique that degrades the input on the fly within training mini-batches. It views the spectrogram as a visual representation and modifies it by warping it in the time direction and applying masks in frequency and time.

• The second approach is offline augmentation that generates extra data before training. In the multi-stream framework, we conduct experiments with either simulated audio or real recordings from non-target streams. In DIRHA [24], several reverberated versions of clean speech are generated using pre-measured room impulse responses; in AMI [25], recordings from close-talk microphones, in addition to microphone arrays, are used for Stage-1 training.
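As a sketch of the online Stage-1 augmentation, the snippet below applies one time mask and one frequency mask to a log-Mel spectrogram, with widths matching Table 2 (T = 40, F = 30). It is a simplified illustration of SpecAugment [21], which additionally uses time warping and possibly multiple masks; the function name spec_masks and the mean-fill choice are our assumptions.

    import numpy as np

    def spec_masks(feats: np.ndarray, max_t: int = 40, max_f: int = 30,
                   rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
        """Apply one time mask and one frequency mask in the spirit of
        SpecAugment [21]. feats: (frames, mel_bins) log-Mel spectrogram."""
        out = feats.copy()
        n_frames, n_bins = out.shape
        t = rng.integers(0, max_t + 1)                 # time-mask width in [0, T]
        t0 = rng.integers(0, max(1, n_frames - t))     # random mask position
        out[t0:t0 + t, :] = out.mean()                 # fill masked span (mean fill)
        f = rng.integers(0, max_f + 1)                 # frequency-mask width in [0, F]
        f0 = rng.integers(0, max(1, n_bins - f))
        out[:, f0:f0 + f] = out.mean()
        return out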
3.1.2. Stage-2 Time Masking

Stage-2 augmentation aims to improve a multi-stream model's robustness against variations in inter-stream dynamics. For instance, the model needs to learn how to reliably handle the situation where one of the arrays suddenly fails in a meeting setting. Since the UFE features are the direct inputs for Stage-2, we consider augmentation on UFE features instead of log-Mel filter bank features.

In this work, we introduce Stage-2 Time Masking, a simple but effective method to create differences across the streams. Inspired by temporal masking in SpecAugment, Stage-2 Time Masking masks the UFE features in time for individual streams. For each utterance during training, a pre-defined number of time masks is placed on the UFE features. The mask replaces the original UFE feature values with the filled mask value within the masking region. The location and duration of a mask are both randomly chosen from a uniform distribution. Note that Stage-2 Time Masking is applied only during training. The time mask is utterance-specific, in that it replaces the features with the mean value of the UFE features for that utterance.

Stage-2 Time Masking is intended to mimic a partial loss of a speech segment in one of the streams. Compared with augmentation at the acoustic level, Stage-2 Time Masking is computationally easy to apply, with no additional data required.
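A minimal sketch of Stage-2 Time Masking as described above. The function name and the max_len default are illustrative; the paper samples mask position and duration uniformly and fills with the utterance-level mean of the UFE features.

    import numpy as np

    def stage2_time_mask(ufe: np.ndarray, n_masks: int = 3, max_len: int = 10,
                         rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
        """Mask random temporal spans of one stream's UFE features (frames, dim),
        filling with the utterance mean. Applied independently per stream, and
        only during training."""
        out = ufe.copy()
        fill = ufe.mean()                              # utterance-specific fill value
        n_frames = out.shape[0]
        for _ in range(n_masks):
            length = rng.integers(0, max_len + 1)      # duration ~ Uniform
            start = rng.integers(0, max(1, n_frames - length))
            out[start:start + length, :] = fill        # replace span with the mean
        return out

    # During Stage-2 training, each selected stream gets its own masks:
    # masked_streams = [stage2_time_mask(H_i) for H_i in ufe_streams]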
3.2. Adaptive CTC Fusion

In the previous study [20], the CTC component of each stream was pre-trained in Stage-1 and kept frozen during Stage-2 training. During inference in a multi-stream setting, equal decoding weights across all streams were assigned to the CTC components in Eq. (4). These pre-defined CTC weights could be problematic if one array is in an acoustic condition that is significantly worse than the others.

In this work, we propose adaptive CTC fusion during decoding to mitigate the problem above, using knowledge from the hierarchical attention mechanism. For every prediction, the hierarchical attention network produces an attention vector [\beta_l^{(1)}, \beta_l^{(2)}, ..., \beta_l^{(N)}] across all streams, which steers the system towards more informative streams. Since a label-synchronous beam search is employed during inference, each CTC component produces a prefix score, \alpha_{ctc^{(i)}}(h), for a hypothesized sequence h. Instead of taking the average of stream-specific prefix scores as the overall CTC contribution of hypothesis h, we compute a weighted average of the contributions from individual CTCs. The stream attention vector is combined with the CTC prefix scores:

    \alpha_{ctc}(h) = \sum_{i=1}^{N} \beta_l^{(i)} \, \alpha_{ctc^{(i)}}(h),    (5)

where the adaptive stream weight \beta_l^{(i)} is applied to each CTC network and l is the index of the latest prediction of hypothesis h.
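The fusion in Eq. (5) reduces to a dot product between the stream attention vector and the per-stream CTC prefix scores at each beam-search step. The sketch below is illustrative; variable names are ours, and the per-stream prefix scores would come from standard CTC prefix scoring [18, 23].

    import numpy as np

    def fused_ctc_score(prefix_scores: np.ndarray, beta: np.ndarray) -> float:
        """Eq. (5): weight per-stream CTC prefix scores by the stream attention
        vector produced at the latest prediction step l.
        prefix_scores: (N,) CTC prefix scores alpha_ctc^(i)(h)
        beta:          (N,) softmax stream weights, summing to 1."""
        return float(np.dot(beta, prefix_scores))

    # Inside the joint CTC/attention beam search, this fused score replaces
    # the equal-weight average of Eq. (4):
    # score(h) = lam * fused_ctc_score(alpha_ctc, beta_l) + (1 - lam) * alpha_att(h)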
4. DATA
Two datasets, DIRHA English WSJ [24] and the AMI meeting corpus [25], were used for experiments and analysis.

The DIRHA English WSJ corpus focuses on the challenge of speech interaction via distributed microphones in a domestic environment. There are in total 32 microphones placed in an apartment with a living room and a kitchen. In our experiments, we chose two microphone arrays, Beam Linear Array (BLA) and Beam Circular Array (BCA), and five single microphones (depicted in Fig. 2) for use in either training or evaluation. Training data was created by contaminating the original Wall Street Journal clean speech (WSJ0 and WSJ1, 81 hours in total) with room impulse responses for the corresponding streams. The development set for cross validation was simulated with typical domestic background noise and reverberation. For evaluation, read WSJ utterances were newly recorded simultaneously by all 32 channels in a real setting. In addition, we created a synthetic test stream, NoMic, to replicate the scenario of signal cut-off, where inputs are all zeros after mean and variance normalization.

Fig. 2. DIRHA English WSJ Microphone Configuration. Selected streams are circled in red. The Beam Circular Array contains 6 microphones (LA1-LA6); the Beam Linear Array includes 11 microphones (LD02-LD12).

The AMI meeting corpus was created in three instrumented rooms with meeting conversations. Each meeting room was configured with two microphone arrays and close-talk microphones for individual speakers, resulting in 100 hours of far-field signal-synchronized recordings. With segments of overlapping speakers removed, the training, development, and evaluation sets contain 81 hours, 9 hours, and 9 hours of meeting recordings, respectively. Table 1 summarizes the stream configurations used in subsequent experiments. For stream IHM, the close-talk microphone with the most energy among all attendees was selected at each time frame. In contrast, stream IHM0 always took speech from speaker-0, regardless of whether speaker-0 was speaking. Similar to the DIRHA setup, NoMic was created to mimic constant microphone dropout.

Table 1. AMI Meeting Corpus Stream Configuration.

  Stream  Description
  MDM     first microphone array
  SMDM    second microphone array
  IHM     individual headset microphones
  IHM0    individual headset microphones (fixed speaker-0 for each meeting)
  NoMic   constant stream dropout (all-zero inputs)

For each array in both datasets, the multi-channel input was synthesized into single-channel audio using the delay-and-sum beamforming technique with the BeamformIt toolkit [26].
5. EXPERIMENT SETUP
All experiments were conducted using the PyTorch backend of ESPnet [27]. Table 2 describes the relevant setup information for the various experiments. Two model configurations were explored: Config-1 included two BLSTM layers in the encoder and one LSTM layer in the decoder; the more complex Config-2 had an additional two BLSTM layers and an extra LSTM layer. We used 50 distinct labels, including 26 English letters and other special tokens, i.e., punctuation and sos/eos. A look-ahead word-level RNN-LM [28] was incorporated during inference; it was trained separately using Stochastic Gradient Descent (SGD) for 20 epochs.
Table 2. Experimental Configuration.

  Feature
  Model
    Encoder type            VGGBLSTM [17, 29] (subsampling factor: 4)
    Encoder layers          Config-1: 6 (CNN) + 2 (BLSTM); Config-2: 6 (CNN) + 4 (BLSTM)
    Encoder units           320 cells (BLSTM layers)
    Encoder projection      320 cells (BLSTM layers)
    Frame-level attention   320-cell location-based
    Stream attention        320-cell content-based
    Decoder type            LSTM
    Decoder layers          1 (Config-1) or 2 (Config-2)
    Decoder units           320 cells
  Train and Decode
    Optimizer               AdaDelta
    Batch size              30 (Stage-1); 15 (Stage-2)
    Training epochs         30 (patience: 3 epochs)
    CTC weight λ            .
  RNN-LM
    Type                    look-ahead word-level RNNLM [28]
    Size                    1-layer LSTM with 1,000 cells
    Vocabulary              65,000
    Train data              AMI: AMI; DIRHA: WSJ0-1 + extra WSJ text
    LM weight γ             AMI: 0.5; DIRHA: 1.0
  SpecAugment [21]
    Time mask T             40
    Frequency mask F        30
6. RESULTS AND DISCUSSIONS

6.1. Stage-1 Augmentation
To investigate the effectiveness of Stage-1 augmentation, we evaluated online and offline augmentation techniques on the DIRHA and AMI datasets. Table 3 shows Stage-1 single-stream results using the proposed augmentation schemes. With each model configuration, substantial Word Error Rate (WER) reductions were obtained with SpecAugment, i.e., D1 vs. D3 and D2 vs. D4. Moreover, the more complex network Config-2 did not necessarily improve over the smaller model Config-1 until augmentation was utilized in training (i.e., D1 outperformed D2, but D4 outperformed all earlier models). We created additional reverberated copies of clean WSJ data using room impulse responses measured for four single microphones, i.e., L1L, L2L, L3L, and L4L. D11 achieved better WERs across the six streams compared to D5-D10. More importantly, D11, trained with all six streams, outperformed D4 on the BCA and BLA evaluations, showing the value of the additional out-of-set data. From here, D11 was selected as the Stage-1 model for the remaining DIRHA experiments.
Table 3. Stage-1 Augmentation: DIRHA English WSJ. Model sizes (2,1) and (4,2) represent Config-1 and Config-2 in Table 2. (% WER)

  ID   Train Data   SpecAug  Size   BCA   BLA   L1L   L2L   L3L   L4L
  D1   BCA+BLA      No       (2,1)  33.9  30.7  –     –     –     –
  D2   BCA+BLA      No       (4,2)  34.0  32.0  –     –     –     –
  D3   BCA+BLA      Yes      (2,1)  27.1  24.4  –     –     –     –
  D4   BCA+BLA      Yes      (4,2)  24.9  22.6  –     –     –     –
  D5   BCA          Yes      (4,2)  27.1  –     –     –     –     –
  D6   BLA          Yes      (4,2)  –     27.7  –     –     –     –
  D7   L1L          Yes      (4,2)  –     –     28.3  –     –     –
  D8   L2L          Yes      (4,2)  –     –     –     35.4  –     –
  D9   L3L          Yes      (4,2)  –     –     –     –     33.0  –
  D10  L4L          Yes      (4,2)  –     –     –     –     –     30.4
  D11  All Streams  Yes      (4,2)        17.2
Table 4 summarizes the Stage-1 augmentation results on AMI in a similar way to Table 3. It is clear from A1-A4 that online augmentation (SpecAugment) consistently decreased error rates. Including the additional close-talk stream IHM, A8 showed lower WERs compared to A4. From here, A8 was utilized for AMI Stage-2 training.

Table 4. Stage-1 Augmentation: AMI. (% WER)

  ID  Train Data   SpecAug  Size   MDM   SMDM  IHM
  A1  MDM+SMDM     No       (2,1)  56.9  61.7  –
  A2  MDM+SMDM     No       (4,2)  53.1  58.3  –
  A3  MDM+SMDM     Yes      (2,1)  50.3  54.9  –
  A4  MDM+SMDM     Yes      (4,2)  46.1  50.5  –
  A5  MDM          Yes      (4,2)  50.5  –     –
  A6  SMDM         Yes      (4,2)  –     55.5  –
  A7  IHM          Yes      (4,2)  –     –     30.4
  A8  All Streams  Yes      (4,2)
6.2. Adaptive CTC Fusion

In the previous study [20], each CTC network in the multi-stream setting contributed equally during inference. These pre-defined CTC weights could cause performance degradation if one of the streams is corrupted. We designed simple experiments on DIRHA to illustrate this issue. After Stage-1, we formulated a two-stream model using the target streams BLA and NoMic for training and testing. Since BLA was known to be the only informative source, the Stage-1 performance of 17.2% for BLA was viewed as the best possible result. In Table 5, the oracle Stage-2 decoding setup with CTC weights [1.0; 0.0] achieved a WER of 17.3%, essentially equivalent to the single-stream performance. However, WER increased to 20.5% when equal weights were applied. The proposed adaptive CTC fusion made the model more robust with the help of the stream attention, reaching Stage-1-level performance without any pre-existing knowledge of the relative value of the streams.

Table 5. Issues with Pre-defined CTC Weights. (% WER)

  Model                                         Test
  Stage-1: BLA only (D11 in Table 3)            17.2
  Stage-2: BLA-NoMic
    Pre-defined CTC Weights [1.0; 0.0] (oracle) 17.3
    Pre-defined CTC Weights [0.5; 0.5]          20.5
    Adaptive CTC Fusion
To show the influence of adaptive CTC fusion in matched conditions, we conducted experiments with different two-stream acoustic conditions. In each experiment, training and evaluation data were drawn from the same arrays. Results are displayed in Table 6. To pick diverse conditions in DIRHA, three two-stream configurations were chosen: BLA-L2L, BLA-BCA, and L3L-L4L. According to the Stage-1 performance, BLA was the most informative single stream; BCA and L2L were the most similar and most different streams to BLA in terms of WER, respectively; and L3L and L4L resulted in the same Stage-1 WER. For AMI, all three pairwise combinations of the three streams were selected. WER improvements were observed across all six cases in the two datasets. From the analysis of the BLA-L2L case, we observed an increasing percentage of improved utterances ( . → . ). Note that an improved utterance describes the situation where the WER from the multi-stream model is the same as or lower than the best single-stream WER.

Table 6. Adaptive CTC Fusion in Matched Conditions. (% WER)

  Decoding Strategy            Train/Test Data
                               DIRHA
                               BLA-L2L   BLA-BCA  L3L-L4L
  Pre-defined CTC [0.5; 0.5]   17.2      16.5     20.4
  Adaptive CTC Fusion          16.9
                               AMI
                               MDM-SMDM  MDM-IHM  SMDM-IHM
  Pre-defined CTC [0.5; 0.5]   42.0      29.3     29.8
  Adaptive CTC Fusion          41.6
For the following experiments, we designated BLA-L2L and MDM-SMDM as the training stream configurations for DIRHA and AMI, respectively. In DIRHA, three mismatched test conditions were chosen. BLA-NoMic and BLA-KA6 were unseen scenarios where one stream (BLA) is known to greatly outperform the other; note that KA6 (Stage-1 WER: . ) was a microphone in the kitchen while speakers read in the living room. L3L-L4L were the microphones with the same Stage-1 performance. We specified two mismatched conditions for AMI: MDM-NoMic and MDM-IHM0. Recall that IHM0 (Stage-1 WER: . ) is the close-talk microphone attached to speaker-0. For DIRHA, the results in Table 7 show moderate improvement except for BLA-NoMic, which sees a modest decline; stream NoMic is an extreme case and may be too aggressive as an unseen test stream. For AMI, relative WER reductions of . and . were observed for the mismatched conditions.

Table 7. Adaptive CTC Fusion in Mismatched Conditions. (% WER)

  Decoding Strategy            Test Data
                               DIRHA (BLA-L2L)
                               BLA-NoMic  BLA-KA6   L3L-L4L
  Pre-defined CTC [0.5; 0.5]   26.9       21.0      20.3
  Adaptive CTC Fusion          27.1       20.7      20.0
                               AMI (MDM-SMDM)
                               MDM-NoMic  MDM-IHM0  –
  Pre-defined CTC [0.5; 0.5]   46.1       44.0      –
  Adaptive CTC Fusion          43.1       41.9      –

To demonstrate another potential weakness of the previous MEM-Array system, we designed experiments in DIRHA to show the performance degradation caused by a mismatched test condition, as depicted in Table 8.
BLA-L2L and BLA-NoMic were used to train and test two Stage-2 models. While the matched conditions on the diagonal of Table 8 exhibited reasonable results, the model trained with BLA-L2L was unable to handle the unseen condition BLA-NoMic, degrading by nearly . in absolute WER compared to the Stage-1 BLA performance of 17.2%.

Table 8. Evaluation in Matched and Mismatched Conditions. (% WER)

  Model               Test Data
                      BLA-L2L  BLA-NoMic
  Stage-2 BLA-L2L
  Stage-2 BLA-NoMic
6.3. Stage-2 Time Masking

Stage-2 models were trained on BLA-L2L (DIRHA) and MDM-SMDM (AMI), respectively. During training, a time mask was created with its length uniformly sampled from [0, . ] (in frames); note that 10 frames account for 0.4 seconds due to subsampling. Each mask was applied at a randomly selected position on the UFE features.

We experimented with different numbers of time masks. For the DIRHA experiments, a model trained with 3 time masks per stream gave the optimal results. In particular, a substantial absolute WER improvement of . was seen when evaluating BLA-NoMic, presumably because this is essentially the situation that Stage-2 augmentation simulates. WER on BLA-KA6 also decreased, while other conditions were unchanged or slightly improved. In the AMI experiments, Stage-2 augmentation kept all test conditions at similar performance. It is likely that the AMI model could already handle these unseen conditions properly; for instance, without Stage-2 time masking, MDM-NoMic achieved a WER of 43.1%, which is close to the Stage-1 MDM performance of . . The number of masks was set to 3 for DIRHA and 1 for AMI based on matched-condition performance.

For comparison, input dropout on the UFE features was implemented. Stage-2 Time Masking consistently obtained lower WERs in all conditions, which supports the idea of creating stream dynamics instead of unit dropout over the inputs.

Table 9. Stage-2 Time Masking. (% WER)

  Model                 Test Data
                        DIRHA
                        BLA-L2L   BLA-NoMic  BLA-KA6   L3L-L4L
  Stage2 BLA-L2L        16.9      27.1       20.7      20.0
  - Input Dropout 0.2   17.7      38.1       22.1      20.6
  - Input Dropout 0.5   19.2      21.0       23.6      22.6
  - Time Masking (3)
                        AMI
                        MDM-SMDM  MDM-NoMic  MDM-IHM0  –
  Stage2 MDM-SMDM       41.6      43.1       41.9      –
  - Input Dropout 0.2   42.3      44.5       42.6      –
  - Input Dropout 0.5   45.2      49.5       46.3      –
  - Time Masking ( )                                   –
  - Time Masking ( )                                   –
6.4. Amount of Parallel Data

Generally, parallel data are more expensive to collect. In this section, we examine how much parallel data is sufficient to train a Stage-2 model with reasonable performance, using BLA-L2L in DIRHA for this demonstration. As described in Table 10, 1 hour of data per stream maintained fair WERs with an average performance degradation of only . , indicating a relatively low burden on data resources.

Table 10. Discussion on Amount of Parallel Data. (% WER)

  Training Data (Hours)  Test Data
                         BLA-L2L  BLA-NoMic  BLA-KA6  L3L-L4L
6.5. Overall Results

Table 11 summarizes the contributions of each proposed step, in this case using BLA-L2L and MDM-SMDM as the training stream configurations for DIRHA and AMI, respectively, while the test data include both matched and mismatched conditions. Stage-1 augmentation together with a more complex model consistently reduced the WERs. Adaptive CTC fusion and Stage-2 time masking provided notable improvements in various scenarios. Overall, compared to the previous training strategy [20], we observed average relative WER reductions of . (DIRHA) and . (AMI). In particular, substantial relative WER improvements of . − . were reported across several mismatched stream conditions. For a fair comparison, we also evaluated a model where the HAN component was replaced by fixed stream fusion weights [0.5; 0.5] for the fusion of context vectors; in these cases, the components including CTC, frame-level attention, and decoder were optimized during Stage-2. Our proposed model greatly outperformed the model with no stream attention.

Table 11. Overall Results. (% WER)

  Model                    Test Data
                           DIRHA (BLA-L2L)
                           BLA-L2L   BLA-NoMic  BLA-KA6   L3L-L4L
  Two-Stage Training       27.4      43.8       37.9      29.7
  + Large Model            28.0      57.4       37.8      29.7
  + Stage-1 Augment.       17.2      26.9       21.0      20.3
  + Adaptive CTC Fusion    16.9      27.1       20.7      20.0
  + Stage-2 Time Masking
  No Stream Attention      36.2      66.5       49.4      37.2
                           AMI (MDM-SMDM)
                           MDM-SMDM  MDM-NoMic  MDM-IHM0  –
  Two-Stage Training       55.5      69.0       59.2      –
  + Large Model            52.0      62.0       55.1      –
  + Stage-1 Augment.       42.0      46.1       44.0      –
  + Adaptive CTC Fusion    41.6      43.1       41.9      –
  + Stage-2 Time Masking                                  –
  No Stream Attention      56.0      69.7       65.8      –

To visualize the effect of the stream attention, Fig. 3 shows attention plots of two examples from the MDM-IHM0 evaluation set in AMI. In the first example, panels (a)-(c), speaker-0 was speaking; as a result, both MDM and IHM0 were informative sources, and the stream attention in (c) gave weight to both inputs, though shifted slightly towards IHM0 since this close-talk stream had better speech quality. In the second example, panels (d)-(f), speaker-0 was not speaking, and so another speaker's audio was recorded by MDM while IHM0 could barely capture any speech. In this case, the stream fusion mechanism correctly attended to MDM with nearly full confidence.

Fig. 3. Sentence Analysis of the Attention Mechanism during Inference. Example 1 (speaker-0 speaking) comprises panels (a)-(c); Example 2 (speaker-0 not speaking) comprises panels (d)-(f). Panels (a) and (d) are frame-wise attention alignments of MDM; (b) and (e) are frame-wise attention alignments of IHM0; (c) and (f) are stream attention weights of MDM-IHM0.
7. CONCLUSION
In this work, we presented a two-stage augmentation scheme and adaptive CTC fusion for the purpose of improving the robustness of the multi-stream end-to-end model against diverse testing conditions. Inherited from the two-stage training strategy, the two-stage augmentation consistently improved performance across matched and mismatched conditions; adaptive CTC fusion enhances robustness by applying stream attention weights dynamically. For future research, stream-specific knowledge could be used for more customized Stage-2 training, and more sophisticated attention mechanisms could be explored for stream fusion.

8. REFERENCES

[1] Hynek Hermansky, "Multistream recognition of speech: Dealing with unknown unknowns," Proc. of the IEEE, vol. 101, no. 5, pp. 1076–1088, 2013.
[2] Hynek Hermansky, "Coding and decoding of messages in human speech communication: Implications for machine recognition of speech," Speech Communication, 2018.
[3] Sri Harish Reddy Mallidi, A Practical and Efficient Multistream Framework for Noise Robust Speech Recognition, Ph.D. thesis, Johns Hopkins University, 2018.
[4] Sri Harish Mallidi and Hynek Hermansky, "Novel neural network based fusion for multistream asr," in Proc. of ICASSP. IEEE, 2016, pp. 5680–5684.
[5] Bernd T Meyer, Sri Harish Mallidi, Angel Mario Castro Martinez, Guillermo Payá-Vayá, Hendrik Kayser, and Hynek Hermansky, "Performance monitoring for automatic speech recognition in noisy multi-channel environments," in Proc. of SLT. IEEE, 2016, pp. 50–56.
[6] Ruizhi Li, Gregory Sell, and Hynek Hermansky, "Performance monitoring for end-to-end speech recognition," in Proc. of INTERSPEECH, 2019.
[7] Shruti Palaskar, Ramon Sanabria, and Florian Metze, "End-to-end multimodal speech recognition," in Proc. of ICASSP. IEEE, 2018, pp. 5774–5778.
[8] Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, and Shinji Watanabe, "Multi-modal data augmentation for end-to-end asr," Proc. Interspeech 2018, pp. 2394–2398, 2018.
[9] Xiaofei Wang, Ruizhi Li, and Hynek Hermansky, "Stream attention for distributed multi-microphone speech recognition," in Proc. of INTERSPEECH, 2018, pp. 3033–3037.
[10] Feifei Xiong, Jisi Zhang, Bernd T Meyer, Heidi Christensen, and Jon Barker, "Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments," in CHiME-5, 2018.
[11] Jonathan G Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover)," in Proc. of ASRU. IEEE, 1997, pp. 347–354.
[12] Takuya Yoshioka, Dimitrios Dimitriadis, Andreas Stolcke, William Hinthorn, Zhuo Chen, Michael Zeng, and Xuedong Huang, "Meeting transcription using asynchronous distant microphones," in Proc. Interspeech 2019, 2019, pp. 2968–2972.
[13] Jun Du, Yan-Hui Tu, Lei Sun, Feng Ma, Hai-Kun Wang, Jia Pan, Cong Liu, Jing-Dong Chen, and Chin-Hui Lee, "The ustc-iflytek systems for chime-5 challenge," in CHiME-5, 2018.
[14] Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, and Hynek Hermansky, "Multi-stream end-to-end speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 646–655, 2019.
[15] Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, and Hynek Hermansky, "Stream attention-based multi-array end-to-end speech recognition," in Proc. of ICASSP. IEEE, 2019, pp. 7105–7109.
[16] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. of ICASSP, 2017, pp. 4835–4839.
[17] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," in Proc. of INTERSPEECH, 2017.
[18] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, "Hybrid ctc/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[19] Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, and Hynek Hermansky, "Multi-encoder multi-resolution framework for end-to-end speech recognition," arXiv preprint arXiv:1811.04897, 2018.
[20] Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, and Hynek Hermansky, "A practical two-stage training strategy for multi-stream end-to-end speech recognition," in Proc. of ICASSP. IEEE, 2020, pp. 7014–7018.
[21] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "Specaugment: A simple data augmentation method for automatic speech recognition," Proc. of INTERSPEECH, 2019.
[22] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Proc. of NIPS, pp. 577–585. Curran Associates, Inc., 2015.
[23] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Ph.D. dissertation, Technical University of Munich, Germany, 2008.
[24] Mirco Ravanelli, Piergiorgio Svaizer, and Maurizio Omologo, "Realistic multi-microphone data simulation for distant speech recognition," in Proc. of INTERSPEECH, 2016.
[25] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., "The ami meeting corpus: A pre-announcement," in Proc. of MLMI. Springer, 2005, pp. 28–39.
[26] Xavier Anguera, Chuck Wooters, and Javier Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, 2007.
[27] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., "Espnet: End-to-end speech processing toolkit," in Proc. of INTERSPEECH, 2018, pp. 2207–2211.
[28] Takaaki Hori, Jaejin Cho, and Shinji Watanabe, "End-to-end speech recognition with word-based rnn language models," in Proc. of SLT. IEEE, 2018, pp. 389–396.
[29] Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, and Takaaki Hori, "Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling," in Proc. of SLT. IEEE, 2018.