Analysis of memory in LSTM-RNNs for source separation
Jeroen Zegers and Hugo Van hamme ESAT, KU Leuven, Belgium
September 2, 2020
Abstract
Long short-term memory recurrent neural networks (LSTM-RNNs) are considered state-of-the-art in many speech processing tasks. The recurrence in the network, in principle, allows any input to be remembered for an indefinite time, a feature very useful for sequential data like speech. However, very little is known about which information is actually stored in the LSTM and for how long. We address this problem by using a memory reset approach, which allows us to evaluate network performance depending on the allowed memory time span. We apply this approach to the task of multi-speaker source separation, but it can be used for any task using RNNs. We find a strong performance effect of short-term (shorter than 100 milliseconds) linguistic processes. Only speaker characteristics are kept in the memory for longer than 400 milliseconds. Furthermore, we confirm that performance-wise it is sufficient to implement longer memory in deeper layers. Finally, in a bidirectional model, the backward model contributes slightly more to the separation performance than the forward model.
Keywords — multi-speaker source separation, long short-term memory, recurrent neural networks, memory analysis
1 Introduction

Deep learning has been dominant for many speech tasks in recent years [Hinton et al., 2012]. Since speech is a dynamic process, sequential models like recurrent neural networks (RNNs) seem ideal for modeling the underlying process [Graves et al., 2013]. RNNs give state-of-the-art performance in many speech [Bahdanau et al., 2016, Kolbæk et al., 2017] and non-speech [Sundermeyer et al., 2015, Shi et al., 2015] related tasks. While standard RNN cells are theoretically capable of remembering any input from the past and using this information as they see fit, it is found that their memory time span (the duration for which information is kept in memory) is actually relatively short. This is caused by vanishing and exploding gradients, which occur when the recurrent network is unrolled through time for the backpropagation step [Bengio et al., 1994, Mozer, 1992, Hochreiter et al., 2001], a problem also observed in very deep networks.

A solution to this gradient problem was given with the introduction of the long short-term memory (LSTM) cell [Hochreiter and Schmidhuber, 1997]. The principle of the constant error carousel counters the vanishing and exploding gradients encountered with the regular RNN cell [Hochreiter and Schmidhuber, 1997]. It was shown that these LSTM-RNNs succeeded in solving simple, artificial tasks with long-range dependencies of over a thousand time steps. However, it is unknown how long the memory time span of LSTM cells is for real and complex tasks like speech processing.

The aim of this paper is to examine the time span and importance of internal dynamics in the LSTM memory. This will be done by resetting the state of the LSTM cell at particular time intervals so as to limit the allowed memory time span. By gradually reducing the reset frequency, bigger memory time spans are allowed.
The networks are evaluated for different reset frequencies and the task performance differences can be used to assess the importance of the different memory spans. Furthermore, it is possible to use different memory spans for different layers in the LSTM-RNN. This allows us to confirm or reject the hypothesis that deeper layers in RNNs build higher-level abstractions of the data and therefore use a bigger time span [Chung et al., 2016]. Finally, different memory spans can be used for the forward and backward direction of a bidirectional LSTM-RNN, which allows us to distinguish between the importance of the directions.

The task of multi-speaker source separation (MSSS) seems well suited for this analysis, as it has been shown that both long-term and short-term effects are important [Zegers and Van hamme, 2018]. This paper therefore focuses on the MSSS task, but the proposed methodology can be applied to any task using RNNs. Specifically, we would like to answer the following research questions with regard to MSSS:

• Which order of time spans is important when using an LSTM-RNN for MSSS, and can we link these time spans with descriptions of speech like phonetics, phonotactics, lexicon, prosody and grammar?

• Since it has been shown that speaker characterization is relevant for the task [Zegers and Van hamme, 2017, 2018], can we find the amount of context necessary for the LSTM-RNN to sufficiently characterize the speakers in overlapping speech?

• For MSSS, do we observe the same hierarchical property that deeper layers have larger time dependencies, as was found by Chung et al. [2016]?

• In bidirectional LSTM-RNNs, would either direction be more important than the other for MSSS?

The rest of this paper is organized as follows. In section 2 an overview of related work will be given, and in section 3 the task of MSSS will be explained, as well as how LSTM-RNNs can be used to tackle this problem.
The memory reset LSTM cell is introduced in section 4. The experimental setup is given in section 5 and results are discussed in section 6. A final conclusion is given in section 7.
2 Related work

To our knowledge Singh et al. [2016a] is the only work where a similar reset approach to ours is given. In their paper a multi-stream system with an LSTM component for video action detection is described. A similar memory reset approach is used, but only for a unidirectional single-layer LSTM. In this paper we extend to bidirectional multi-layer LSTMs, where we allow layer-dependent reset periods (section 4.3) and a method to reduce the computational burden for longer reset periods (section 4.4). This makes the state reset algorithm far more complex compared to the unidirectional single-layer reset. Furthermore, the memory reset by Singh et al. [2016a] was done during testing only, causing a train-test mismatch. In this work memory reset will be done during training and testing.

A different approach, called the segment approach in this paper for further reference, has a similar aim to restrict the LSTM memory span [Mohamed et al., 2015, Chelba et al., 2017, Tüske et al., 2018]. Segments are created by shifting a window by one time step over the input data. Each segment is passed through the LSTM and the output of the LSTM at the last time step within the segment is retained. The output of each segment has thus been produced by an LSTM with a memory span equal to the length of the window. It has been verified in our experiments that the reset approach and the segment approach give the same results. Chelba et al. [2017] and Tüske et al. [2018] used the segment approach on a language modeling task and claimed that perplexity scores did not improve by increasing the memory span beyond 40 time steps (words). Similarly, word error rates (WER) converged at 20 time steps. In both works the number of different memory lengths evaluated was rather limited and they were mainly interested in the performance saturation point. There was no analysis of the importance of different time scales within the model.

The segment approach was first applied by Mohamed et al.
[2015] on the spectral input of an automatic speech recognition (ASR) task. They found that the WER of the acoustic LSTM-RNN saturated relatively quickly. Therefore it was concluded that the main strength of the RNN is the frame-by-frame processing rather than the ability to have a large memory span. However, we found that for the task of MSSS long-term dependencies were, in fact, important for the separation quality.

A third approach to assess the relative importance of different memory span lengths is the leaky approach introduced by Zegers and Van hamme [2018]. There, a fraction of the LSTM cell state is leaked, on purpose, at every time step. This forces the LSTM to forget information of the past over time. The leaky approach can be seen as a soft reset compared to the hard reset of the reset approach proposed in this paper. The amount of computation in a leaky LSTM cell remains unchanged, while the computational load of both the reset and segment approaches scales with the width of the memory span. The leaky approach is thus more interesting from a computational standpoint, but since the reset is soft, the timings found will not be exact.

The above approaches focus on analysis of the memory span by limiting memory capabilities. There are multiple works that give no explicit in-depth time analysis but adapt or restrict the memory span of the RNN in order to improve performance on a task. Chung et al. [2016] and El Hihi and Bengio [1996] replaced the soft reset of the forget gate with a hard reset implementation. The network tries to learn the optimal reset frequency. Other approaches, like the clockwork RNN, have different, fixed update frequencies for different cells in the network. Memory restrictions are made on a cell level, rather than on a network or layer level. The idea is that some cells should focus on long-term effects and some should focus on short-term effects [Koutník et al., 2014, Alpay et al., 2016, Neil et al., 2016].

3 Multi-speaker source separation
In this section a well-known MSSS algorithm called Deep Clustering (DC) [Hershey et al., 2016] will be explained. The method is invariant to permutations in the order of the reference speakers. It relies on intrinsic speaker characterization by the network to consistently map time-frequency bins dominated by the same speaker to the same point in the output space for a given mixture. At the end of the section, i-vectors, an explicit speaker characterization embedding often used in speaker recognition tasks, will be presented as an alternative to this intrinsic speaker characterization. The i-vectors of the active speakers can simply be appended to the input data. In the results section it will be shown that this increases the separation performance for short-term memory spans, since the network is unburdened from the speaker characterization subtask.
When a mixture signal $y[n] = \sum_{s=1}^{S} x_s[n]$ of $S$ speakers is presented, the goal of MSSS is to estimate a signal $\hat{x}_s[n]$ for the $s$-th speaker that is as close as possible to the source signal $x_s[n]$. This task can be expressed in the time-frequency domain using the short-time Fourier transform (STFT) of the signals. $\hat{X}_s(t,f)$ should then be estimated from $Y(t,f) = \sum_{s=1}^{S} X_s(t,f)$. The inverse STFT (ISTFT) can be used to find $\hat{x}_s[n]$ from $\hat{X}_s(t,f)$. Typically, for each speaker a mask $\hat{M}_s(t,f)$ is estimated such that

$$\hat{X}_s(t,f) = \hat{M}_s(t,f)\,Y(t,f), \qquad (1)$$

for every time frame $t = 0, \ldots, T-1$ and frequency $f = 0, \ldots, F-1$.
One approach to address the MSSS task is to find mappings $g_{tf}$ from the input mixture to the mask estimates:

$$\hat{M}_s(t,f) = g_{tf}(Y), \qquad (2)$$

with the constraints that $\hat{M}_s(t,f) \geq 0$ and $\sum_{s=1}^{S} \hat{M}_s(t,f) = 1$ for every time-frequency bin $(t,f)$. In this paper $g_{tf}$ will be modeled with an LSTM-RNN. A differentiable loss function can be used to assess the quality of the speech estimates:

$$L = \sum_{s=1}^{S} \sum_{t,f} D(|\hat{X}_s(t,f)|, |X_s(t,f)|), \qquad (3)$$

with $D$ some discrepancy measure. However, since an intra-class separation task is executed and no prior information on the speakers is assumed to be known, there is no guarantee that the network's assignment of speakers is consistent with the speaker labels of the targets. This is referred to as the label ambiguity or permutation problem [Hershey et al., 2016]. To cope with this ambiguity, a loss function has to be defined that is independent of the order of the target speakers. DC uses such a permutation-invariant loss function.

In DC, a $D$-dimensional embedding vector $v_{tf}$ is constructed for every time-frequency bin as $v_{tf} = g_{tf}(Y)$, where $v_{tf}$ has unit length. A $(TF \times D)$-dimensional matrix $V$ is then constructed from these embedding vectors. Similarly, a $(TF \times S)$-dimensional target matrix $Z$ is defined. If target speaker $s$ is the dominant speaker for bin $(t,f)$, then $z_{tf,s} = 1$, otherwise $z_{tf,s} = 0$. Speaker $s$ is dominant in a bin $(t,f)$ if $s = \arg\max_{s'}(|X_{s'}(t,f)|)$. A permutation-independent loss function is then defined as

$$L = \|VV^T - ZZ^T\|_F^2 = \sum_{t_1,f_1,t_2,f_2} \left( \langle v_{t_1 f_1}, v_{t_2 f_2} \rangle - \langle z_{t_1 f_1}, z_{t_2 f_2} \rangle \right)^2, \qquad (4)$$

where $\|.\|_F^2$ is the squared Frobenius norm. Since $z_{tf}$ is a one-hot vector,

$$\langle z_{t_1 f_1}, z_{t_2 f_2} \rangle = \begin{cases} 1, & \text{if } z_{t_1 f_1} = z_{t_2 f_2} \\ 0, & \text{otherwise.} \end{cases} \qquad (5)$$
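As an illustration, the loss (4) can be computed without ever forming the $(TF \times TF)$ affinity matrices, using the standard expansion $\|VV^T - ZZ^T\|_F^2 = \|V^TV\|_F^2 - 2\|V^TZ\|_F^2 + \|Z^TZ\|_F^2$. The NumPy sketch below is illustrative only (it is not the authors' implementation):

```python
import numpy as np

def dc_loss(V, Z):
    """Deep Clustering loss ||VV^T - ZZ^T||_F^2 of eq. (4).

    V: (T*F, D) array of unit-length embedding vectors.
    Z: (T*F, S) one-hot dominant-speaker target matrix.
    The expanded form avoids the (T*F x T*F) affinity matrices.
    """
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Z) ** 2
            + np.linalg.norm(Z.T @ Z) ** 2)

# Small sanity check against the direct definition of eq. (4).
rng = np.random.default_rng(0)
V = rng.normal(size=(12, 4))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-length embeddings
Z = np.eye(2)[rng.integers(0, 2, size=12)]      # one-hot targets, S = 2
direct = np.linalg.norm(V @ V.T - Z @ Z.T) ** 2
```

Since pairs of bins with the same dominant speaker are pushed towards inner product 1 and all other pairs towards 0, a perfect embedding makes the two affinity matrices identical and the loss zero.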
The ideal angle $\phi_{t_1 f_1, t_2 f_2}$ between the normalized vectors $v_{t_1 f_1}$ and $v_{t_2 f_2}$ is thus

$$\phi_{t_1 f_1, t_2 f_2} = \begin{cases} 0, & \text{if } z_{t_1 f_1} = z_{t_2 f_2} \\ \pi/2, & \text{otherwise.} \end{cases} \qquad (6)$$

After estimating $V$, all embedding vectors are clustered into $S$ clusters using K-means. The masks are then constructed as follows:

$$\hat{M}_{s,tf} = \begin{cases} 1, & \text{if } v_{tf} \in c_s \\ 0, & \text{otherwise,} \end{cases} \qquad (7)$$

with $c_s$ a cluster from K-means. Equation (1) can then be used to estimate the STFT of the original source signals.

A speaker representation that is often used for speaker identification tasks is the i-vector [Dehak et al., 2011, Glembek et al., 2011]. To obtain such an i-vector, first a universal background model, based on a Gaussian mixture model (GMM-UBM), is trained on development data. A supervector $s$ is derived for each utterance, using the UBM, by stacking the speaker-adapted Gaussian mean vectors. $s$ is then represented by an i-vector $w$ and its projection based on the total variability space,

$$s \approx m + Tw, \qquad (8)$$

where $m$ is the UBM mean supervector, $w$ is the total variability factor or i-vector, and $T$ is a low-rank matrix spanning a subspace with important variability in the mean supervector space, trained on development data [Dehak et al., 2011, Glembek et al., 2011].

Figure 1: Schematic of an LSTM cell. Based on Colah [2015].

If such i-vectors are explicitly presented at the input of the DC network, possibly less information would have to be retained in the LSTM memory, as there is no need for an intrinsic speaker characterization [Zegers and Van hamme, 2017, 2018]. It is noteworthy that Drude et al. [2018] managed to show separation quality improvement by adding an auxiliary speaker identification loss to the separation loss.

4 Memory reset LSTM-RNN

In this section we give a short summary of the LSTM before we describe the memory reset LSTM cell. Next we describe how this memory reset LSTM cell can be used in an RNN. Derivations are given in A. Finally, we discuss how computational costs can be reduced by using the grouped memory reset approach.

Figure 2: Visualization of the memory reset LSTM cell with $T_{reset} = 4$.
It keeps $K = 4$ instances of the hidden unit and cell state, which can be reset according to (21). The LSTM cell is used to process each instance.

4.1 Regular LSTM cell

The regular LSTM cell is shown in figure 1 and is defined in (9)-(14):

$$f_t^l = \sigma(W_f^l x_t^l + R_f^l h_{t-1}^l + b_f), \qquad (9)$$
$$i_t^l = \sigma(W_i^l x_t^l + R_i^l h_{t-1}^l + b_i), \qquad (10)$$
$$o_t^l = \sigma(W_o^l x_t^l + R_o^l h_{t-1}^l + b_o), \qquad (11)$$
$$j_t^l = \tanh(W_j^l x_t^l + R_j^l h_{t-1}^l + b_j), \qquad (12)$$
$$c_t^l = c_{t-1}^l \odot f_t^l + j_t^l \odot i_t^l, \qquad (13)$$
$$h_t^l = \tanh(c_t^l) \odot o_t^l, \qquad (14)$$

with $x_t^l$, $c_t^l$ and $h_t^l$ the cell's input, state and output (or hidden unit), respectively, at time $t$ for layer $l = 1, \ldots, L$, with $L$ the number of layers in the network. $f_t^l$, $i_t^l$ and $o_t^l$ are called the forget gate, the input gate and the output gate, respectively. The input of an LSTM cell is the output of the layer below:

$$x_t^l = h_t^{l-1}, \qquad (15)$$

for $l = 2, \ldots, L$. The first layer receives the input of the LSTM-RNN. The output of the LSTM-RNN is the output of the last layer:

$$h_t = h_t^L. \qquad (16)$$

An LSTM-RNN can be made bidirectional. In that case the backward direction processes the input from end to start: the recursion runs from $t+1$ to $t$ instead of from $t-1$ to $t$. We use $\overrightarrow{\bullet}$ to denote the hidden units in the forward direction and $\overleftarrow{\bullet}$ for the backward direction. There are two ways to combine the outputs of the forward and backward directions: either after every layer or only after the last layer. If the latter is chosen, (15) becomes

$$\overrightarrow{x}_t^l = \overrightarrow{h}_t^{l-1}, \qquad \overleftarrow{x}_t^l = \overleftarrow{h}_t^{l-1}. \qquad (17)$$

If outputs are instead combined after every layer, we get

$$\overrightarrow{x}_t^l = \begin{pmatrix} \overrightarrow{h}_t^{l-1} \\ \overleftarrow{h}_t^{l-1} \end{pmatrix}, \qquad \overleftarrow{x}_t^l = \begin{pmatrix} \overrightarrow{h}_t^{l-1} \\ \overleftarrow{h}_t^{l-1} \end{pmatrix}. \qquad (18)$$

In both cases, for the final output of the network, (16) becomes

$$h_t = \begin{pmatrix} \overrightarrow{h}_t^L \\ \overleftarrow{h}_t^L \end{pmatrix}. \qquad (19)$$

In our experiments we found that it was better to combine outputs after every layer, even if the total number of trainable parameters was kept unchanged.
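Equations (9)-(14) can be sketched directly in NumPy. This is a minimal illustration, not the network used in the experiments; the parameter names (Wf, Rf, bf, ...) are hypothetical labels for the matrices in (9)-(12):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for a single layer, following eqs. (9)-(14).

    p is a dict with (hypothetical) keys: input matrices W*, recurrent
    matrices R* and biases b* for the gates f, i, o and the candidate j.
    """
    f = sigmoid(p["Wf"] @ x + p["Rf"] @ h_prev + p["bf"])  # forget gate, (9)
    i = sigmoid(p["Wi"] @ x + p["Ri"] @ h_prev + p["bi"])  # input gate, (10)
    o = sigmoid(p["Wo"] @ x + p["Ro"] @ h_prev + p["bo"])  # output gate, (11)
    j = np.tanh(p["Wj"] @ x + p["Rj"] @ h_prev + p["bj"])  # candidate, (12)
    c = c_prev * f + j * i        # new cell state, eq. (13)
    h = np.tanh(c) * o            # new hidden unit, eq. (14)
    return h, c

# Tiny usage example with random weights (hidden size 3, input size 2).
rng = np.random.default_rng(1)
p = {k + g: rng.normal(size=(3, 2) if k == "W" else (3, 3))
     for k in ("W", "R") for g in "fioj"}
p.update({"b" + g: np.zeros(3) for g in "fioj"})
h, c = lstm_step(rng.normal(size=2), np.zeros(3), np.zeros(3), p)
```

Note that calling this step with `h_prev` and `c_prev` set to zero is exactly the reset of (21) for the instance being cleared.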
4.2 Memory reset LSTM cell

To limit the recurrent information, the cell state $c_t^l$ and hidden unit $h_t^l$ will be reset with a fixed reset period $T_{reset}$. This assures that only information of the last $T_{reset}$ frames can be used (or $T_{reset} - 1$ frames of context besides the current frame). To achieve this, $K = T_{reset}$ different instances of the cell state and hidden unit are kept, each reset at different moments in time. The instance that will be reset at time $t$ is the instance $k_t^*$ for which

$$k_t^* = t \bmod K, \qquad (20)$$

with $t = 0, \ldots, T-1$. The reset is implemented as

$$\left( \bar{h}_{t-1}^{k,l}, \bar{c}_{t-1}^{k,l} \right) = \begin{cases} (0, 0), & \text{if } k = k_t^* \\ \left( h_{t-1}^{k,l}, c_{t-1}^{k,l} \right), & \text{otherwise,} \end{cases} \qquad (21)$$

with $k = 0, \ldots, K-1$.
Equations (9)-(14) are updated to (22)-(27):

$$f_t^{k,l} = \sigma(W_f^l x_t^{k,l} + R_f^l \bar{h}_{t-1}^{k,l} + b_f), \qquad (22)$$
$$i_t^{k,l} = \sigma(W_i^l x_t^{k,l} + R_i^l \bar{h}_{t-1}^{k,l} + b_i), \qquad (23)$$
$$o_t^{k,l} = \sigma(W_o^l x_t^{k,l} + R_o^l \bar{h}_{t-1}^{k,l} + b_o), \qquad (24)$$
$$j_t^{k,l} = \tanh(W_j^l x_t^{k,l} + R_j^l \bar{h}_{t-1}^{k,l} + b_j), \qquad (25)$$
$$c_t^{k,l} = \bar{c}_{t-1}^{k,l} \odot f_t^{k,l} + j_t^{k,l} \odot i_t^{k,l}, \qquad (26)$$
$$h_t^{k,l} = \tanh(c_t^{k,l}) \odot o_t^{k,l}. \qquad (27)$$

A visualization for $K = 4$ is given in figure 2.

For multi-layer memory reset LSTM-RNNs, instances need input from instances of the layer below. In other words, an equivalent of (15) has to be found for the memory reset LSTM-RNN. We introduce a new variable $\tau_t^{k,l}$, which is equal to how long ago instance $k$ of layer $l$ was last reset (or how many context frames instance $k$ of layer $l$ has considered at time $t$). (35) shows that

$$\tau_t^{k,l} = (t - k) \bmod K. \qquad (28)$$

The value of $\tau_t^{k,l}$ is color coded in figure 3 for $K = 4$. If an instance is colored light green, then $\tau_t^{k,l} = 0$. If an instance is colored dark blue, then $\tau_t^{k,l} = 3$ ($= K - 1$), the maximum value of $\tau_t^{k,l}$. The orange dashed line in figure 3 shows that this way no information further back than $K$ frames can be used. (36) shows that for a unidirectional memory reset LSTM-RNN, this is obtained when instance $k$ receives input from instance $k$ of the layer below. (15) generalizes to

$$x_t^{k,l} = h_t^{k,l-1}. \qquad (29)$$

We introduce a new simplified notation $k' \leftarrow k''$, stating that instance $k'$ of layer $l$ receives input from instance $k''$ of layer $l-1$.
In this notation, (29) becomes $k \leftarrow k$. The final output of the network at time $t$ is the instance with the maximum number of context frames at that time. This is the instance that will be reset at time $t+1$ (see (37)). (16) generalizes to

$$h_t = h_t^{k_{t+1}^*, L}. \qquad (30)$$

The context is restricted at the edges of the data sequence: $\tau_t^{k,l}$ can never exceed $t$. This is not a restriction of the memory reset approach but intrinsic to the data sequence.

Figure 3: Unidirectional memory reset LSTM-RNN with $L = 2$, $K = 4$ and $T = 9$. Color coding is according to (28). An instance is colored red when it is reset ($k = k_t^*$, according to (20)). Connections between layers are according to (29). The final output follows (30). When following the orange dashed line, indicating the data dependencies, backward, one can verify that indeed exactly $K$ frames are used to produce an output.

Figure 4: Similar to figure 3, but for a bidirectional memory reset LSTM-RNN. To prevent cluttering of the image, only the forward direction of the second layer is shown and connections are only drawn for a subset of time steps.

For bidirectional LSTM-RNNs we apply the same equal-context method to find the following connections, replacing (18):

$$\overrightarrow{k} \leftarrow \begin{pmatrix} \overrightarrow{k} \\ (T - 1 - 2t + \overleftarrow{k}) \bmod K \end{pmatrix}, \qquad (31)$$
$$\overleftarrow{k} \leftarrow \begin{pmatrix} (-(T-1) + 2t + \overrightarrow{k}) \bmod K \\ \overleftarrow{k} \end{pmatrix}. \qquad (32)$$

(31)-(32) are derived in (46)-(47) and can be verified in figure 4.

4.3 Layer-dependent reset periods

The reset period $T_{reset}$ need not be the same for every layer. For instance, we could allow the lower layers to operate on short-term information and let the higher layers cope with the long-term dependencies. By connecting an instance with the correct instance of the previous layer at every time $t$, we can still make sure that the number of context frames per layer is limited to a chosen value. (31)-(32) generalize to (52)-(54).

4.4 Grouped memory reset

Until now, a different instance is reset at every time step, such that each instance is reset every $K$ time steps. The number of instances $K$ is therefore equal to the reset period $T_{reset}$, and the computational requirements grow as the reset period (or the memory span) becomes larger. By using the reset operation only every $G$ time steps, an instance will only be reset every $KG$ time steps and thus the reset period becomes $T_{reset} = KG$. The number of instances can then be reduced by a factor $G$ for the same $T_{reset}$. (20) is changed to

$$k_t^* = \begin{cases} (t/G) \bmod K, & \text{if } t \equiv 0 \pmod{G} \\ \text{no reset}, & \text{otherwise.} \end{cases} \qquad (33)$$

A visualization of the grouped memory reset approach is given in figure 5. The downside is that instead of allowing the LSTM to use exactly $T_{reset} = KG$ frames of input, it will use between $KG - (G-1)$ and $KG$ frames of input,
as shown in (65)-(66) (also see the output layer in figure 5). However, this need not be a concern, since the computational problems arise only for a large number of instances, and then $KG \gg G$. A similar approach was taken by El Hihi and Bengio [1996], where frame grouping was used to reduce the computational burden. The derivation of the connections between grouped memory reset LSTM layers is given in B.
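The bookkeeping of (20), (28) and (33) can be simulated in a few lines. The sketch below is an illustration under the conventions above (not the released implementation): it tracks, per instance, the number of frames consumed since its last reset, and takes the output at each step from the instance with the most context:

```python
def reset_schedule(T, K, G=1):
    """Simulate the (grouped) memory reset schedule for one layer.

    T: sequence length, K: number of instances, G: group factor,
    so the reset period is T_reset = K * G.
    Returns the number of input frames feeding the output at each t.
    """
    seen = [0] * K                       # frames consumed since last reset
    used = []
    for t in range(T):
        if t % G == 0:                   # reset only every G steps, eq. (33)
            seen[(t // G) % K] = 0       # zero the state of instance k*_t
        seen = [s + 1 for s in seen]     # every instance processes frame t
        used.append(max(seen))           # output: instance with most context
    return used

# Without grouping (G = 1) the output always uses T_reset = K frames once
# the schedule has warmed up; with grouping it oscillates between
# K*G - (G - 1) and K*G frames, as stated in the text.
plain = reset_schedule(20, K=4)          # T_reset = 4
grouped = reset_schedule(20, K=2, G=2)   # T_reset = 4 as well
```

Running this for $K = 4$, $G = 1$ reproduces the "exactly $K$ frames" behavior of figure 3, while $K = 2$, $G = 2$ reproduces the 3-to-4-frame oscillation of the grouped variant.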
5 Experimental setup

For the MSSS task, mixtures of two speakers were used from the corpus introduced by Hershey et al. [2016]. These mixtures were artificially created by mixing single-speaker utterances from the Wall Street Journal 0 (WSJ0) corpus. A gain for the first speaker compared to the second speaker was randomly chosen between 0 and 5 dB. Utterances were sampled at 8 kHz and the length of the mixture was chosen equal to the shortest utterance in the mixture to maximize the overlap. The training and validation sets contained 20,000 and 5,000 mixtures, respectively, from 101 speakers, while the test set contained 3,000 mixtures from 16 held-out speakers. An STFT with a 32 ms window length and a hop size of 8 ms was used, so the context span is defined as $T_{span} = (T_{reset} - 1) * 8$ ms. Separation performance is measured in signal-to-distortion ratio (SDR) using the bss_eval toolbox [Vincent et al., 2006]. In the experiments we make a distinction between male-female mixtures and same-gender mixtures. The former is regarded as much easier than the latter.

Figure 5: Grouped memory reset LSTM-RNN with $L = 2$, $K = 2$, $G = 2$ and $T = 9$. $T_{reset} = 4$, just as in figure 3. Color coding is according to (56). An instance is colored red when it is reset ($k = k_t^*$, according to (33)).

The memory reset approach was applied to a network of two fully connected bidirectional LSTM-RNN layers, with 600 hidden units each. The reset is applied in both directions of the network, unless stated otherwise. Hidden units of both directions are concatenated before being passed to the next layer, as was expressed by (18). For DC the embedding dimension was chosen as $D = 20$ and, since the frequency dimension was $F = 129$, the total number of output nodes was $DF = 20 * 129 = 2580$.
Curriculum learning was applied by first training the networks on 100-frame segments, before training on the full mixtures [Bengio et al., 2009, Hershey et al., 2016]. The weights and biases were optimized with the Adam learning algorithm [Kingma and Ba, 2014] and early stopping on the validation set was used. The log-magnitudes of the STFT coefficients were used as input features and were mean and variance normalized. Zero-mean Gaussian noise was added to the training features. For every $T_{reset}$, always two networks, with different initializations, were trained and tested to cope with variance in the evaluated performance. When $T_{reset} = \infty$, a regular LSTM-RNN instead of a memory reset LSTM-RNN was used. All networks were trained using TensorFlow [Abadi et al., 2016] and the code for all the experiments can be found online.

For the i-vectors, a UBM and $T$ matrix were trained on the Wall Street Journal 1 (WSJ1) corpus, using the MATLAB MSR Identity Toolbox v1.0 [Sadjadi et al., 2013]. 13-dimensional mel-frequency cepstral coefficients (MFCCs) were used as features, the UBM had 256 Gaussian mixtures and the i-vectors were 10-dimensional, as was done by Zegers and Van hamme [2017]. The i-vectors used in the experiments were obtained from the original single-speaker utterances of WSJ0, but could also be obtained from speech signal reconstructions after source separation, as was done by Zegers and Van hamme [2017]. The former was chosen since it provides a cleaner speaker representation.
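The mapping from reset period to memory time span is then a one-liner. This helper assumes the reconstructed formula $T_{span} = (T_{reset} - 1) \cdot 8$ ms (the original expression is truncated in the source; the 8 ms hop is from the setup above):

```python
def context_span_ms(T_reset, hop_ms=8):
    """Memory time span implied by a reset period of T_reset frames,
    assuming T_span = (T_reset - 1) * hop, with an 8 ms STFT hop."""
    return (T_reset - 1) * hop_ms

# K = 51 ungrouped instances already correspond to a 400 ms span, which
# is consistent with the K > 50 bound mentioned in the results section.
span = context_span_ms(51)
```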
6 Results

We use the memory reset LSTM-RNN to gain insight into the importance of the memory span of the LSTM for the task performance (code: github.com/JeroenZegers/Nabu-MSSS). The first experiment (section 6.1) is solely to verify that indeed for large $T_{reset}$ we can allow $G > 1$ without a loss in performance.
6.1 Grouped memory reset

In figure 6 the separation performance for networks with different memory time spans (without grouping) is given in blue. Notice that no networks were trained without grouping for the larger time spans, as this would require $K > 50$ instances.
In orange, a group factor of $G = 5$ (= 40 ms) was applied. It is clear that a group factor of $G = 5$ can be used for the larger time spans without a loss in performance, since there $KG \gg G$. In the remainder of this paper, no grouping is used for the smaller time spans, while $G = 5$ is used for the larger ones.

Figure 6: Average separation results for networks using different memory time spans. The blue curve uses no grouping ($G = 1$), the orange curve uses a grouping factor of 5 ($G = 5$). Every experiment was performed twice to cope with variance in the evaluated performance.

6.2 Memory time span analysis

Figure 7 shows the average separation performance for same-gender (male-male and female-female) mixtures, with and without the i-vectors of both speakers appended to the input. Figure 8 shows the male-female results for the same networks. Since the results are clearly different, the figures will be discussed separately.

In figure 7 the blue curve quickly rises when the memory span is extended from 0 ms to 400 ms. This effect is also noted for the models where i-vectors were appended to the input (orange curve). Here, the increase in performance cannot be explained by a better speaker characterization, since that information is already present in the i-vectors. Therefore, the increase in performance of the blue curve cannot solely be explained by a better speaker characterization. The separation task seems to take phonetic information (about 100 ms [Gay, 1968, Umeda, 1975]) into account. Features like common onset, common offset and harmonicity, which play a central role in auditory grouping in humans [Bregman, 1994], are compatible with this observed time scale as well. Information spanning several hundreds of milliseconds also seems important. In this range, effects like phonotactics, lexicon and prosody can play a role, but further research is necessary to determine to what extent each of these is individually important for MSSS. In Appeltans et al. [2019] it was found that models trained on one language generalize to some extent to a different language, making it unlikely that lexical information is key for MSSS. If the memory span is heavily restricted, the network finds it difficult to separate speakers with the same gender.
Using i-vectors helps to solve this problem. For larger memory spans the gain from the i-vectors shrinks, indicating that the network can then characterize the speakers itself.

For the male-female mixtures in figure 8, the result at $T_{span} = 0$ ms is already far above the optimal result for same-gender mixtures (figure 7). Instantaneous pitch and formant information seems to achieve most of the effect. Male and female speakers are easily separable, even without any context. Since we are interested in how the LSTM-RNN uses this context, we will only report same-gender separation results in the remainder of this paper. However, we do observe that while there is no significant difference between the models with and without i-vectors for small $T_{span}$, a difference does appear for larger $T_{span}$.

6.3 Layer-dependent memory time spans

The orange curve in figure 9 shows the performance when memory reset is only applied to the first layer of the network. Naturally, performance is better compared to resetting both layers. However, it is interesting to note that optimal performance is already approximately achieved with a memory time span of less than 50 ms. This confirms the hypothesis by Chung et al. [2016] that it is sufficient to allow larger time spans only in the deeper layers to model the higher-level abstractions.
6.4 Forward and backward direction

To our knowledge, there has been very little analysis of the relative importance of the forward direction and backward direction of a bidirectional RNN. Early results for forward-only and backward-only models are given in Schuster and Paliwal [1997] and Graves and Schmidhuber [2005]. Figure 10 shows the difference in performance of a bidirectional LSTM-RNN when memory reset is applied only to the forward direction (blue
curve) compared to only on the backward direction (orange curve).

Figure 10: Average separation results for networks using different memory time spans, evaluated on same-gender mixtures. The blue curve shows the result when memory reset is only applied to the forward direction and the backward direction has no memory restrictions. The orange curve applies memory reset only to the backward direction. The results at infinity, colored in black, use no memory reset.

For the blue curve, the networks evaluated at $T_{span} = 0$ ms essentially correspond to a backward-only RNN. As $T_{span}$ is increased, more forward information is allowed, but the backward direction remains dominant since it has no memory restrictions. It is noted that at $T_{span} = 0$ ms, the backward-only LSTM-RNN slightly outperforms the forward-only LSTM-RNN. This small but consistent difference is kept as the time span of the non-dominant direction increases to 400 ms.

While looking for reasons that could explain this difference, we found that speakers in WSJ0 ended their utterance with "period", often taking a short break before pronouncing it (for instance, in the 4th CHiME challenge [Vincent et al., 2016] this part is removed from the utterance). This leads to some asymmetry in the speech activity, as is shown in Figure 11, when measuring with a voice activity detector (VAD) [Tan and Lindberg, 2010]. After combining single-speaker utterances into mixtures, this leads to less overlapping speech near the end of the mixture and might explain the difference we observe in Figure 10. Furthermore, as "period" is pronounced at the end of every utterance, it might behave as a prompt for text-dependent speaker recognition [Variani et al., 2014]. To exclude these unwanted effects, the
LibriSpeech (LS) dataset [Panayotov et al., 2015], which does not contain verbal punctuation, was used to artificially create mixtures (this dataset has also been used by other papers for MSSS [Stephenson et al., 2017, Mobin et al., 2018]). To ensure symmetry in speech activity, leading and trailing silence in the single-speaker utterances was cut (see Figure 12, bottom). The forward-backward experiment was repeated on the newly created dataset and the results are shown in Figure 13. We see a similar trend as in Figure 10 and retain our conclusion that the backward direction is slightly more important than the forward direction for MSSS.

However, the question of what causes this difference remains unanswered. It seems to
suggest that cues in speech for MSSS are partly asymmetric. It has been found that voice onset time (VOT) is a predictive cue for post-aspiration [Klatt, 1975, Lisker and Abramson, 1967], while similar conclusions have been drawn for voice offset time (VoffT) [Singh et al., 2016b, Pind, 1996]. Furthermore, it has also been observed that there is an acoustical asymmetry in vowel production [Patel et al., 2017]. Finally, reverberation could also play a role in a realistic cocktail party scenario, but this is expected not to be relevant in our experiments, considering the recording setups for WSJ0 and LS. We leave it to further research to indicate to what extent these asymmetric cues help in MSSS.

Figure 12: Speech activity percentage of clean LS utterances when the audio length is normalized to 1, for the original utterances (top) and when cutting leading and trailing silence (bottom).

Figure 13: Similar to Figure 10, but on the LS mixture dataset.

Comparing the result without memory restrictions (black colored results at $T_{span} = \infty$) with the backward-constrained results (orange curve) gives an indication of the SDR drop when only limited backward data is available. This can be relevant for near real-time applications with limited allowed delay. We notice that online implementations with limited delay (shorter than 100 ms) lose roughly 1.5 dB in SDR for same-gender mixtures, compared to offline implementations.

7 Conclusion

A memory reset approach was developed and applied to an LSTM-RNN to find the importance of different time spans for the task of MSSS. Short-term linguistic processes (time spans shorter than 100 ms) have a strong impact on the separation performance.
Above 400ms, the network can only learn better speaker characterization, and other effects like grammar are not considered by the LSTM. Furthermore, the reset method allowed us to verify that performance-wise it is sufficient to implement longer memory in deeper layers. Finally, we found that the backward direction is slightly more important than the forward direction for a bidirectional LSTM-RNN.

The next step of this research would be to use the insights we have gained to adapt the architecture of the (LSTM-)RNN. We would like to encourage other researchers to apply a similar timing analysis for RNNs in their field, either with the leaky approach (straightforward implementation, but no exact timings) or the memory reset or segment approach (less trivial implementation with a higher computational burden, but assuring exact timings). Moreover, these methods allow one to assess the memory implications on the RNN for a certain subtask. For our task, the importance of speaker characterization was determined by comparing results with and without adding oracle i-vectors to the input of the network. This technique is generalizable to other tasks. For instance, in language modeling for French, the gender of the subject must be remembered, possibly over many words, to conjugate the perfect tense accordingly. An oracle binary input (male/female) depending on the gender of the relevant subject could be provided. Comparing results with and without this additional binary input could give an idea of the importance of this subtask on the memory of the RNN.
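The reset scheduling that underlies this analysis can be summarized in a few lines. The sketch below is a hypothetical, simplified illustration (the variable names and helper function are ours, not taken from the paper's implementation) of one unidirectional layer with reset period K: K parallel instances are kept, instance k is reset at every time step t = k + αK, and at each time step the output is read from the instance with the maximal allowed context of K − 1 frames.

```python
# Hypothetical sketch (not the authors' code) of the memory-reset
# scheduling for one unidirectional layer: K parallel instances are kept,
# and instance k has its LSTM state zeroed at every time step
# t = k + alpha*K, so no instance ever sees more than K-1 past frames.

K = 4   # reset period, i.e. the allowed memory time span in frames
T = 12  # sequence length in frames

def context_frames(t, k, K):
    """Number of frames since instance k was last reset (cf. Eq. 35)."""
    return (t - k) % K

for t in range(T):
    reset_instance = t % K         # instance reset at time t (cf. Eq. 20)
    output_instance = (t + 1) % K  # instance with maximal context (cf. Eq. 37)
    assert context_frames(t, reset_instance, K) == 0
    assert context_frames(t, output_instance, K) == K - 1
    # Per Eq. (36), instance k of layer l reads from instance k of
    # layer l-1, since both hold exactly the same number of context frames.
```

With a typical 10 ms frame shift, K = 4 would correspond to an allowed memory time span of roughly 40 ms; sweeping K then traces out performance as a function of the allowed time span.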
A Inter-layer connections
A.1 Unidirectional reset LSTM-RNN
Given (20), instance $k$ of layer $l$ will be reset at times $t_{k,l}$ given by
$$t_{k,l} = k + \alpha K, \qquad (34)$$
with $\alpha$ a natural number. Therefore, the number of time steps $\tau_t^{k,l}$ between time $t$ and the last time instance $k$ was reset before time $t$ is given by
$$\tau_t^{k,l} = (t - t_{k,l}) \bmod K = (t - k - \alpha K) \bmod K = (t - k) \bmod K. \qquad (35)$$
At time $t$, instance $k$ of layer $l$ contains $\tau_t^{k,l}$ frames of context. We would like this instance to receive input from an instance of the layer below with the same number of context frames $\tau_t^{k,l}$. In other words, we would like to find the instance that was reset at time $t - \tau_t^{k,l}$ in layer $l - 1$. Using (20) we find this to be
$$k^{*,l-1}_{t - \tau_t^{k,l}} = (t - \tau_t^{k,l}) \bmod K = (t - (t - k) \bmod K) \bmod K = k \bmod K = k. \qquad (36)$$
This simply means that instance $k$ from layer $l$ should receive input from instance $k$ from layer $l - 1$, or in simplified notation $k \leftarrow k$. Finally, the last layer of the LSTM-RNN should output a single instance which will be the final output of the network. We choose this to be the instance with the maximum number of context frames $\tau_t^{max,L} = K - 1$. This means that the instance was reset at $t - \tau_t^{max,L} = t - K + 1$. Thus the instance to select is
$$k^{*,L}_{t-K+1} = (t - K + 1) \bmod K = (t + 1) \bmod K = k^{*,L}_{t+1}. \qquad (37)$$

A.2 Bidirectional reset LSTM-RNN
For bidirectional LSTM-RNNs, we take a similar approach. Equivalent to (20), (34) and (35), we define
$$\overrightarrow{k}^{*,l}_t = t \bmod K_l, \qquad (38)$$
$$t_{\overrightarrow{k},l} = \overrightarrow{k} + \alpha K_l, \qquad (39)$$
$$\tau_t^{\overrightarrow{k},l} = (t - \overrightarrow{k}) \bmod K_l. \qquad (40)$$
For the backward direction we find
$$\overleftarrow{k}^{*,l}_t = ((T - 1) - t) \bmod K_l, \qquad (41)$$
$$t_{\overleftarrow{k},l} = (T - 1) - (\overleftarrow{k} + \alpha K_l), \qquad (42)$$
$$\tau_t^{\overleftarrow{k},l} = (t_{\overleftarrow{k},l} - t) \bmod K_l = ((T - 1) - t - \overleftarrow{k} - \alpha K_l) \bmod K_l = ((T - 1) - t - \overleftarrow{k}) \bmod K_l. \qquad (43)$$
Again, we want an instance to receive input from an instance in the layer below with the same number of context frames. For the instances in the forward direction these are $\overrightarrow{k}^{*,l-1}_{t - \tau_t^{\overrightarrow{k},l}}$ and $\overleftarrow{k}^{*,l-1}_{t + \tau_t^{\overrightarrow{k},l}}$. When we use the same reset period for all layers ($K = K_l = K_{l-1}$), these are
$$\overrightarrow{k}^{*,l-1}_{t - \tau_t^{\overrightarrow{k},l}} = (t - \tau_t^{\overrightarrow{k},l}) \bmod K = (t - (t - \overrightarrow{k}) \bmod K) \bmod K = \overrightarrow{k} \bmod K = \overrightarrow{k} \qquad (44)$$
and
$$\overleftarrow{k}^{*,l-1}_{t + \tau_t^{\overrightarrow{k},l}} = ((T - 1) - t - \tau_t^{\overrightarrow{k},l}) \bmod K = ((T - 1) - t - (t - \overrightarrow{k}) \bmod K) \bmod K = ((T - 1) - 2t + \overrightarrow{k}) \bmod K, \qquad (45)$$
respectively. As per (18), in simplified notation this becomes
$$\overrightarrow{k} \leftarrow \begin{pmatrix} \overrightarrow{k} \\ ((T - 1) - 2t + \overrightarrow{k}) \bmod K \end{pmatrix}. \qquad (46)$$
Similarly, for the backward direction we find the inputs to be
$$\overleftarrow{k} \leftarrow \begin{pmatrix} (-(T - 1) + 2t + \overleftarrow{k}) \bmod K \\ \overleftarrow{k} \end{pmatrix}. \qquad (47)$$
As per (19), the final output of the network is a concatenation of the output of both directions of the last layer. We again choose these to be the instances with the maximum number of context frames $\tau_t^{\overrightarrow{max},L} = K - 1$ and $\tau_t^{\overleftarrow{max},L} = K - 1$, which were reset at $t - \tau_t^{\overrightarrow{max},L} = t - K + 1$ and $t + \tau_t^{\overleftarrow{max},L} = t + K - 1$. Thus the corresponding instances to select are
$$\overrightarrow{k}^{*,L}_{t-K+1} = (t - K + 1) \bmod K = (t + 1) \bmod K = \overrightarrow{k}^{*,L}_{t+1} \qquad (48)$$
and
$$\overleftarrow{k}^{*,L}_{t+K-1} = ((T - 1) - (t + K - 1)) \bmod K = ((T - 1) - (t - 1)) \bmod K = \overleftarrow{k}^{*,L}_{t-1}. \qquad (49)$$
Thus (19) generalizes to
$$h_t = \begin{pmatrix} \overrightarrow{h}_{\overrightarrow{k}^{*,L}_{t+1},L,t} \\ \overleftarrow{h}_{\overleftarrow{k}^{*,L}_{t-1},L,t} \end{pmatrix}. \qquad (50)$$

A.3 Layer dependent reset period

If $K_l$ and $K_{l-1}$ are different, we would still like an instance to receive input from the layer below with the same number of context frames. However, this is not possible when the number of context frames exceeds the maximum number of context frames in the layer below (bounded by $K_{l-1} - 1$). Therefore a clipped $\bar{\tau}$ is introduced, which is defined as
$$\bar{\tau}_t^{\overrightarrow{k},l} = \min(\tau_t^{\overrightarrow{k},l}, K_{l-1} - 1). \qquad (51)$$
When replacing $\tau_t^{\overrightarrow{k},l}$ with $\bar{\tau}_t^{\overrightarrow{k},l}$ in (44) and (45), we get
$$\overrightarrow{k} \leftarrow \begin{pmatrix} (t - \bar{\tau}_t^{\overrightarrow{k},l}) \bmod K_{l-1} \\ ((T - 1) - t - \bar{\tau}_t^{\overrightarrow{k},l}) \bmod K_{l-1} \end{pmatrix}. \qquad (52)$$
Similarly, for the backward direction we define
$$\bar{\tau}_t^{\overleftarrow{k},l} = \min(\tau_t^{\overleftarrow{k},l}, K_{l-1} - 1), \qquad (53)$$
$$\overleftarrow{k} \leftarrow \begin{pmatrix} (t - \bar{\tau}_t^{\overleftarrow{k},l}) \bmod K_{l-1} \\ ((T - 1) - t - \bar{\tau}_t^{\overleftarrow{k},l}) \bmod K_{l-1} \end{pmatrix}, \qquad (54)$$
with the constraint $K_l \geq K_{l-1}$ (otherwise, layer $l$ would be allowed less context than layer $l - 1$, and instances with $\tau_t^{k,l-1} > K_l - 1$ would have to be clipped to $K_l - 1$).

B Grouped inter-layer connections
Using (33), instance $k$ of layer $l$ will be reset at times $t_{k,l}$ given by
$$t_{k,l} = (k + \alpha K_l) G_l. \qquad (55)$$
Therefore, the number of time steps $\tau_t^{k,l}$ between time $t$ and the last time instance $k$ was reset before time $t$ is given by
$$\tau_t^{k,l} = (t - t_{k,l}) \bmod T^l_{reset} = (t - kG_l - \alpha K_l G_l) \bmod K_l G_l = (t - kG_l) \bmod K_l G_l. \qquad (56)$$
As before, we would like an instance to receive input from an instance in the layer below with the same number of context frames. However, this cannot be guaranteed as $\tau_t^{k,l}$ increases with steps of $G_l$. For the forward input we therefore select the first instance in layer $l - 1$ that was reset at time $t - \tau_t^{k,l}$ or afterwards. (55) shows that resets happen at multiples of $G_{l-1}$. Therefore the requested reset in the forward direction will happen at time
$$\overrightarrow{\gamma}_t^{\overrightarrow{k},l} = \left\lceil \frac{t - \bar{\tau}_t^{\overrightarrow{k},l}}{G_{l-1}} \right\rceil G_{l-1}, \qquad (57)$$
where $\bar{\tau}$ is defined as
$$\bar{\tau}_t^{\overrightarrow{k},l} = \min(\tau_t^{\overrightarrow{k},l}, K_{l-1} G_{l-1} - 1). \qquad (58)$$
Using (33), we find that the instance in the forward direction in layer $l - 1$ that was reset at time $\overrightarrow{\gamma}_t^{\overrightarrow{k},l}$ is given by
$$\overrightarrow{k}^{*,l-1}_{\overrightarrow{\gamma}_t^{\overrightarrow{k},l}} = \frac{\overrightarrow{\gamma}_t^{\overrightarrow{k},l}}{G_{l-1}} \bmod K_{l-1} = \left\lceil \frac{t - \bar{\tau}_t^{\overrightarrow{k},l}}{G_{l-1}} \right\rceil \bmod K_{l-1}. \qquad (59)$$
In the backward direction (33) is changed to
$$\overleftarrow{k}^{*,l}_t = \left( \frac{(T - 1) - t}{G_l} \right) \bmod K_l \quad \text{if } (T - 1) - t \equiv 0 \pmod{G_l}. \qquad (60)$$
The requested reset in the backward direction will happen at time
$$\overleftarrow{\gamma}_t^{\overrightarrow{k},l} = (T - 1) - \left\lceil \frac{(T - 1) - (t + \bar{\tau}_t^{\overrightarrow{k},l})}{G_{l-1}} \right\rceil G_{l-1}. \qquad (61)$$
Combining (60) and (61) gives the instance in the backward direction in layer $l - 1$ that was reset at time $\overleftarrow{\gamma}_t^{\overrightarrow{k},l}$:
$$\overleftarrow{k}^{*,l-1}_{\overleftarrow{\gamma}_t^{\overrightarrow{k},l}} = \frac{(T - 1) - \overleftarrow{\gamma}_t^{\overrightarrow{k},l}}{G_{l-1}} \bmod K_{l-1} = \left\lceil \frac{(T - 1) - (t + \bar{\tau}_t^{\overrightarrow{k},l})}{G_{l-1}} \right\rceil \bmod K_{l-1}. \qquad (62)$$
In shorthand notation this becomes
$$\overrightarrow{k} \leftarrow \begin{pmatrix} \left\lceil \frac{t - \bar{\tau}_t^{\overrightarrow{k},l}}{G_{l-1}} \right\rceil \bmod K_{l-1} \\ \left\lceil \frac{(T - 1) - t - \bar{\tau}_t^{\overrightarrow{k},l}}{G_{l-1}} \right\rceil \bmod K_{l-1} \end{pmatrix}. \qquad (63)$$
Similarly, for the backward direction we find the inputs to be
$$\overleftarrow{k} \leftarrow \begin{pmatrix} \left\lceil \frac{t - \bar{\tau}_t^{\overleftarrow{k},l}}{G_{l-1}} \right\rceil \bmod K_{l-1} \\ \left\lceil \frac{(T - 1) - t - \bar{\tau}_t^{\overleftarrow{k},l}}{G_{l-1}} \right\rceil \bmod K_{l-1} \end{pmatrix}. \qquad (64)$$
Finally, we would like to find the final output of the network, a generalization of (50). Ideally, we would like to select the instances with $T^L_{reset} - 1 = K_L G_L - 1$ context frames, but this is not always possible since $\tau_t^{\overrightarrow{k},L}$ and $\tau_t^{\overleftarrow{k},L}$ increase with steps of $G_L$. Instead we will be looking for times $\overrightarrow{\gamma}_t^{\overrightarrow{max},L}$ and $\overleftarrow{\gamma}_t^{\overleftarrow{max},L}$, defined as
$$\overrightarrow{\gamma}_t^{\overrightarrow{max},L} = \left\lceil \frac{t - (K_L G_L - 1)}{G_L} \right\rceil G_L \qquad (65)$$
and
$$\overleftarrow{\gamma}_t^{\overleftarrow{max},L} = (T - 1) - \left( \left\lceil \frac{(T - 1) - (t + (K_L G_L - 1))}{G_L} \right\rceil G_L \right). \qquad (66)$$
The corresponding instances are thus
$$\overrightarrow{k}^{*,L}_{\overrightarrow{\gamma}_t^{\overrightarrow{max},L}} = \frac{\overrightarrow{\gamma}_t^{\overrightarrow{max},L}}{G_L} \bmod K_L = \left\lceil \frac{t - (K_L G_L - 1)}{G_L} \right\rceil \bmod K_L \qquad (67)$$
and
$$\overleftarrow{k}^{*,L}_{\overleftarrow{\gamma}_t^{\overleftarrow{max},L}} = \left( \frac{(T - 1) - \overleftarrow{\gamma}_t^{\overleftarrow{max},L}}{G_L} \right) \bmod K_L = \left\lceil \frac{(T - 1) - (t + (K_L G_L - 1))}{G_L} \right\rceil \bmod K_L. \qquad (68)$$

Funding
This research was funded by Research Foundation Flanders with grant number 1S66217N.
References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

T. Alpay, S. Heinrich, and S. Wermter. Learning multiple timescales in recurrent neural networks. In International Conference on Artificial Neural Networks, pages 132–139. Springer, 2016.

P. Appeltans, J. Zegers, and H. Van hamme. Practical applicability of deep neural networks for overlapping speaker separation. In Interspeech 2019, pages 1353–1357. ISCA, 2019.

D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945–4949. IEEE, 2016. doi: 10.1109/icassp.2016.7472618.

Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.

A. S. Bregman. Auditory scene analysis: The perceptual organization of sound. MIT Press, 1994.

C. Chelba, M. Norouzi, and S. Bengio. N-gram language modeling using recurrent neural network estimation. Technical report, Google, 2017. URL https://arxiv.org/abs/1703.10724.

J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.

C. Colah. Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2015. [Online; accessed 3-July-2019].

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 19(4):788–798, 2011.

L. Drude, T. von Neumann, and R. Haeb-Umbach. Deep attractor networks for speaker re-identification and blind source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11–15. IEEE, 2018.

S. El Hihi and Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems, pages 493–499, 1996.

T. Gay. Effect of speaking rate on diphthong formant movements. The Journal of the Acoustical Society of America, 44(6):1570–1573, 1968.

O. Glembek, L. Burget, P. Matějka, M. Karafiát, and P. Kenny. Simplification and optimization of i-vector extraction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4516–4519, 2011.

A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.

A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2016.

G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, and T. Sainath. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, November 2012.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. F. Kolen and S. C. Kremer, editors, A Field Guide to Dynamical Recurrent Neural Networks, chapter 14, pages 237–244. IEEE Press, 2001.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

D. H. Klatt. Voice onset time, frication, and aspiration in word-initial consonant clusters. Journal of Speech and Hearing Research, 18(4):686–706, 1975.

M. Kolbæk, D. Yu, Z. Tan, and J. Jensen. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25(10):1901–1913, 2017.

J. Koutník, K. Greff, F. Gomez, and J. Schmidhuber. A clockwork RNN. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1863–1871, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/koutnik14.html.

L. Lisker and A. S. Abramson. Some effects of context on voice onset time in English stops. Language and Speech, 10(1):1–28, 1967. ISSN 0023-8309.

J. L. Miller, F. Grosjean, and C. Lomanto. Articulation rate and its variability in spontaneous speech: A reanalysis and some implications. Phonetica, 41(4):215–225, 1984.

S. Mobin, B. Cheung, and B. Olshausen. Convolutional vs. recurrent neural networks for audio source separation. 2018.

A.-r. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stoicke, G. Zweig, and G. Penn. Deep bi-directional recurrent networks over spectral windows. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pages 78–83. IEEE, 2015.

M. C. Mozer. Induction of multiscale temporal structure. In Advances in Neural Information Processing Systems, pages 275–282, 1992.

D. Neil, M. Pfeiffer, and S.-C. Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pages 3882–3890, 2016.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.

R. R. Patel, K. Forrest, and D. Hedges. Relationship between acoustic voice onset and offset and selected instances of oscillatory onset and offset in young healthy men and women. Journal of Voice, 31(3):389.e9–389.e17, 2017.

J. Pind. Rate dependent perception of aspiration and pre-aspiration in Icelandic. The Quarterly Journal of Experimental Psychology: Section A, 49(3):745–764, 1996.

S. O. Sadjadi, M. Slaney, and L. Heck. MSR identity toolbox v1.0: A MATLAB toolbox for speaker-recognition research. Speech and Language Processing Technical Committee Newsletter, 1(4), November 2013. URL http://research.microsoft.com/apps/pubs/default.aspx?id=205119.

M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 802–810. Curran Associates, Inc., 2015.

B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Computer Vision and Pattern Recognition (CVPR), pages 1961–1970. IEEE, 2016a.

R. Singh, J. Keshet, D. Gencaga, and B. Raj. The relationship of voice onset time and voice offset time to physical age. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 2016-, pages 5390–5394. IEEE, 2016b. ISBN 9781479999880.

C. Stephenson, P. Callier, A. Ganesh, and K. Ni. Monaural audio speaker separation with source contrastive estimation. arXiv preprint arXiv:1705.04662, 2017.

M. Sundermeyer, H. Ney, and R. Schlüter. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 23(3):517–529, 2015. doi: 10.1109/taslp.2015.2400218.

Z.-H. Tan and B. Lindberg. Low-complexity variable frame rate analysis for speech recognition and voice activity detection. IEEE Journal of Selected Topics in Signal Processing, 4(5):798–807, 2010.

Z. Tüske, R. Schlüter, and H. Ney. Investigation on LSTM recurrent n-gram language models for speech recognition. In Interspeech 2018, pages 3358–3362. ISCA, 2018.

N. Umeda. Vowel duration in American English. The Journal of the Acoustical Society of America, 58(2):434–445, 1975.

E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4052–4056. IEEE, 2014.

E. Vincent, R. Gribonval, and C. Févotte. Performance measurement in blind audio source separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 14(4):1462–1469, 2006.

E. Vincent, S. Watanabe, J. Barker, and R. Marxer. The 4th CHiME speech separation and recognition challenge. URL: http://spandh.dcs.shef.ac.uk/chime_challenge {Last Accessed on 1 August, 2018}, 2016.

J. Zegers and H. Van hamme. Improving source separation via multi-speaker representations. In Interspeech 2017, pages 1919–1923. ISCA, 2017.

J. Zegers and H. Van hamme. Memory time span in LSTMs for multi-speaker source separation. In Interspeech 2018, pages 1477–1481. ISCA, 2018.