Confusion2vec 2.0: Enriching Ambiguous Spoken Language Representations with Subwords
Prashanth Gurunath Shivakumar, Panayiotis Georgiou, Shrikanth Narayanan
University of Southern California, Los Angeles, California, USA
ABSTRACT
Word vector representations enable machines to encode human language for spoken language understanding and processing. Confusion2vec, motivated by human speech production and perception, is a word vector representation that encodes the ambiguities present in human spoken language in addition to semantic and syntactic information. Confusion2vec provides a robust spoken language representation by considering inherent human language ambiguities. In this paper, we propose a novel word vector space estimation by unsupervised learning on lattices output by an automatic speech recognition (ASR) system. We encode each word in the confusion2vec vector space by its constituent subword character n-grams. We show that the subword encoding helps better represent the acoustic perceptual ambiguities in human spoken language via information modeled on lattice-structured ASR output. The usefulness of the proposed Confusion2vec representation is evaluated using semantic, syntactic and acoustic analogy and word similarity tasks. We also show the benefits of subword modeling for acoustic ambiguity representation on the task of spoken language intent detection. The results significantly outperform existing word vector representations when evaluated on erroneous ASR outputs. We demonstrate that Confusion2vec subword modeling eliminates the need for retraining/adapting the natural language understanding models on ASR transcripts.
Index Terms — Confusion2Vec, Subword, Word Vector Representation, Word Embedding, Spoken Language Understanding
1. INTRODUCTION
Speech is the primary and most natural mode of communication for humans. This makes its use also attractive for human-computer interaction, which in turn requires decoding human language to enable spoken language understanding. Human language is a complex construct involving multiple dimensions of information, including semantics and syntax, and it often contains ambiguities which make it challenging for machine inference of communication intent, emotions, etc. Several word vector representations have been proposed in the natural language processing community for effectively describing human language.

Contextual modeling techniques like language modeling, i.e., predicting the next word in the sentence given a window of preceding context, have been shown to model meaningful word representations [2, 18]. Bag-of-words based contextual modeling, where the current word is predicted given both its left and right (local) contexts, has been shown to capture language semantics and syntax [19]. Similarly, predicting the local context from the current word, referred to as skip-gram modeling, is shown to better represent semantic and syntactic distances between words [20]. In [21], log bi-linear models combining global word co-occurrence information and local context information, termed global vectors (GloVe), are shown to produce a meaningful, structured vector space. Bi-directional language models are proposed in [22], where internal states of deep neural networks are combined to model complex characteristics of word use and its variance over linguistic contexts. The advantages of bi-directional modeling are further exploited along with self-attention using transformer networks [31] to estimate a representation, termed BERT (Bidirectional Encoder Representations from Transformers), that has shown its utility on a multitude of natural language understanding tasks [6]. Models such as BERT and ELMo estimate word representations that vary depending on the context, whereas context-free representations, including GloVe and Word2Vec, generate a single representation irrespective of the context.

However, most word vector representations infer knowledge through contextual modeling, and many of the inherent ambiguities present in human language are often unrecognized or ignored. For instance, from the perspective of spoken language, the ambiguities can be associated with how similar the words sound: for example, the words "see" and "sea" sound acoustically identical but have different meanings. The ambiguities can also be associated with the underlying speech signal itself, due to the wide range of acoustic environments involving noise, overlapped speech, and channel and room characteristics. These ambiguities often project themselves as errors through ASR systems. Most existing word vector representations, such as word2vec [20, 19], fastText [3], GloVe [21], BERT [6] and ELMo [22], do not account for the ambiguities present in speech signals and thus degrade while processing the output of noisy ASR transcripts.

Confusion2vec was recently proposed to represent the ambiguity information present in human language [26]. Confusion2vec is estimated by unsupervised skip-gram training on ASR output lattices and confusion networks. Analysis of the inherent acoustic ambiguity information of the embeddings displayed meaningful interactions between the semantic-syntactic subspace and the acoustic similarity subspace.
In [27], the usefulness of Confusion2vec was confirmed on the task of spoken language intent detection. The Confusion2vec representation significantly outperformed typical word embeddings, including word2vec and GloVe, when evaluated on noisy ASR transcripts, reducing the classification error rate by approximately 20% relative.

Although there have been a few attempts at leveraging the information present in word lattices and word confusion networks for several tasks [29, 14, 30, 33, 28, 13], the main downside of these works is that the word representations estimated by such techniques are task dependent and are restricted to a particular domain and dataset. Moreover, the availability of most task-specific datasets is limited, and task-specific speech data are expensive to collect. The advantage of Confusion2Vec is that it estimates a generic, task-independent word vector representation via unsupervised learning on lattices or confusion networks generated by an ASR on any speech conversations.

In this paper, we incorporate subwords to represent each word, for modeling both the acoustic ambiguity information and the contextual information. Each word is modeled as a sum of its constituent character n-grams. Our motivations behind the use of subwords are the following: (i) they incorporate morphological information of the words by encoding the internal structure of words [3]; (ii) the bags of character n-grams of acoustically ambiguous words often have a high overlap; (iii) subwords help model under-represented words more efficiently, thereby leading to more robust estimation with limited available data, which is the case since training Confusion2Vec is restricted to ASR lattice outputs; (iv) subwords enable representations for out-of-vocabulary words, which are commonplace with end-to-end ASR systems outputting characters.

The rest of the paper is organized as follows: Confusion2vec is introduced in Section 2. The proposed subword modeling is presented in Section 3. Section 4 gives details of the evaluation techniques employed for assessing the word embedding models. The experimental setup and results of various analogy and similarity tasks are presented in Section 5. Section 6 presents the application of the proposed word vector representation to the spoken language intent detection task. Finally, the paper is concluded in Section 7.
2. CONFUSION2VEC
In psycho-acoustics, it is established that humans also relate words by how they sound [1], in addition to semantics and syntax. Inspired by principles of human speech production and perception, we previously proposed Confusion2vec [26]. The core idea is to estimate a hyper-space that not only captures the semantics and syntax of human language, but also augments the vector space with acoustic ambiguity information, i.e., word acoustic similarity information. In other words, word2vec and GloVe can be viewed as a subspace of the confusion2vec vector space.

Several different methodologies are proposed for capturing the ambiguity information. The methodologies are adaptations of skip-gram modeling for word confusion networks or lattice-like structures. Word lattices are directed acyclic weighted graphs of all the word sequences that are likely possible. A confusion network is a specific type of lattice with the constraint that each word sequence passes through each node of the graph. Such lattice-like structures can be derived from machine learning algorithms that output probability measures, for example an ASR. Figure 1 illustrates a confusion network that can possibly result from a speech recognition system. Unlike the typical simple sentences used for training word embeddings like word2vec, GloVe, BERT, ELMo, etc., the information in the confusion network can be viewed along two dimensions: (i) a contextual dimension, and (ii) an acoustic ambiguity dimension.

More specifically, four different configurations of skip-gram modeling algorithms are proposed in our previous work [26], namely: (i) top-confusion, (ii) intra-confusion, (iii) inter-confusion, and (iv) hybrid model. The top-confusion version considers only the most-probable path of the ASR confusion network and applies the typical skip-gram model on it. The intra-confusion version applies the skip-gram modeling on the acoustic ambiguity dimension of the confusion network and ignores the contextual information, i.e., each ambiguous word alternative is predicted from the others over a pre-defined local context. The inter-confusion version applies the skip-gram modeling on the contextual dimension, but over each of the acoustically ambiguous words. The hybrid model is a combination of both the intra- and inter-confusion configurations. More information on the training configurations is available in [26]. The present work builds upon this basic Confusion2vec framework.
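To make the two dimensions concrete, the following minimal Python sketch (our illustration, not the paper's released implementation; the toy confusion network mirrors Figure 1) generates intra-confusion and inter-confusion skip-gram training pairs from a confusion network represented as a list of time slots:

# A confusion network as a list of time slots; each slot holds the
# ambiguous word alternatives output by the ASR (cf. Figure 1).
confusion_net = [
    ["i", "eye"],
    ["want", "wand", "won't", "what"],
    ["two", "tees", "to"],
    ["seat", "sit", "seed", "eat"],
]

def intra_pairs(net):
    """Skip-gram pairs along the acoustic ambiguity dimension only
    (each alternative predicts the others in the same time slot)."""
    for slot in net:
        for w in slot:
            for alt in slot:
                if alt != w:
                    yield (w, alt)

def inter_pairs(net, window=2):
    """Skip-gram pairs along the contextual dimension, expanded over
    every ambiguous alternative in the neighbouring slots (the paper
    samples the window size in [1, 5]; it is fixed here for brevity)."""
    for t, slot in enumerate(net):
        for w in slot:
            for c in range(max(0, t - window), min(len(net), t + window + 1)):
                if c == t:
                    continue
                for ctx in net[c]:
                    yield (w, ctx)

print(list(intra_pairs(confusion_net))[:4])
# [('i', 'eye'), ('eye', 'i'), ('want', 'wand'), ('want', "won't")]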
3. CONFUSION2VEC 2.0 SUBWORD MODEL
Subword encoding of words has been popular in modeling the semantics and syntax of language using word vector representations [3, 6, 22]. The use of subwords is mainly motivated by the fact that subwords incorporate morphological information, which can be helpful, for example, in relating prefixes, suffixes and the word root. In this work, we apply subword representations for encoding the word ambiguity information in human language. We believe we have a compelling case for the use of subwords for representing the acoustic similarities (ambiguities) between the words in the language, since more similarly sounding words often have highly overlapping subword representations. This helps model the level of overlap and estimate the magnitude of acoustic similarity robustly. Moreover, the use of subwords should help in efficient encoding of under-represented words in the language. This is crucial in the case of Confusion2vec because we are restricted to speech data and the corresponding decoded ASR lattices for training, thereby limiting word-word co-occurrences, in contrast to typical word vector representations which can be trained on large amounts of easily available plain text data. Another important aspect is the ability to represent out-of-vocabulary words, which are a commonplace occurrence with end-to-end ASR systems outputting character sequences.

Fig. 1: Example confusion network output by ASR for the ground-truth phrase "I want to sit". (The figure shows time slots t-1 through t+2, each containing ambiguous word alternatives w_{t,1}, w_{t,2}, ....)

In the proposed model, each word w is represented as a sum of its constituent n-gram character subwords. This enables the model to infer the internal structure of each word. For example, the word "want" is represented with the vector sum of the following subwords:

<wa, wan, ant, nt>,  <wan, want, ant>,  <want, want>,  <want>

The symbols < and > are used to represent the beginning and end of the word. The n-grams are generated for n=3 up to n=6. It is apparent that an acoustically ambiguous, similar-sounding word "wand" has a high degree of overlap with this set of n-gram characters.
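The n-gram extraction itself is straightforward. A minimal Python sketch (function name ours) that reproduces the decomposition of "want" above and inspects its overlap with "wand":

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, padded with the boundary
    markers '<' and '>'; the full padded word is also kept as one more
    subword, mirroring the fastText-style scheme described above."""
    padded = "<" + word + ">"
    grams = {padded[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)}
    grams.add(padded)
    return grams

print(sorted(char_ngrams("want")))
# ['<wa', '<wan', '<want', '<want>', 'ant', 'ant>', 'nt>', 'wan', 'want', 'want>']
print(char_ngrams("want") & char_ngrams("wand"))
# shared subwords with the confusable word: {'<wa', 'wan', '<wan'}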
In this paper, we consider two modeling variations: (i) inter-confusion, and (ii) intra-confusion versions of confusion2vec with the subword encoding.

The goal of the intra-confusion model is to estimate the inter-word relations between the acoustically ambiguous words that appear in the ASR lattices. For this, we perform skip-gram modeling over the acoustic similarity dimension (see Figure 1) and ignore the contextual dimension of the utterance. The objective of the intra-confusion model is to maximize the following log-likelihood:

    \sum_{t=1}^{T} \sum_{\hat{a} \in \hat{A}_t} \sum_{a \in A_t} \log p(w_{t,a} \mid w_{t,\hat{a}})        (1)

where T is the length of the utterance (confusion network) in terms of the number of words, and w_{i,j} is the word in the confusion network output by the ASR at time-step i, with j the index of the word among the ambiguous alternatives. \hat{A}_t is the set of indices of all ambiguous words at time-step t, \hat{a} is the index of the current word along the acoustic ambiguity dimension, and A_t \subseteq \hat{A}_t \setminus \{\hat{a}\} is the subset of ambiguous words barring \hat{a} at the current time-step t. For example, in Figure 1, for the current word w_{t,\hat{a}} = "want", A_t \subseteq {wand, won't, what}. Additionally, for subword encoding, each word input is represented as:

    w_{i,j} = \sum_{s \in S_w} x_s        (2)

where S_w is the set of all character n-grams ranging from n=3 to n=6 together with the word itself, and x_s is the vector representation of the n-gram subword s. A few training samples (input, target) generated by this configuration for the confusion network in Figure 1 are (I, eye), (eye, I), (want, wand), (want, won't), (won't, what), (wand, what), etc.

The aim of the inter-confusion model is to jointly model the contextual co-occurrence information and the acoustic ambiguity co-occurrence information along both axes depicted in the confusion network. Here, the skip-gram modeling is performed over the time context and over all the possible acoustic ambiguities. The objective of the inter-confusion model is to maximize the following log-likelihood:

    \sum_{t=1}^{T} \sum_{\hat{a} \in \hat{A}_t} \sum_{c \in C_t} \sum_{a \in A_c} \log p(w_{c,a} \mid w_{t,\hat{a}})        (3)

where C_t corresponds to the set of indices of the nodes of the confusion network, i.e., the words around the current word t along the time axis, and c is the current context index. A_c is the set of indices of acoustically ambiguous words at context c. For example, for the current word w_{t,\hat{a}} = "want" in Figure 1, A_c \subseteq {I, eye, two, tees, to, seat, sit, seed, eat} and A_t \subseteq {wand, won't, what, want}. Note that each word input is subword encoded as in equation 2. A few training samples (input, target) generated by this configuration are (want, I), (want, eye), (want, two), (want, to), (want, tees), (what, I), (what, eye), (what, to), (what, tees), (what, two), (won't, eye), etc.

Negative sampling is employed for training the embedding model. Negative sampling was first introduced for training the word2vec representation [20]. It is a simplification of the noise contrastive estimation objective [9]. Negative sampling for training the embedding can be posed as a set of binary classification problems operating on two classes: presence of signal or absence (noise). In the context of word embeddings, the context words are treated as the positive class, and the negative class is randomly sampled from the unigram distribution of the vocabulary. The negative sampling objective for the subword model can be expressed using the binary logistic loss as:

    \log \sigma\left(\sum_{s \in S_{w_i}} x_s^{\top} o_{w_t}\right) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \log \sigma\left(-\sum_{s \in S_{w_i}} x_s^{\top} o_{w_k}\right)        (4)

where \sigma(x) = 1/(1 + e^{-x}), w_i is the input word, w_t is the output word, S_{w_i} is the set of n-gram character subwords of the word w_i, x_s is the vector representation of the character n-gram subword s, and o_{w_t} is the output vector representation of the target word w_t. K is the number of negative samples drawn from the noise distribution P_n(w). The noise distribution P_n(w) is chosen to be the unigram distribution of words in the vocabulary raised to the 3/4th power, as suggested in [20]. Note that for confusion2vec, the input word w_i and target word w_t are derived according to equations 1 and 3 for implementing the respective training configurations.
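For intuition, the per-pair objective of equation (4) can be sketched in a few lines of NumPy. All names and dimensions here are illustrative, and the sign is flipped so the quantity is a loss to minimize:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(subword_vecs, out_vecs, input_subword_ids, target_id, negative_ids):
    # Input word = sum of its subword vectors (equation 2).
    h = subword_vecs[input_subword_ids].sum(axis=0)
    # Equation (4), negated into a loss: one positive target plus
    # K negatives sampled from the noise distribution P_n(w).
    pos = -np.log(sigmoid(h @ out_vecs[target_id]))
    neg = -np.log(sigmoid(-(out_vecs[negative_ids] @ h))).sum()
    return pos + neg

# Toy dimensions: 100 subwords, a 50-word output vocabulary, 300-dim
# vectors, K = 5 negatives drawn from unigram counts raised to the
# 3/4 power (as in [20]).
subword_vecs = rng.normal(scale=0.1, size=(100, 300))
out_vecs = rng.normal(scale=0.1, size=(50, 300))
counts = rng.integers(1, 1000, size=50).astype(float)
p_n = counts ** 0.75
p_n /= p_n.sum()
negatives = rng.choice(50, size=5, replace=False, p=p_n)
print(pair_loss(subword_vecs, out_vecs, [3, 17, 42], target_id=7,
                negative_ids=negatives))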
4. EVALUATIONS
We evaluate the proposed word embeddings along two aspects: first, by assessing the useful, meaningful information embedded in the word vector representation; second, by applying it to a realistic task of spoken language intent detection.
For evaluating the inherent semantic and syntactic knowledge of the word embeddings, we employ two tasks: (i) the semantic-syntactic analogy task, and (ii) the word similarity task. The word analogy task was first proposed in [19] and comprises word pair analogy questions of the form "W1 is to W2 as W3 is to W4". The analogy is answered correctly if vec(W2) - vec(W1) + vec(W3) is most similar to vec(W4). Another prominent approach is the word similarity task, where the rank correlation between the cosine similarity of pairs of word vectors and human-annotated word similarity scores is assessed [24]. For the word similarity task, we use the WordSim-353 database [7], consisting of 353 pairs of words annotated with a score from 1 to 10 depending on the magnitude of word similarity as perceived by humans.

For assessing the word acoustic ambiguity (similarity) information, we conduct the acoustic analogy, semantic&syntactic-acoustic analogy and acoustic similarity tasks proposed in [26]. The acoustic analogy task comprises word pair analogies compiled using homophones, which answer questions of the form: "W1 sounds similar to W2 as W3 sounds similar to W4". The acoustic analogy task is designed to assess the ambiguity information embedded in the word vector space [26]. The semantic&syntactic-acoustic analogy task is designed to assess semantic, syntactic and acoustic ambiguity information simultaneously. The analogies are formed by replacing certain words with their homophone alternatives in the original semantic and syntactic analogy task [26]. The acoustic word similarity task is analogous to the word similarity task, i.e., it contains word pairs which are rated on their acoustic similarity based on normalized phone edit distances. A value of 1.0 means two words sound identical and 0.0 means the word pair is acoustically dissimilar. More details regarding the evaluation methodologies are available in [26]. The evaluation datasets are made available at https://github.com/pgurunath/confusion2vec_2.0.

We also evaluate the efficacy of the proposed word representation models on the task of spoken language intent classification. A recurrent neural network (RNN) based classifier is employed, with its embedding layer initialized with the proposed word vectors. Classification experiments are conducted by training the recurrent neural network on (i) clean manual transcripts, and (ii) noisy ASR transcripts, with evaluations on both manual and ASR transcripts. Classification error rates of the intent detection are used to derive assessments of the word vector representations.

Table 1: Results: Different proposed models

Model                          | S&S    | Acoustic | S&S-Acoustic | Average Accuracy | Word Similarity | Acoustic Similarity
Google W2V [20]                | 61.42% | 0.9%     | 16.99%       | 26.44%           | –               | -0.3489
In-domain W2V                  | 59.17% | 0.6%     | 8.15%        | 22.64%           | 0.4417          | -0.4377
fastText [3]                   | 75.93% | 0.46%    | 17.40%       | 31.26%           | 0.7361          | -0.3659
Confusion2Vec 1.0 (word) [26]
  C2V-a                        | 63.97% | 16.92%   | 43.34%       | 41.41%           | 0.5228          | 0.6200
  C2V-c                        | –      | –        | –            | –                | –               | –
Confusion2Vec 2.0 (subword)
  C2V-a                        | 56.74% | –        | –            | –                | –               | –
  C2V-c                        | 56.87% | –        | –            | –                | –               | –

C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy. For the analogy tasks, the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [26]). For the similarity tasks, all the correlations (Spearman's) are statistically significant.
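As a concrete rendering of the analogy protocol described above, a minimal Python sketch (function and argument names are ours) that answers "W1 is to W2 as W3 is to ?" by cosine similarity over a normalized embedding matrix:

import numpy as np

def answer_analogy(vecs, words, w1, w2, w3):
    """Return the word whose vector is most cosine-similar to
    vec(w2) - vec(w1) + vec(w3), skipping the three query words."""
    idx = {w: i for i, w in enumerate(words)}
    norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = norm[idx[w2]] - norm[idx[w1]] + norm[idx[w3]]
    sims = norm @ (q / np.linalg.norm(q))
    for i in np.argsort(-sims):
        if words[i] not in (w1, w2, w3):
            return words[i]

# Usage (with a trained matrix `vecs` and vocabulary list `words`):
# answer_analogy(vecs, words, "see", "sea", "write")
# A representation that captures acoustic ambiguity should return a
# homophone such as "wright".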
5. ANALOGY & SIMILARITY TASKS

5.1. Database
The Fisher English Training Part 1, Speech (LDC2004S13) and Fisher English Training Part 2, Speech (LDC2005S13) corpora [5] are used both for training the ASR and for training the confusion2vec 2.0 embeddings. The choice of database is based on [26] for direct comparison purposes. The corpus consists of spontaneous telephonic conversations between 11,972 native English speakers. The speech data amount to approximately 1,915 hours sampled at 8 kHz. The corpus is divided into three parts: training (1,905 hours, 1,871,731 utterances), development (5 hours, 5,000 utterances) and test (5 hours, 5,000 utterances). Overall, the transcripts contain approximately 20.8 million word tokens with a vocabulary size of 42,150.
5.2. Experimental Setup

The experimental setup is kept identical to [26] for direct comparison. Brief details of the setup are as follows:

5.2.1. Automatic Speech Recognition
A hybrid HMM-DNN acoustic model is trained on the train subset of the speech corpus using the Kaldi speech recognition toolkit [23]. 40-dimensional mel-frequency cepstral coefficient (MFCC) features are extracted, along with i-vector features, for training the acoustic model. The i-vector features are used to provide speaker and channel characteristics to aid acoustic modeling. The DNN acoustic model comprises 7 layers with p-norm non-linearities (p=2), each with 350 units [35]. The DNN is trained using splices of 5 MFCC frames, with a left and right context of 2, to classify among 7,979 Gaussian mixtures, using a stochastic gradient descent optimizer. The CMU pronunciation dictionary [32] is used as the word-pronunciation lexicon. A tri-gram language model is trained on the training subset of the Fisher English speech corpus. The ASR yields word error rates (WER) of 16.57% and 18.12% on the development and test datasets, respectively. Lattices are derived during ASR decoding with a decoding beam size of 11 and a lattice beam size of 6. The lattices are converted to confusion networks with the minimum Bayes risk criterion [34] for training the confusion2vec embeddings. The resulting confusion networks have a vocabulary size of 41,274 and 69.5 million words, with an average of 3.34 alternative (ambiguous) words per edge in the graph.
To train the embeddings, the most frequent words are sub-sampled as suggested in [20]. A minimum frequency threshold of 5 is also applied, and rarely occurring words are pruned from the vocabulary. The context window sizes for both the acoustic ambiguity and contextual dimensions are uniformly sampled between 1 and 5. The dimension of the word vectors is set to 300. The number of negative samples for negative sampling is chosen to be 64. The learning rate is set to 0.01 and the models are trained for a total of 15 epochs using stochastic gradient descent. All the hyper-parameters are chosen empirically for optimal performance on the development set. We implemented confusion2vec 2.0 by modifying the source code of fastText [3] (https://github.com/facebookresearch/fastText). We make our source code and trained models available at https://github.com/pgurunath/confusion2vec_2.0.
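Since confusion2vec 2.0 is implemented as a modification of fastText, the hyper-parameters above map directly onto the stock fastText Python API. The following sketch is purely illustrative: stock fastText cannot consume confusion networks, so this trains only an ordinary subword skip-gram on a plain text file ('corpus.txt' is a stand-in path):

import fasttext

# Ordinary subword skip-gram with the hyper-parameters reported above;
# the confusion-network training objectives (equations 1 and 3) are not
# part of the stock library.
model = fasttext.train_unsupervised(
    'corpus.txt',
    model='skipgram',
    dim=300,         # word vector dimension
    lr=0.01,         # learning rate
    epoch=15,        # training epochs
    neg=64,          # negative samples per pair
    minCount=5,      # prune words occurring fewer than 5 times
    minn=3, maxn=6,  # character n-gram range
    ws=5,            # maximum context window (the paper samples 1-5)
)
print(model.get_word_vector('want')[:5])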
Table 2: Results: Different proposed models

Model                          | S&S    | Acoustic | S&S-Acoustic | Average Accuracy | Word Similarity | Acoustic Similarity
Google W2V [20]                | 61.42% | 0.9%     | 16.99%       | 26.44%           | –               | -0.3489
In-domain W2V                  | 59.17% | 0.6%     | 8.15%        | 22.64%           | 0.4417          | -0.4377
fastText [3]                   | 75.93% | 0.46%    | 17.40%       | 31.26%           | 0.7361          | -0.3659
Confusion2Vec 1.0 (word) [26]
  C2V-1 + C2V-a                | 67.03% | 25.43%   | 40.36%       | 44.27%           | 0.5102          | 0.7231
  C2V-1 + C2V-c                | 70.84% | 35.25%   | 35.18%       | 47.09%           | 0.5609          | 0.6345
  C2V-1 + C2V-c (JT)           | 65.88% | –        | –            | –                | –               | –
Confusion2Vec 2.0 (subword)
  fastText + C2V-a             | –      | –        | –            | –                | –               | –
  fastText + C2V-c             | –      | –        | –            | –                | –               | –

C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy. For the analogy tasks, the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [26]). For the similarity tasks, all the correlations (Spearman's) are statistically significant.

5.3. Results

Table 1 lists the results in terms of accuracies for the analogy tasks and rank correlations for the similarity tasks. The first two rows correspond to results with the original word2vec. The Google W2V model is the open-source model released by Google (https://code.google.com/archive/p/word2vec/), trained on the 100 billion word Google News dataset. We also train an in-domain version of the original word2vec on the Fisher English corpus for fair comparison with the confusion2vec models, referred to as "In-domain W2V" in Table 1. The fastText model employed is the open-source model trained on Wikipedia dumps with a vocabulary size of more than 2.5 million words, released by Facebook (https://fasttext.cc/docs/en/pretrained-vectors.html). The middle two rows of the table correspond to confusion2vec embeddings without subword encoding and are taken directly from [26]. The bottom two rows correspond to the results obtained with subword encoding. Note that confusion2vec 1.0 is initialized on the Google word2vec model for better convergence, whereas the confusion2vec 2.0 model is initialized on the fastText model to maintain compatibility with the subword encodings. We normalize the vocabulary for all the experiments, meaning the same vocabulary is used to evaluate the analogy and similarity tasks, to allow for fair comparisons.

Comparing the baseline word2vec and fastText embeddings to confusion2vec, we observe that the baseline embeddings perform well on the semantic&syntactic analogy task and provide a good positive correlation on the word similarity task, as expected. However, they perform poorly on the acoustic analogy and semantic&syntactic-acoustic analogy tasks, and give a small negative correlation on the acoustic similarity task. All the confusion2vec models perform relatively well on the semantic&syntactic analogy task and word similarity task, but, more importantly, yield high accuracies on the acoustic analogy and semantic&syntactic-acoustic analogy tasks, and provide a high positive correlation on the acoustic similarity task.

Specifically with Confusion2Vec 2.0, among the analogy tasks, we observe that the subword encoding enhances the acoustic ambiguity modeling. On the acoustic analogy task we find a relative improvement of up to 46.41% over the non-subword counterpart. Moreover, even on the semantic&syntactic-acoustic analogy task, we observe improvements with subword encoding. However, we find a small reduction in performance on the original semantic and syntactic analogy task. Regardless of the small dip in performance, the accuracies remain acceptable in comparison to the in-domain word2vec model. Overall, taking the average accuracy over all the analogy tasks, we obtain an increase of approximately 16.62% relative over the non-subword confusion2vec models.

Investigating the results of the similarity tasks, we find a significant and high correlation of 0.81 on the acoustic similarity task with the subword encoding. Again, a small degradation is observed on the word similarity task, which obtains a correlation of 0.3181 against the 0.4417 of the in-domain baseline word2vec model. Overall, the results of the analogy and similarity tasks suggest the subword encoding greatly enhances the ambiguity modeling of confusion2vec.
Further, the confusion2vec model can be concatenated with other word embedding models to produce a new word vector space that can result in better representations, as seen in [26]. Table 2 lists the results of the concatenated models. For the previous, non-subword version of confusion2vec, the vector models are concatenated with the word2vec model trained on the ASR output transcripts (C2V-1). The choice of using C2V-1 instead of the Google W2V for concatenation was based on empirical findings. To maintain compatibility of the subword encoding, the confusion2vec 2.0 models are concatenated with fastText models.

Fig. 2: 2-D plots of selected word vectors portraying semantic, syntactic and acoustic relationships after dimension reduction using PCA: (a) fastText, (b) Confusion2Vec 2.0: C2V-a. The blue lines indicate semantic relationships, blue ellipses indicate syntactic relationships, red lines indicate acoustic-semantic/syntactic relations and red ellipses indicate acoustically ambiguous word relations.

First, comparing the non-concatenated versions in Table 1 with the concatenated versions in Table 2 for the non-subword models, we observe an improvement of approximately 7.22% relative in average analogy accuracy after concatenation. We do not observe a significant improvement with the subword based models after concatenation in terms of average analogy accuracy. However, we observe different dynamics between the acoustic ambiguity and the semantic and syntactic subspaces: concatenation results in improved semantic and syntactic evaluations at the expense of a drop in accuracy on the acoustic analogy task. We also note improvements (9.27% relative) on the semantic&syntactic-acoustic analogy task after concatenation, confirming the meaningful co-existence of both ambiguity and semantic-syntactic relations. Moreover, the word similarity task also yields better correlation after concatenation.

Next, comparing confusion2vec 1.0 (non-subword) and the subword version, we observe significant improvements on the semantic&syntactic analogy task (7.51% relative) as well as the semantic&syntactic-acoustic analogy task (21.78% relative). Moreover, the subword models outperform the non-subword version on both of the similarity tasks. The subword models slightly under-perform on the acoustic analogy task, but, more crucially, significantly outperform the Google W2V and fastText baselines.

Further, the concatenated models can be fine-tuned and optimized to exploit additional gains, as found in [26]. The row corresponding to Confusion2Vec 1.0 - C2V-1 + C2V-c (JT) is the best result obtained in [26], which involves two passes. Confusion2Vec 2.0 with subword modeling and a single training pass gives comparable performance to the two-pass approach. Thus we skip the two-pass approach with the subword model in favor of ease of training and reproducibility.
Figure 2 illustrates the word vector spaces of the fastText embeddings and the proposed C2V-a embeddings after dimension reduction using principal component analysis. We observe meaningful interactions between the semantic&syntactic subspace and the acoustic ambiguity subspace. For example, in Figure 2b, the vectors "boy"-"prince", "see"-"seeing", "read"-"write" and "uncle"-"aunt" are similar to the acoustically ambiguous vectors "boy"-"prints", "sea"-"seeing", "red"-"wright" and "uncle"-"ant" respectively, which is not the case in Figure 2a with the fastText embeddings. Such vector relationships can be exploited by downstream spoken language applications, providing crucial acoustic ambiguity information to recover from speech recognition errors. Also note that acoustically ambiguous words such as "prinz", "prince" and "prints" are found clustered together. Another important observation is that the word "prinz", out-of-vocabulary in English, has an orphaned representation under fastText in Figure 2a. However, "prinz" finds a meaningful representation on the basis of its acoustic signature in the proposed Confusion2vec model, as seen in Figure 2b, i.e., "prinz" is clustered together with the acoustically similar words "prince" and "prints", and the vector "boy"-"prinz" is similar to the vector "boy"-"prince". The occurrence of out-of-vocabulary words such as "prinz" is commonplace with end-to-end ASR systems that output characters and are prone to errors. Note that out-of-vocabulary words such as "prinz" cannot be represented by typical word embeddings such as word2vec, GloVe, etc., which are hence sub-optimal for use with many end-to-end ASR systems.
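A projection like the one in Figure 2 can be produced for any trained model along the following lines. This is a sketch: the `emb` dictionary is a random stand-in for whichever trained embedding matrix is loaded.

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

words = ["boy", "girl", "prince", "prints", "prinz", "princess",
         "see", "sea", "seeing", "uncle", "aunt", "ant"]
# Stand-in for trained 300-dim vectors; replace with real lookups.
emb = {w: np.random.randn(300) for w in words}

X = np.stack([emb[w] for w in words])
xy = PCA(n_components=2).fit_transform(X)   # 300-d -> 2-d projection

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))
plt.show()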
6. SPOKEN LANGUAGE INTENT DETECTION
In this section, we apply the proposed word vector embeddings to the task of spoken language intent detection. Spoken language intent detection is the process of decoding the speaker's intent in contexts involving voice commands, call routing and other human-computer interactions. Many spoken language technologies use an ASR to convert the speech signal to text, a process prone to errors, for example due to varying speaker and noise environments. The erroneous ASR outputs in turn degrade the downstream intent classification. A few efforts have focused on handling the errors of the ASR to make the subsequent intent detection process more robust. These efforts often involve training the intent classification systems on noisy ASR transcripts. The downside of training the intent classifiers on ASR output is that the systems are limited by the amount of speech data available. Moreover, varying speech signal conditions and the use of different ASR models make such classifiers non-optimal and less practical. In many scenarios, speech data is not available at all to enable adaptation on ASR transcripts.

In our previous work [27], we applied the non-subword version of Confusion2vec to the task of spoken language intent detection. We demonstrated that Confusion2vec is able to perform as efficiently as popular word embeddings like word2vec and GloVe on clean manual transcripts, giving comparable classification error rates. More importantly, we were able to illustrate the robustness of the confusion2vec embeddings when evaluated on noisy ASR transcripts. Confusion2vec gives significantly better accuracies (up to 20% relative improvement) when evaluated on ASR transcripts compared to the word2vec and GloVe embeddings and to state-of-the-art models involving more complex neural network intent classification architectures. Moreover, we also showed that Confusion2vec suffers the least degradation between clean and ASR transcripts. We also found that Confusion2vec consistently provides the best classification rates even when the intent classifier is trained on ASR transcripts. The experiments indicated that the loss in accuracy between training the intent classifier on clean versus ASR transcripts is reduced from 2.57% to 0.89% absolute. Overall, the results illustrate that confusion2vec has inherent knowledge of the acoustic ambiguity (similarity) word relations, which correlate with ASR errors, and the classifier is thereby able to recover from certain errors more efficiently. In this section, we incorporate the confusion2vec 2.0 embeddings, with their inherent knowledge of acoustic ambiguity, to enable robust intent classification.
We conduct experiments on the Airline Travel Information Systems (ATIS) benchmark dataset [12]. The dataset consists of humans making flight-related inquiries to an automated answering machine, with the audio recorded and its transcripts manually annotated. ATIS consists of 18 intent categories. The dataset is divided into train (4,478 samples), development (500 samples) and test (893 samples) subsets, consistent with previous works [27, 11, 8]. For the ASR evaluations, the audio recordings are down-sampled from 16 kHz to 8 kHz and then decoded using the ASR setup described in Section 5.2.1 (the audio mappings are available at https://github.com/pgurunath/slu_confusion2vec). The ASR achieves a WER of 18.54% on the ATIS test set.

For intent classification we adopt a simple RNN architecture identical to [27], to allow for direct comparison. The architecture of the neural network is intentionally kept simple for effective inference of the efficacy of the proposed embedding word features. The classifier comprises an embedding layer, followed by a single layer of bi-directional recurrent neural network (RNN) with long short-term memory (LSTM) units, which is followed by a linear dense layer with a softmax function to output a probability distribution across all the intent categories. The embedding layer is kept fixed throughout training, except in the case of randomly initialized embeddings, where the embedding is estimated on the in-domain data specific to the task of intent detection. A sketch of this classifier is given after the baseline list below.

The intent classification models are trained on the 4,478 samples of the training subset, and the hyper-parameters are tuned on the development set. We choose the set of hyper-parameters yielding the best results on the development set and then apply it to the unseen held-out test subsets of both the clean manual transcripts and the ASR transcripts, and report the results. For training, we treat each utterance as a single sample (batch size = 1). The hyper-parameters explored include the hidden dimension size of the LSTM, the learning rate and the dropout, each tuned over a small grid of values. The Adam optimizer is employed for optimization, and the models are trained for a total of 50 epochs with early stopping when the loss on the development set does not improve for 5 consecutive epochs.

We include results from several baseline systems to provide comparisons of Confusion2Vec 2.0 with popular context-free word embeddings, contextual embeddings, popular established NLU systems and the current state-of-the-art:

1. Context-Free Embeddings: GloVe [21] (https://nlp.stanford.edu/projects/glove/), skip-gram word2vec [20] (https://code.google.com/archive/p/word2vec/) and fastText [3] (https://fasttext.cc/docs/en/pretrained-vectors.html) word representations are employed. They are referred to as context-free embeddings since the word representations are static irrespective of the context.

2. ELMo: Peters et al. [22] proposed deep contextualized word representations based on a character-based deep bidirectional language model trained on a large text corpus. The models effectively capture the syntax and semantics of the language along varying linguistic contexts. Unlike context-free embeddings, ELMo embeddings have varying representations for each word depending on the word's context. We employ the original model trained on the 1 Billion Word Benchmark with 93.6 million parameters (https://allennlp.org/elmo). For intent classification we add a single bi-directional LSTM layer with attention for multi-task joint intent and slot prediction.

3. BERT: Devlin et al. [6] introduced BERT, bidirectional contextual word representations based on the self-attention mechanism of Transformer models. BERT models make use of masked language modeling and next sentence prediction to model language. Similar to ELMo, the word embeddings are contextual, i.e., they vary according to the context. We employ the "bert-base-uncased" model with 12 layers of 768 dimensions each, trained on the BookCorpus and English Wikipedia corpora (https://github.com/google-research/bert). For intent classification we add a single bi-directional LSTM layer with attention for multi-task joint intent and slot prediction.

4. Joint SLU-LM: Liu and Lane [17] employed joint modeling of next-word prediction along with intent and slot labeling. The unidirectional RNN model updates intent states for each word input and uses them as context for slot labeling and language modeling.

5. Attn. RNN Joint SLU: Liu and Lane [16] proposed an attention-based encoder-decoder bidirectional RNN model in a multi-task setting for joint intent and slot-filling tasks. A weighted average of the encoder bidirectional LSTM hidden states provides information from parts of the input word sequence, which is used together with the time-aligned encoder hidden state by the decoder to predict the slot labels and intent.

6. Slot-Gated Attn.: Goo et al. [8] introduced a slot-gated mechanism which adds an additional gate to improve slot and intent prediction performance by leveraging the intent context vector for the slot-filling task.

7. Self Attn. SLU: Li et al. [15] proposed a self-attention model with a gate mechanism for joint learning of intent classification and slot filling, utilizing the semantic correlation between slots and intents. The model estimates embeddings augmented with intent information using a self-attention mechanism, which is utilized as a gate for the slot-filling task.

8. Joint BERT: Chen et al. [4] proposed using BERT embeddings for joint modeling of intent and slot filling. The pre-trained BERT embeddings are fine-tuned for (i) a sentence prediction task, intent detection, and (ii) a sequence prediction task, slot filling. The Joint BERT model lacks the bi-directional LSTM layer, in comparison to the earlier BERT-based baseline model.

9. SF-ID Network: Haihong et al. [10] introduced a bi-directional interrelated model for joint modeling of intent detection and slot filling. An iteration mechanism is proposed where the SF subnet introduces intent information into the slot-filling task, while the ID subnet applies slot information to the intent detection task. For the slot-filling task, a conditional random field layer is used to derive the final output.

10. ASR Robust ELMo: Huang and Chen [13] proposed ASR-robust contextualized embeddings for intent detection. ELMo embeddings are fine-tuned with a novel loss function which minimizes the cosine distance between the acoustically confused words found in ASR confusion networks. Two techniques, based on supervised and unsupervised extraction of word confusions, are explored. The fine-tuned contextualized embeddings are then utilized for spoken language intent detection.
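As referenced above, a minimal PyTorch sketch of the intent classifier used here (layer sizes are illustrative, not the tuned values):

import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Frozen embedding layer initialized with pretrained (e.g.,
    Confusion2Vec) vectors, one bi-directional LSTM layer, and a
    linear output layer producing logits over the intent classes."""
    def __init__(self, embedding_matrix, hidden_size=128, n_intents=18):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.lstm = nn.LSTM(embedding_matrix.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, n_intents)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h[:, -1, :])         # logits; softmax in the loss

emb = torch.randn(1000, 300)   # stand-in for trained embedding vectors
model = IntentClassifier(emb)
logits = model(torch.randint(0, 1000, (1, 12)))  # one 12-token utterance
print(logits.shape)            # torch.Size([1, 18])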
We conduct experiments by training models on (i) clean human annotations and (ii) noisy ASR transcriptions.

Table 3: Results for models trained on clean Reference transcripts: classification error rates (CER) on Reference and ASR transcripts. Δdiff is the absolute degradation of the model from clean to ASR. C2V 1.0 corresponds to C2V-1 + C2V-c (JT) in Tables 1 and 2. † indicates joint modeling of intent and slot filling.

Model                      | Reference | ASR   | Δdiff
Context-Free Embeddings
  Random                   | 2.69      | 10.75 | 8.06
  GloVe [21]               | 1.90      | 8.17  | 6.27
  Word2Vec [20]            | 2.69      | 8.06  | 5.37
  fastText [3]             | 1.90      | 8.40  | 6.50
  Joint SLU-LM [17] †      | –         | –     | –
Proposed Context-Free Embeddings
  C2V-c 2.0                | 3.36      | 5.82  | 2.46
  C2V-a 2.0                | 2.46      | –     | –
  fastText + C2V-c 2.0     | 1.79      | –     | –

Table 3 lists the results of intent detection in terms of classification error rates (CER). The "Reference" column corresponds to results on the human-transcribed ATIS audio, and the "ASR" column corresponds to evaluations on the noisy speech recognition transcripts. First, evaluating on the clean Reference transcripts, we observe that confusion2vec 2.0 with subword encoding achieves the third-best performance. The best-performing confusion2vec 2.0 achieves a CER of 1.79%. Among the different versions of the proposed subword-based confusion2vec, we find that the concatenated versions are slightly better. We believe this is because the concatenated models exhibit better semantic and syntactic relations (see Tables 1 and 2) compared to the non-concatenated ones. Among the baseline models, the contextual embeddings BERT and ELMo give the best CER. Note that the proposed confusion2vec embeddings are context-free and are able to outperform the other context-free embedding models such as GloVe, word2vec and fastText.

Second, evaluating the performance on the noisy ASR transcripts, we find that all the subword-based confusion2vec 2.0 models outperform the popular word vector embeddings by a big margin. The subword confusion2vec gives an improvement of approximately 45.78% relative over the best-performing context-free word embeddings. The proposed embeddings also improve over the contextual embeddings, including BERT and ELMo (relative improvement of 29.06%). Moreover, the results are also an improvement over the non-subword confusion2vec word vectors (31.50% improvement). Comparisons between the different versions of the proposed confusion2vec show that the intra-confusion configuration yields the lowest CER. The best results with the proposed model outperform the state-of-the-art (ASR Robust ELMo [13]) by reducing the CER by a relative 13.12%. Inspecting the degradation Δdiff (the drop in performance between the clean and ASR evaluations), we find that all the confusion2vec 2.0 models with subword information undergo low degradation while giving the best CER, thereby re-affirming their robustness to noise in the transcripts. This confirms our initial hypothesis that the subword encoding is better able to represent the acoustic ambiguities in human language.

Table 4 presents the results obtained by training the models on the ASR transcripts and evaluating them on the ASR transcripts. Here we omit all the joint intent-slot filling baseline models, since training on ASR transcripts needs an aligned set of slot labels, due to insertion, substitution and deletion errors, which is out of the scope of this study. We note that the confusion2vec models give significantly lower CER. The subword-based confusion2vec models also provide improvements over the non-subword confusion2vec model (21.28% improvement).
Table 4: Results for models trained and evaluated on ASR transcripts. C2V 1.0 corresponds to C2V-1 + C2V-c (JT) in Tables 1 and 2. * We do not domain-constrain, optimize or re-score our ASR, as in [25].

Model                             | WER % | CER %
Random                            | 18.54 | 5.15
GloVe [21]                        | 18.54 | 6.94
Word2Vec [20]                     | 18.54 | 5.49
Schumann and Angkititrakul [25] * | 10.55 | 5.04
C2V 1.0                           | 18.54 | 4.70
C2V-c 2.0                         | 18.54 | 4.82
C2V-a 2.0                         | 18.54 | –
fastText + C2V-c 2.0              | 18.54 | –
fastText + C2V-a 2.0              | 18.54 | –
Comparing the results in Tables 3 and 4, we would like to highlight that the subword confusion2vec model gives a minimum CER of 4.37% when trained on clean transcripts, which is much better than the CER obtained by popular word embeddings like word2vec, GloVe and fastText even when they are trained on the ASR transcripts (15.15% better relatively). These results show that the subword confusion2vec models can eliminate the need for re-training natural language understanding and processing algorithms on ASR transcripts for robust performance.
7. CONCLUSION
In this paper, we proposed the use of subword encoding for modeling the acoustic ambiguity information in word vector representations, augmenting the semantics and syntax of the language. Each word in the language is represented as a sum of its constituent character n-gram subwords. The advantages of the subwords are confirmed by evaluating the proposed models on various word analogy and word similarity tasks designed to assess the acoustic ambiguity, semantic and syntactic knowledge inherent in the models. Finally, the proposed subword models are applied to the task of spoken language intent detection. The results of the intent classification system suggest that the proposed subword confusion2vec models greatly enhance the classification performance when evaluated on noisy ASR transcripts. The results highlight that subword confusion2vec models are robust and domain-independent and do not need re-training of the classifier on ASR transcripts.

In the future, we plan to model ambiguity information using deep contextual modeling techniques such as BERT. We believe bidirectional information modeling with attention can further enhance ambiguity modeling. On the application side, we plan to implement and assess the effect of using Confusion2vec models on a wide range of natural language understanding and processing applications such as speech translation, dialogue tracking, etc.
References

[1] Jennifer Aydelott and Elizabeth Bates. Effects of acoustic distortion and semantic context on lexical access. Language and Cognitive Processes, 19(1):29–56, 2004.
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[4] Qian Chen, Zhu Zhuo, and Wen Wang. BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909, 2019.
[5] Christopher Cieri, David Miller, and Kevin Walker. The Fisher corpus: a resource for the next generations of speech-to-text. In LREC, volume 4, pages 69–71, 2004.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[7] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414, 2001.
[8] Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 753–757, 2018.
[9] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361, 2012.
[10] E Haihong, Peiqing Niu, Zhongfu Chen, and Meina Song. A novel bi-directional interrelated model for joint intent detection and slot filling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5467–5471, 2019.
[11] Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech, pages 715–719, 2016.
[12] Charles T. Hemphill, John J. Godfrey, and George R. Doddington. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990, 1990.
[13] Chao-Wei Huang and Yun-Nung Chen. Learning ASR-robust contextualized embeddings for spoken language understanding. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8009–8013. IEEE, 2020.
[14] Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Lambert Mathias, Ariya Rastrow, and Björn Hoffmeister. LatticeRNN: Recurrent neural networks over lattices. In Interspeech, pages 695–699, 2016.
[15] Changliang Li, Liang Li, and Ji Qi. A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3824–3833, 2018.
[16] Bing Liu and Ian Lane. Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, pages 685–689, 2016. doi: 10.21437/Interspeech.2016-1352.
[17] Bing Liu and Ian Lane. Joint online spoken language understanding and language modeling with recurrent neural networks. arXiv preprint arXiv:1609.01462, 2016.
[18] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[21] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[22] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[23] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[24] Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307, 2015.
[25] Raphael Schumann and Pongtep Angkititrakul. Incorporating ASR errors with attention-based, jointly trained RNN for intent detection and slot filling. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6059–6063. IEEE, 2018.
[26] Prashanth Gurunath Shivakumar and Panayiotis Georgiou. Confusion2Vec: Towards enriching vector space word representations with representational ambiguities. PeerJ Computer Science, 5:e195, 2019.
[27] Prashanth Gurunath Shivakumar, Mu Yang, and Panayiotis Georgiou. Spoken language intent detection using Confusion2Vec. arXiv preprint arXiv:1904.03576, 2019.
[28] Matthias Sperber, Graham Neubig, Ngoc-Quan Pham, and Alex Waibel. Self-attentional models for lattice inputs. arXiv preprint arXiv:1906.01617, 2019.
[29] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
[30] Zhixing Tan, Jinsong Su, Boli Wang, Yidong Chen, and Xiaodong Shi. Lattice-to-sequence attentional neural machine translation models. Neurocomputing, 284:138–147, 2018.
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[32] Robert Weide. The CMU pronunciation dictionary, release 0.6, 1998.
[33] Fengshun Xiao, Jiangtong Li, Hai Zhao, Rui Wang, and Kehai Chen. Lattice-based transformer encoder for neural machine translation. arXiv preprint arXiv:1906.01282, 2019.
[34] Haihua Xu, Daniel Povey, Lidia Mangu, and Jie Zhu. Minimum Bayes risk decoding and system combination based on a recursion for edit distance. Computer Speech & Language, 25(4):802–828, 2011.
[35] Xiaohui Zhang, Jan Trmal, Daniel Povey, and Sanjeev Khudanpur. Improving deep neural network acoustic models using generalized maxout networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.