Acoustic Neighbor Embeddings
Woojay Jeon
Apple
One Apple Park Way
Cupertino, CA 95014, USA
[email protected]

ABSTRACT
This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings where speech or text of arbitrary length is mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model, and a text encoder that accepts text in the form of subword transcriptions. Compared to a known method based on a triplet loss, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic match task. In particular, in an isolated name recognition task depending solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
1 INTRODUCTION
Acoustic word embeddings (Levin et al., 2013; Maas et al., 2012) are vector representations of words that capture information on how the words sound, as opposed to word embeddings that capture information on what the words mean. A number of acoustic word embedding methods have been proposed, applied to word discrimination (He et al., 2017; Jung et al., 2019), lattice rescoring in automatic speech recognition (ASR) (Bengio & Heigold, 2014), and query-by-example keyword search (Settle et al., 2017) or detection (Chen et al., 2015). Recently, triplet loss functions (He et al., 2017; Settle et al., 2019; Jung et al., 2019) have been used to train two neural networks simultaneously: an acoustic encoder f(·) network that accepts speech, and a text encoder g(·) network that accepts text as the input. By training the two to transform their inputs into a common space where matching speech and text get mapped to the same coordinates, f and g can be used in tandem in applications where a speech utterance is compared against a database of text, or vice versa. They can also each be used as a standalone, general-purpose word embedding network that maps similar-sounding speech (in the case of f) or text (in the case of g) to similar locations in the embedding space.

An obvious application is highly-scalable isolated word (or name) recognition (e.g. in a music player app where the user can tap the search bar to say a song or album title), where a given speech input is mapped via f to an embedding vector f, which is then compared against a database of embedding vectors {g_1, ..., g_N}, prepared via g, that represents a vocabulary of N words, to classify the speech. Using the Euclidean distance, the classification rule is:

$$\hat{i} = \arg\min_{1 \le j \le N} \lVert \mathbf{f} - \mathbf{g}_j \rVert, \qquad (1)$$

where ∥·∥ is the L² norm. It is interesting to note that the basic notion of "memorizing" audio signals in fixed dimensions can be traced back to as early as Longuet-Higgins (1968). Vector distances have been used in the past for other matching problems (e.g. Schroff et al. (2015)). For speech recognition, a rule like (1) is interesting because it can be easily parallelized for fast evaluation on modern hardware (e.g. Garcia et al. (2008)), and the vocabulary can also be trivially updated since each entry is a vector. If proven to work well in isolated word recognition, the embedding vectors could be used in continuous speech recognition where speech segments hypothesized to contain named entities can be recognized via nearest-neighbor search, particularly when we want an entity vocabulary that is both large and dynamically updatable. However, none of the aforementioned papers have reported results in isolated word recognition.

The main contribution in this paper is a new training method for f and g which adapts stochastic neighbor embedding (SNE) (Hinton & Roweis, 2003) to sequential data. It will be shown by analysis of the gradients of the proposed loss function that it is more effective than the triplet loss for mapping similar-sounding inputs to similar locations in a vector space. It will also be shown that the embeddings produced by the proposed method, called Acoustic Neighbor Embeddings (ANE), work better than embeddings produced by a triplet-loss-based method in isolated name recognition and approximate phonetic match experiments.
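As a concrete illustration of rule (1), the following is a minimal NumPy sketch of the nearest-neighbor classification step, assuming the embedding vectors have already been computed. The variable names and the random stand-in data are placeholders, not part of the original system; the brute-force scan shown here is the operation the paper describes as trivially parallelizable (cf. Garcia et al. (2008)).

```python
import numpy as np

def classify(f_vec: np.ndarray, g_matrix: np.ndarray) -> int:
    """Rule (1): index of the vocabulary entry whose text embedding g_j
    is nearest (in L2 distance) to the acoustic embedding f.

    f_vec:    (D,)   embedding of the input speech, from the acoustic encoder f
    g_matrix: (N, D) embeddings of the N vocabulary words, from the text encoder g
    """
    # Squared L2 distances to all N entries in one vectorized pass;
    # argmin is unaffected by dropping the square root.
    dists = np.sum((g_matrix - f_vec) ** 2, axis=1)
    return int(np.argmin(dists))

# Hypothetical usage with random vectors standing in for real embeddings
# (the paper's largest experiment scales this to 1M names, 40 dimensions).
rng = np.random.default_rng(0)
vocab_embeddings = rng.normal(size=(100_000, 40))
acoustic_embedding = vocab_embeddings[12345] + 0.01 * rng.normal(size=40)
print(classify(acoustic_embedding, vocab_embeddings))  # -> 12345
```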
This paper also claims to be the first work that successfully uses a simple L² distance between vectors to directly perform isolated word recognition that can match the accuracy of conventional FST-based recognition over large vocabularies.

One design choice in this work for the acoustic encoder f is that instead of directly reading acoustic features, a separate acoustic model is used to preprocess them into (framewise) subword posterior probability estimates. "Posteriorgrams" have been used in past studies (e.g. Hazen et al. (2009)) for speech information retrieval. In the proposed work, they allow the f network to be much smaller, since the task of resolving channel and speaker variability can be delegated to a state-of-the-art ASR acoustic model. One can still use acoustic features with the proposed method, but in many practical scenarios an acoustic model is already available for ASR that computes subword scores for every speech input, so it is feasible to reuse those scores.

As inputs to the text encoder g, this study experiments with two types of text: phone sequences and grapheme sequences. In the former case, a grapheme-to-phoneme converter (G2P) is used to convert each text input to one or more phoneme sequences, and each phoneme sequence is treated as a separate "word." This approach reduces ambiguities caused by words that could be pronounced multiple ways, such as "A.R.P.A", which could be pronounced as (in ARPABET phones) "aa, r, p, aa", or "ey, aa, r, p, iy, ey", or "ey, d, aa, t, aa, r, d, aa, t, p, iy, d, aa, t, ey" (pronouncing every "." as "dot"). In the latter case using graphemes, more errors can occur in word recognition because a single embedding vector may not capture all the variations in how a word may sound. On the other hand, such a system can be more feasible because it does not require a separate G2P.

Also note that while we use the term "word embedding" following known terminology in the literature, a "word" in this work can actually be a sequence of multiple words, such as "John W Smith" or "The Hilton Hotel" as in the name recognition experiments in Section 5.

2 REVIEW OF STOCHASTIC NEIGHBOR EMBEDDING
In short, stochastic neighbor embedding (SNE) (Hinton & Roweis, 2003) is a method of reducing dimensions in vectors while preserving relative distances, and is a popular method for data visualization. Given a set of N coordinates {x_1, ..., x_N}, SNE provides a way to train a function f(·) that maps each coordinate x_i to another coordinate of lower dimensionality f_i where the relative distances among the x_i's are preserved among the corresponding f_i's.

The distance between two points x_i and x_j in the input space is defined as the squared Euclidean distance with some scale factor σ_i:

$$d_{ij} = \frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{\sigma_i^2}. \qquad (2)$$

This distance is used to define the probability of x_i choosing x_j as its neighbor in the input space:

$$p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}. \qquad (3)$$

The corresponding "induced" probability in the embedding space is

$$q_{ij} = \frac{\exp\left(-\lVert \mathbf{f}_i - \mathbf{f}_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert \mathbf{f}_i - \mathbf{f}_k \rVert^2\right)}. \qquad (4)$$

The loss function for training f is the Kullback-Leibler divergence between the two distributions:

$$\mathcal{L}_f = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad (5)$$

which can be differentiated to obtain this beautiful equation:

$$\frac{\partial \mathcal{L}_f}{\partial \mathbf{f}_i} = 2 \sum_j (\mathbf{f}_i - \mathbf{f}_j)(p_{ij} - q_{ij} + p_{ji} - q_{ji}). \qquad (6)$$

Hinton & Roweis (2003)'s cogent interpretation of the above equation as "a sum of forces pulling f_i toward f_j or pushing it away depending on whether j is observed to be a neighbor more or less often than desired" is a seed for other arguments that will be made in the present paper.
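For readers who prefer code, below is a small NumPy sketch of the quantities in (3)-(5), with all σ_i fixed to 1 for simplicity (the induced distribution (4) carries no scale factor either). It is an illustrative aid under those assumptions, not the paper's implementation.

```python
import numpy as np

def neighbor_probs(vectors: np.ndarray) -> np.ndarray:
    """Row-stochastic neighbor probabilities as in (3)/(4):
    prob[i, j] ∝ exp(-||v_i - v_j||^2), with the diagonal (k = i) excluded."""
    diffs = vectors[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)           # exclude self: exp(-inf) = 0
    logits = -sq_dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def sne_kl_loss(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL divergence in (5), summed over all (i, j) pairs with p_ij > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

# Toy usage: 5 points in a 10-D "input" space, crudely "embedded" in 2-D.
x = np.random.default_rng(1).normal(size=(5, 10))
f = x[:, :2]
print(sne_kl_loss(neighbor_probs(x), neighbor_probs(f)))
```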
3 PROPOSED EMBEDDING METHOD

3.1 METHOD DESCRIPTION
Consider a training set of N speech utterances. The n'th utterance is characterized by (S_n, X_n, Y_n), where S_n is the audio signal containing one or more words of speech, X_n is a sequence of subword posterior probability vectors for S_n, and Y_n is a sequence of subwords pertaining to the reference transcription of S_n.

X_n = [x_1, x_2, ..., x_T] is obtained from an acoustic model such as a DNN-HMM (Deep Neural Network-Hidden Markov Model) system, where the d'th element of x_t is an estimate of the posterior probability of subword w_d occurring at frame t given the speech signal S_n. There is a subword posterior vector for every speech frame, and T is the number of frames in the utterance. While monophone posteriors are used in this study, other subword posteriors could be used, such as HMM state or grapheme posteriors. Also note that raw acoustic features could be used for X_n, but posteriors were used here for reasons mentioned in Section 1.

The subword reference transcription of each utterance is also obtained, such as the phone sequence "k, r, ao, s, ih, ng" or the grapheme sequence "c, r, o, s, s, i, n, g" for the word "crossing," by force-aligning the utterance with its manual word transcription using an ASR with a pronunciation dictionary. This sequence is represented as a sequence of 1-hot vectors Y_n = [y_1, y_2, ..., y_M], where M is the number of subwords in the sequence. Each subword usually occupies at least one frame of speech, so T ≥ M for every utterance. Note that the "subwords" used for the posteriors in X_n need not be the same as the "subwords" used for the transcriptions in Y_n (for ANE-g in Section 5, X_n is a monophone posterior vector sequence, whereas Y_n is a grapheme transcription).

The idea in this paper is to train an acoustic encoder neural network f(·) that will transform each posterior vector sequence to a single fixed-dimension embedding vector f_n = f(X_n) such that predefined relative distances between data samples in the space of X_n will be preserved in the space of f_n in a manner similar to SNE.

First, since each X_i is a sequence of vectors, the Euclidean distance in (2) does not make sense in our input space. Instead, we define the distance between X_i and X_j based on whether their transcriptions are an exact match in the space of Y:

$$d_{ij} = \begin{cases} 0 & \text{if } Y_i = Y_j \\ \infty & \text{else.} \end{cases} \qquad (7)$$

It is worth noting that alternate forms of d_ij based on dynamic time warping between X_i and X_j (in a manner similar to Hazen et al. (2009)) with heuristic insertion, deletion, and substitution costs were also tried, but the binary distance above gave better accuracy.

The distance in (7) results in the following probability for (3):

$$p_{ij} = \begin{cases} 1/c_i & \text{if } Y_i = Y_j \\ 0 & \text{else,} \end{cases} \qquad (8)$$

where c_i is the number of utterances (other than the i'th) that have the same subword sequence Y_i. We use the same induced probability in (4) for the embedding space. Since this probability is explicitly based on the L² norm, all comparisons we do in the embedding space (e.g. word recognition using (1) or phonetic distance computations in Table 1) are done using the L² norm.

The loss function for training the acoustic encoder f is L_f as described in (5). Once we have fully trained f, we simply train the text encoder g such that its output for every subword sequence Y_n will match as closely as possible the output of f for the corresponding subword posterior sequence X_n, where we keep f fixed.
A simple mean square error loss proves to be sufficient for this purpose:

$$\mathcal{L}_g = \sum_{n=1}^{N} \lVert g(Y_n) - f(X_n) \rVert^2. \qquad (9)$$
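A minimal PyTorch sketch of this second training stage follows, assuming the acoustic encoder f is already trained and frozen. The Bi-LSTM shapes loosely follow Section 5, but the exact architecture, optimizer, and the random stand-in tensors here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """Bi-LSTM encoder with a regression layer applied to the last output,
    producing one fixed-dimension embedding per input sequence."""
    def __init__(self, in_dim: int, hidden: int, emb_dim: int, layers: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)         # (B, T, 2 * hidden)
        return self.proj(out[:, -1])  # embedding from the last time step

# Acoustic encoder f: already trained, kept frozen while g is trained.
f_enc = SeqEncoder(in_dim=51, hidden=100, emb_dim=40, layers=2)
f_enc.requires_grad_(False)
# Text encoder g: trained to mimic f's outputs via the MSE loss in (9).
g_enc = SeqEncoder(in_dim=50, hidden=200, emb_dim=40)

opt = torch.optim.Adam(g_enc.parameters())
mse = nn.MSELoss(reduction="sum")

X = torch.rand(8, 70, 51)  # stand-in posteriorgram sequences (B, T, subwords)
Y = torch.rand(8, 6, 50)   # stand-in 1-hot subword sequences (B, M, subwords)
loss = mse(g_enc(Y), f_enc(X))  # L_g in (9); f's output is a fixed target
opt.zero_grad()
loss.backward()
opt.step()
```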
3.2 TRAINING STRATEGY FOR THE f ACOUSTIC ENCODER

In practice, the training data for the proposed system is much larger (millions of utterances) than the original data (thousands of data points) for which SNE (Hinton & Roweis, 2003) was proposed. Hence, we sample the data into equally-sized "microbatches" for computing the loss in (5). Each microbatch consists of N data samples, each sample characterized by a subword posterior sequence and subword transcription (X, Y). A fixed number of microbatches then form a minibatch (as in "minibatch training"), and the minibatch loss is the average loss of the microbatches therein.

Because the utterances are extremely diverse, forming each microbatch via purely random sampling often results in all the subword sequences being different from each other and p_ij = 0 everywhere in (8). In such a case, the loss in (5) would be 0, so all gradients would be 0, and the microbatch would have no effect on training. To avoid wasting training time on defunct microbatches, for every microbatch we designate (X_1, Y_1) the "pivot," and artificially search for at least one sample (X_n, Y_n) in the training data that satisfies Y_n = Y_1 and insert it in the microbatch. The other samples (with a different subword transcription from Y_1) in the microbatch are chosen purely randomly. For further simplicity, we compute the loss over only a subset of i, j pairings in (5), by fixing i to 1:

$$\mathcal{L}_f = \sum_j p_{1j} \log \frac{p_{1j}}{q_{1j}}. \qquad (10)$$
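The following is a minimal sketch of this sampling strategy and of the pivot-restricted loss (10). The helper names (`by_transcript` and friends) are assumptions for illustration; a real implementation would compute `f_vecs` with the f network under training rather than receive them precomputed.

```python
import random
import numpy as np

def make_microbatch(data, by_transcript, size=160):
    """Assemble one microbatch per Section 3.2: pick a pivot at random,
    force in at least one other sample with the pivot's exact transcription,
    and fill the rest with samples whose transcriptions differ.
    `data` is a list of (X, Y) pairs; `by_transcript` maps each Y (e.g. a
    tuple of phone symbols) to the indices of samples carrying it."""
    pivot_idx = random.randrange(len(data))
    pivot_Y = data[pivot_idx][1]
    matches = [i for i in by_transcript[pivot_Y] if i != pivot_idx]
    batch = [pivot_idx] + random.sample(matches, k=min(1, len(matches)))
    while len(batch) < size:
        j = random.randrange(len(data))
        if data[j][1] != pivot_Y:
            batch.append(j)
    return [data[i] for i in batch]

def pivot_loss(f_vecs, transcripts):
    """Loss (10): the KL sum restricted to i = 1 (here index 0, the pivot),
    with p_1j from (8) and q_1j from (4)."""
    same = np.array([t == transcripts[0] for t in transcripts])
    same[0] = False                  # k != i: exclude the pivot itself
    p = same / max(same.sum(), 1)    # 1/c_1 for matching samples, else 0
    logits = -np.sum((f_vecs[0] - f_vecs) ** 2, axis=1)
    logits[0] = -np.inf              # exclude self from the softmax
    logits -= logits[1:].max()       # numerical stability
    q = np.exp(logits)
    q /= q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + 1e-12))))
```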
3.3 PHONETIC CONFUSABILITY REFLECTED IN THE g VECTORS

In the proposed method, a clear intuition exists on how the Euclidean distance between two embedding vectors g_i and g_j can directly reflect the acoustic confusability of their corresponding (purely text) subword sequences Y_i and Y_j. Since the text encoder g directly mirrors via (9) the embeddings generated by the acoustic encoder f, the g vectors - generated purely from text - will mirror the knowledge learned by f on how the text actually sounds.

Consider the two vowels "ae" (as in "bat") and "eh" (as in "bet"). Since they sound similar, it is likely that their posteriors will behave similarly; if one scores high, the other will also score high. On the other hand, the consonants "j" and "s" sound distinctly different, and therefore their scores will tend to be opposite of each other; if one scores high, the other will score low, and vice versa. Now imagine acoustic instances of "Jackson" ("j, ae, k, s, ah, n"), "Jeckson" ("j, eh, k, s, ah, n"), and "Sackson" ("s, ae, k, s, ah, n") appearing as training samples for f. Both the second and third word are only one phoneme away from "Jackson", but the phone posteriorgram sequence for "Jeckson" will be close to that of "Jackson" due to the scores for "ae" and "eh" trending similarly, so their f embeddings will also be inevitably similar. On the other hand, the posteriorgram sequence for "Sackson" will be different from that for "Jackson" because the scores for "j" and "s" tend to be mutually exclusive, so their f embeddings will be different.

Table 1: Euclidean distance between g vectors for different pairs of words, computed by the text encoder g from pure text. The differences in distances are consistent with the intuitive phonetic confusability between the words.

Words              Phone sequences fed to g network    Distance between g's
Jackson            j, ae, k, s, ah, n                  0.0478
Jeckson            j, eh, k, s, ah, n
Jackson            j, ae, k, s, ah, n                  1.331
Sackson            s, ae, k, s, ah, n
game of thrones    g, ey, m, ax, f, th, r, ow, n, z    0.122
game of drones     g, ey, m, ax, f, d, r, ow, n, z
game of thrones    g, ey, m, ax, f, th, r, ow, n, z    1.077
fame of thrones    f, ey, m, ax, f, th, r, ow, n, z

Now consider the g embeddings obtained from pure text inputs "Jackson", "Jeckson", and "Sackson". As we can see in Table 1, "Jackson" is much closer to "Jeckson" than it is to "Sackson" in the g embedding space, which is consistent with what actually sounds more similar. Similarly, "game of thrones" is closer to "game of drones" than to "fame of thrones."

3.4 THE IMPORTANCE OF NORMALIZATION
It is interesting to see what would happen if no normalization were done in (3) and (4). The scores in (8) would be trivial (1 or 0) instead of scaled (1/c_i or 0) binary values, and the computation would be significantly reduced with no denominator in (4). While this simplicity may be tempting, inspection of the resulting gradient shows that it is not a good idea:

$$\frac{\partial \mathcal{L}_f}{\partial \mathbf{f}_i} = 4 \sum_j (\mathbf{f}_i - \mathbf{f}_j)\, p_{ij}. \qquad (11)$$

Comparing this with (6) (or later with (12)), we can see that q_ij has disappeared entirely from the gradient, meaning the similarity in the embedding space no longer plays a role in weighting the gradient. Only the similarity in the input space (p_ij) influences the gradient weight, so the fundamental notion of preserving relative distances is no longer being enforced.
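To spell out the intermediate step, the following short derivation (added here; it follows directly from (5) with the denominators of (3) and (4) removed, so that q_ij = exp(-||f_i - f_j||²) and p_ij is the symmetric binary match indicator) shows where (11) comes from:

```latex
% Without normalization the loss (5) separates into a constant plus a
% p-weighted sum of squared embedding distances:
\mathcal{L}_f = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
             = \underbrace{\sum_{i,j} p_{ij} \log p_{ij}}_{\text{constant in } \mathbf{f}}
               \; + \; \sum_{i,j} p_{ij} \, \lVert \mathbf{f}_i - \mathbf{f}_j \rVert^2 .
% Differentiating w.r.t. f_i (each pair contributes once as (i,j) and once
% as (j,i), and p_{ij} = p_{ji}) recovers (11):
\frac{\partial \mathcal{L}_f}{\partial \mathbf{f}_i}
  = 2 \sum_j (p_{ij} + p_{ji}) (\mathbf{f}_i - \mathbf{f}_j)
  = 4 \sum_j p_{ij} (\mathbf{f}_i - \mathbf{f}_j).
```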
4 COMPARISON WITH THE TRIPLET LOSS

Some insight into how the proposed method compares with existing triplet-loss-based methods (He et al., 2017) can be found by inspecting the gradients of the two methods' loss functions used in backpropagation.

For ANE's f network, we can apply (8) to (6) to obtain the gradient of the loss function. The summation can be split into the sum over the samples with the same subword sequence as i, i.e., J_i^+ = {j : Y_j = Y_i}, and the rest of the samples, i.e., J_i^- = {j : Y_j ≠ Y_i}:

$$\frac{\partial \mathcal{L}_f}{\partial \mathbf{f}_i} = 2 \left[ \sum_{j \in J_i^+} (\mathbf{f}_i - \mathbf{f}_j)\left(\frac{1}{c_i} - q_{ij} + \frac{1}{c_j} - q_{ji}\right) - \sum_{j \in J_i^-} (\mathbf{f}_i - \mathbf{f}_j)\left(q_{ij} + q_{ji}\right) \right]. \qquad (12)$$

Now, let us consider the triplet loss, placed in the same context as ANE. For every given triplet of posterior sequences (X_1, X_m, X_n) with a corresponding triplet of subword sequences (Y_1, Y_m, Y_n), where Y_1 = Y_m and Y_1 ≠ Y_n, we have the following loss (He et al., 2017; Settle et al., 2019; Jung et al., 2019):

$$\mathcal{L}_{\mathrm{trip}} = \max\left\{0,\; \alpha - \mathrm{Sim}(\mathbf{f}_1, \mathbf{g}_m) + \mathrm{Sim}(\mathbf{f}_1, \mathbf{g}_n)\right\}, \qquad (13)$$

where α is some constant and Sim(f_i, g_j) is some pre-defined similarity function between f_i and g_j. The idea is to make the embeddings attract each other if their utterances have identical subword sequences, and repel each other if the subword sequences are different.

For the purpose of this analysis, we can ignore the max operator and set α = 0 in (13), since they are merely for data selection. Let us consider a "batch" triplet loss that is summed over all the samples in our set of N speech utterances:

$$\mathcal{L}_{\mathrm{trip}} = \sum_{i,j} s_{ij}\, \mathrm{Sim}(\mathbf{f}_i, \mathbf{g}_j), \qquad (14)$$

where we have defined

$$s_{ij} = \begin{cases} -1 & \text{if } Y_i = Y_j \\ +1 & \text{else.} \end{cases} \qquad (15)$$

Also, since the training tries to make f_j as close as possible to g_j for every j, we make the approximation g_j ≈ f_j, resulting in

$$\mathcal{L}_{\mathrm{trip}} = \sum_{i,j} s_{ij}\, q'_{ij}, \qquad (16)$$

where q'_ij ≜ Sim(f_i, f_j).

A variety of possibilities exist for Sim(f_i, g_j) in (13). If we assume a Euclidean-distance-based form similar to (4),

$$\mathrm{Sim}(\mathbf{f}_i, \mathbf{g}_j) \triangleq \exp\left\{-\lVert \mathbf{f}_i - \mathbf{g}_j \rVert^2\right\}, \qquad (17)$$

it can be determined that the gradient of the loss in (16) is

$$\frac{\partial \mathcal{L}_{\mathrm{trip}}}{\partial \mathbf{f}_i} = 4 \left[ \sum_{j \in J_i^+} (\mathbf{f}_i - \mathbf{f}_j)\, q'_{ij} - \sum_{j \in J_i^-} (\mathbf{f}_i - \mathbf{f}_j)\, q'_{ij} \right]. \qquad (18)$$

Comparing (18) with (12), we notice a key difference in the contribution of the samples in J_i^+ to the gradient. In the case of the triplet loss, the contribution to the gradient is amplified as the similarity q'_ij increases, causing more perturbation to the model parameters during backpropagation. This behavior is counterintuitive because a high q'_ij means the embeddings for X_i and X_j are already almost the same, which is exactly what we're trying to achieve for j ∈ J_i^+, so there is no need to change them further. In contrast, in the ANE loss in (12) one can see in the summation over J_i^+ that as q_ij gets higher, we amplify the gradient less and therefore perturb the model parameters less, which is consistent with intuition.
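A quick numeric illustration of this difference follows. It is heavily simplified (one matching sample, as in the pivot setup of Section 3.2, so 1/c_i = 1; the symmetric q_ji terms and constant factors are dropped), so the numbers are only indicative of the trend, not exact gradient magnitudes.

```python
import numpy as np

# Gradient weight multiplying (f_i - f_j) for a same-word pair j in J+:
# roughly (1/c_i - q_ij) for ANE per (12), versus q'_ij for the triplet
# loss per (18). With c_i = 1, the contrast is (1 - q) versus q.
q = np.linspace(0.01, 0.99, 5)  # current similarity to the matching sample
ane_weight = 1.0 - q            # shrinks once the pair is already well-placed
triplet_weight = q              # grows, perturbing an already-good pair more
for qi, a, t in zip(q, ane_weight, triplet_weight):
    print(f"q={qi:.2f}  ANE weight={a:.2f}  triplet weight={t:.2f}")
```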
Alternate similarity functions for (17) can also be tried. If Sim(f_i, g_j) ≜ -||f_i - g_j||², the gradient in (18) becomes -Σ_j s_ij (f_i - f_j) (up to a constant factor), and if Sim(f_i, g_j) ≜ f_i^T g_j / (||f_i|| · ||g_j||) (cosine similarity), the gradient is -Σ_j s_ij { f_i q'_ij / ||f_i||² - f_j / (||f_i|| · ||f_j||) }, and we can make similar arguments as above that they are less effective than the proposed method.

In summary, both the triplet-loss-based method (He et al., 2017) and the proposed method ANE seem to share the same basic notion of pulling vectors closer if they represent the same word and pushing them apart if not. However, ANE has the added subtlety of pushing or pulling with more "measured strength" based on how good the embeddings currently are.

5 NAME RECOGNITION EXPERIMENTS
Three f-g encoder pairs were trained, then tested using the Euclidean nearest-neighbor matching rule in (1). ANE-p is the proposed system using monophone posteriorgram sequences for every X_n and monophone sequences for every Y_n in Section 3.1, and ANE-g is the same but using graphemes for Y_n. Trip-p uses the same inputs and outputs as ANE-p but is trained by a triplet distance per Section 4 and (17), with the intent of comparing the proposed method with existing methods (He et al., 2017) in the same setting. For all systems, both f and g were Bi-LSTM networks, with 51 input dimensions (1 dimension for each monophone, including a "silence" monophone) for f and 50 input dimensions (excluding the "silence") for g. All f networks had 2 layers and 100 nodes in each direction, and all g networks had 1 layer and 200 nodes in each direction (with the exception of ANE-p-40, which had 300 nodes), with an additional regression layer applied to the last output of every sequence to produce the embedding vector.

Table 2: Results of name recognition based on nearest-neighbor search in (1) using the acoustic encoder f and the text encoder g in tandem, compared with the result from a whole-word FST recognizer. ANE-p used f and g trained by the proposed method using monophone transcriptions for each Y_n (see Section 3.1), while ANE-g used grapheme transcriptions for Y_n. Trip-p used a triplet distance with the same data as ANE-p. The numeric suffixes indicate the number of dimensions in the embedding vectors.

Method                        Accuracy (%)
FST                           97.5
Trip-p-20                     65.5
Trip-p-30                     67.9
ANE-g-20 (proposed method)    96.0
ANE-g-30 (proposed method)    96.3
ANE-p-20 (proposed method)    96.9
ANE-p-30 (proposed method)    97.3
ANE-p-40 (proposed method)    97.4

For training data, speech utterances of varying lengths were extracted from a large set of proprietary data, and their monophone posterior vector sequences and phonetic transcriptions as described in Section 3.1 were obtained using a Finite State Transducer (FST)-based ASR with a DNN-HMM acoustic model (Huang et al., 2020). The lengths of the utterances were randomly chosen so that the overall distribution of the number of phones per utterance roughly matched that of a database of music titles and person names. The process resulted in 4.2M utterances, from which 678K microbatches (160 utterances per microbatch, and 32 microbatches per minibatch) were created for training the f for ANE-p and Trip-p, and 783K microbatches for ANE-g. Prepared in a similar manner, the cross-validation data (used to stop training) consisted of 1.4M utterances, organized into 130K microbatches for ANE-p and Trip-p, and 141K microbatches for ANE-g.

For the isolated name recognition experiment, 19,646 audio utterances of spoken names were prepared as evaluation data. Each utterance had a corresponding user-dependent list of possible names (i.e., the speaker's "phonebook") of varying size, with an average of 1,055 names per phonebook. The phonebook was used to build a whole-word FST recognizer for every utterance, using pronunciations obtained from G2P. The same list of pronunciations was used to generate the g embeddings. The FST recognizer was a subword-to-word transducer designed to directly consume subword posterior sequences instead of audio signals, to ensure that the same inputs were used for both the FST and the f networks.
A single acoustic model was used to generate all training and testing inputs for the f networks, as well as the inputs to the FST wherever needed.

Table 2 shows the accuracy of each method in the name recognition task, for varying dimensions in the embedding vectors. For ANE-p and Trip-p, the result of the rule in (1) was a best-matching phone sequence, and the match was deemed correct when it was an exact match with any pronunciation in the phonebook for the reference (manually-transcribed) name. For ANE-g, a crude normalization was first done on all names (removing special characters and converting to lowercase) for both training and testing data. The result of the rule in (1) was a best-matching normalized name, and the match was deemed correct when it was an exact match with the normalized reference name. ANE-p was more accurate than ANE-g because ambiguities in word pronunciation were better resolved. With the triplet distance, competitive accuracy could not be achieved in word recognition.

In a second experiment, accuracies were measured for user-independent phonebooks of increasing size and are shown in Table 3. Because the phonebook sizes were much larger than in Table 2, accuracy differences can be observed more clearly. For simplicity, a unified phonebook was used for all test samples, where the list contained all the reference names and was increasingly padded with other random names to attain the different sizes in Table 3. There was increasing confusion between the names as the phonebook became larger, especially because most of the names were short (e.g. "Lian" and "Layan").

Table 3: Name recognition results for increasing phonebook size using f and g in tandem. See caption for Table 2 for description of methods. Because the phonebooks are much larger here than in Table 2, the accuracies are lower overall. When the number of dimensions reaches 40, the Euclidean nearest-neighbor search between Acoustic Neighbor Embedding vectors (ANE-p-40) is as accurate as FST decoding.

Method      Accuracy (%) for phonebook size
            20K     100K    500K    1M
FST         88.1    84.4    76.8    72.1
ANE-g-20    79.7    75.7    67.5    62.0
ANE-g-30    81.5    78.0    70.3    64.8
ANE-p-20    86.5    81.7    73.1    68.2
ANE-p-30    88.1    84.1    76.4    71.9
ANE-p-40    88.2    84.4    77.0    72.7

As seen in both Tables 2 and 3, the accuracy of ANE-p increases as the number of dimensions increases because less discriminative information is discarded. For every utterance, the average number of dimensions in the original input used by the FST was 3,662 (51 × the length of the utterance, where the mean length is 71.8 frames). This was reduced to 20, 30, or 40 in Table 3, so the dimension reduction was significant. However, the name recognition accuracy of ANE-p is essentially identical to that of the FST when the number of embedding dimensions is 40.

Table 4: Approximate phonetic match accuracy for isolated names using g. The FST performs poorly because it uses only a general vocabulary where many of the phonebook names are missing. For ANE-p and Trip-p, the g network was used to transform the FST's 1-best phone sequence into an embedding vector, then a nearest-neighbor search was done on the actual phonebook names. See caption for Table 2 for further details on the methods shown.

Method                            Accuracy (%)
FST using a default vocabulary    67.8
Trip-p-20                         86.3
Trip-p-30                         86.8
ANE-p-20 (proposed method)        93.5
ANE-p-30 (proposed method)        93.9

A third experiment uses the phonebooks in the first experiment to perform recognition via approximate phonetic match using only the g networks.
First, a general large-vocabulary (900K) continuous ASR was run. Since many of the spoken names did not exist in the vocabulary, accuracy is low for the FST in Table 4. Next, from the FST's 1-best phone sequence, a g vector was computed, followed by a nearest-neighbor search over the g vectors of the corresponding phonebook. Accuracy was computed based on how often the correct name was chosen, and ANE-p gave the best result.
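Sketched in code, that pipeline looks roughly as follows. `DummyTextEncoder` is a hypothetical stand-in for the trained g network (a real system would run the Bi-LSTM of Section 5), and the toy phone inventory and phonebook are not from the paper.

```python
import numpy as np

PHONES = ["j", "s", "ae", "eh", "k", "ah", "n"]  # toy phone inventory

class DummyTextEncoder:
    """Maps a phone sequence to a fixed-dimension vector by averaging
    random per-phone codes; a placeholder for the trained g network."""
    def __init__(self, dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.codes = {ph: rng.normal(size=dim) for ph in PHONES}

    def __call__(self, phones):
        return np.mean([self.codes[ph] for ph in phones], axis=0)

def phonetic_match(hyp_phones, phonebook, g):
    """Embed the recognizer's 1-best phone sequence with g, then return the
    phonebook name whose g vector is nearest in L2 distance."""
    g_hyp = g(hyp_phones)
    g_book = np.stack([g(phones) for _, phones in phonebook])
    best = int(np.argmin(np.sum((g_book - g_hyp) ** 2, axis=1)))
    return phonebook[best][0]

g = DummyTextEncoder()
phonebook = [("Jackson", ["j", "ae", "k", "s", "ah", "n"]),
             ("Sackson", ["s", "ae", "k", "s", "ah", "n"])]
# With a real trained g, an out-of-vocabulary 1-best such as "Jeckson"
# would land near "Jackson" in the embedding space (cf. Table 1).
print(phonetic_match(["j", "eh", "k", "s", "ah", "n"], phonebook, g))
```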
6 CONCLUSION AND DISCUSSION

This paper has proposed a method of adapting stochastic neighbor embedding (SNE) to sequential data so that spoken words of arbitrary length can be represented as fixed-dimension vectors, called Acoustic Neighbor Embeddings, where the acoustic confusability between words is represented by the Euclidean distance between the vectors. The gradients of the loss function were inspected to show that this neural network training method can be more effective than a known method based on the triplet distance. The efficacy of nearest-neighbor search using ANE in isolated name recognition and phonetic matching experiments was also demonstrated. In particular, experiments demonstrated that its accuracy in isolated word recognition can be identical to that of conventional FST-based decoding for vocabulary sizes up to 1 million.

It is possible that creating only one embedding for each phone sequence does not sufficiently capture all the acoustic variabilities in its pronunciation. SNE (Hinton & Roweis, 2003) already has extensions that allow multiple embeddings per input in the dimension-reduced space, which we may be able to leverage in future work to make more robust matches.
ACKNOWLEDGMENTS
Thanks to Sibel Oyman, Russ Webb, and John Bridle for helpful comments.

REFERENCES
Samy Bengio and Georg Heigold. Word embeddings for speech recognition. In Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech, 2014.

G. Chen, C. Parada, and T. N. Sainath. Query-by-example keyword spotting using long short-term memory networks. In ICASSP, pp. 5236-5240, 2015.

V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using GPU. In CVPR Workshops, pp. 1-6, 2008.

T. J. Hazen, W. Shen, and C. White. Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU, pp. 421-426, 2009.

Wanjia He, Weiran Wang, and Karen Livescu. Multi-view recurrent neural acoustic word embeddings. In International Conference on Learning Representations (ICLR), 2017.

Geoffrey E. Hinton and Sam T. Roweis. Stochastic neighbor embedding. In S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15, pp. 857-864. MIT Press, 2003.

Z. Huang, T. Ng, L. Liu, H. Mason, X. Zhuang, and D. Liu. SNDCNN: Self-normalizing deep CNNs with scaled exponential linear units for speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6854-6858, 2020.

M. Jung, H. Lim, J. Goo, Y. Jung, and H. Kim. Additional shared decoder on Siamese multi-view encoders for learning acoustic word embeddings. In ASRU, pp. 629-636, 2019.

K. Levin, K. Henry, A. Jansen, and K. Livescu. Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In ASRU, pp. 410-415, 2013.

Hugh Christopher Longuet-Higgins. The non-local storage of temporal information. In Proceedings of the Royal Society of London B, 1968.

Andrew L. Maas, Stephen D. Miller, Tyler M. O'Neil, Andrew Y. Ng, and Patrick Nguyen. Word-level acoustic modeling with convolutional vector regression. In ICML 2012 Representation Learning Workshop, 2012.

F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pp. 815-823, June 2015. doi: 10.1109/CVPR.2015.7298682.

S. Settle, K. Levin, H. Kamper, and K. Livescu. Query-by-example search with discriminative neural acoustic word embeddings. In INTERSPEECH 2017, 2017.

S. Settle, K. Audhkhasi, K. Livescu, and M. Picheny. Acoustically grounded word embeddings for improved acoustics-to-word speech recognition. In ICASSP, 2019.