Evaluating the reliability of acoustic speech embeddings
Robin Algayres, Mohamed Salah Zaiem, Benoit Sagot, Emmanuel Dupoux
ENS-PSL, EHESS, CNRS, Paris; Inria, Paris
[email protected], [email protected], [email protected], [email protected]
Abstract
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use the ABX and MAP to predict performances on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
Index Terms: unsupervised speech processing, speech embeddings, frequency estimation, evaluation metrics, representation learning, k-nearest neighbours
1. Introduction
Unsupervised representation learning is the area of research that aims to extract units from unlabelled speech that are consistent with the phonemic transcription [1-4]. As opposed to text, speech is subject to large variability: two speech sequences with the same transcription can have significantly different raw speech signals. In order to work on speech sequences in an unsupervised way, there is a need for robust acoustic representations. To address that challenge, recent methods use speech embeddings, i.e. fixed-size representations of variable-length speech sequences [5-12]. Throughout this paper, a speech sequence is a non-silent part of the speech signal (not necessarily a word); it can be transcribed into a phoneme n-gram.

Speech embeddings can be used in many applications, such as key-word spotting [13-15], spoken term discovery [16-18], and segmentation of speech into words [19-21]. It is convenient to evaluate the reliability of speech embeddings without being tied to a particular downstream task. One way to do that is to compute the intrinsic quality of speech embeddings. The basic idea is that a reliable speech embedding should maximise the information relevant to its type and minimise irrelevant token-specific information. Two popular metrics have been used: the mean average precision (MAP) [22] and the ABX discrimination score [23].

ABX and MAP are mathematically distinct, yet they are expected to correlate well with each other, as they both evaluate the discriminability of speech embeddings in terms of their transcription. However, [6] revealed a surprising result: the best model according to the ABX is also the worst one according to the MAP. Following [6]'s results, we observed that this kind of discrepancy is much more common than we had expected. If a model performs well according to the MAP and badly according to the ABX, which metric should be trusted? For research in this field to go forward, there is a need to quantify the correlation of these two metrics.

In this paper, we go further and check whether MAP and ABX can also predict performance on a downstream task. Such tasks are numerous, but one of them has not yet received enough interest: unsupervised frequency estimation. We define the frequency of a speech sequence as the number of times the phonetic transcription of this sequence appears in the corpus. When dealing with text corpora, frequencies can be computed exactly with a lookup table and are used in many NLP applications (a minimal sketch of this text-side lookup closes this section). In the absence of labels, deriving the frequency of a speech sequence becomes a problem of density estimation. Estimated frequencies can be useful in representation learning by enabling efficient sampling of tokens in a speech database [7]. Also, frequencies could be used for unsupervised word segmentation, using algorithms similar to those used on text [19].

In Section 2, we present the range of embedding models, which can be grouped into five categories of increasing expected reliability: hand-crafted, unsupervised, self-supervised, and supervised, plus a top-line embedding. In Section 3, we present the MAP and ABX metrics and introduce our frequency estimation task. In Section 4, we present results on the five speech datasets from the ZeroSpeech Challenge [2-4]. From these results, we draw guidelines for future improvements in the field of acoustic speech embeddings.
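For concreteness, the following minimal Python sketch shows the exact lookup that is available for transcribed corpora but not for raw speech; the toy phoneme strings are hypothetical and purely illustrative.

from collections import Counter

# Toy corpus of phonemic transcriptions of speech sequences
# (hypothetical strings; in practice one would enumerate the
# transcriptions of all segmented sequences in the corpus).
transcriptions = ["dh ax", "k ae t", "dh ax", "dh ax d ao g", "k ae t"]

# On text, frequency is an exact lookup in a table of counts.
freq = Counter(transcriptions)
print(freq["dh ax"])   # -> 2
print(freq["k ae t"])  # -> 2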
2. Embedding Methods
Neural networks learn representations on top of input features. Therefore we used two types of acoustic features, the log-MEL filterbanks (Mel-F) [24] and the Perceptual Linear Prediction (PLP) [25]. These two features can be considered as two levels of phonetic abstraction: a high-level one (PLP) and a low-level one (Mel-F). Formally, let us define a speech sequence s_t by x_1, x_2, ..., x_T, where x_i ∈ R^n is called a frame of the acoustic features and T is the number of frames in the sequence s_t. In our setting, these frames are spaced every 10 ms, each representing a 25 ms span of the raw signal.

Holzenberger et al. [6] described a method to create fixed-size embedding vectors that requires no training of neural networks: Gaussian down-sampling (GD). Given a sequence s_t, l equidistant frames are sampled and a Gaussian average is computed around each sample. It returns an embedding vector e_t of size l × n for any length T of input sequence. Therefore, given our two acoustic features, two baseline models are derived: the Gaussian-down-sampling-PLP (GD-PLP) and the Gaussian-down-sampling-Mel-F (GD-Mel-F).

Similarly, we derived a simple top-line model. Instead of using hand-crafted features, we can use the transcription of a given random segment. Each frame x_i in a sequence s_t is assigned a 1-hot vector referring directly to the phoneme being said. This model goes through the same Gaussian averaging process to form the Gaussian-down-sampling-1hot (GD-1hot) model. This model is almost the true labels, notwithstanding the information loss due to compression.

A more elaborate way to create speech embeddings is to learn them on top of acoustic features using neural networks. Specifically, recurrent neural networks (RNN) can be trained with back-propagation on an auto-encoding (AE) objective: the RNNAE [6, 10]. Formally, the model is composed of an encoder network, a decoder network and a speaker encoder network. The encoder maps s_t to e_t, a fixed-size vector. The speaker encoder maps the speaker identity to a fixed-size vector spk_t. Then, the decoder concatenates e_t and spk_t and maps them to ŝ_t, a reconstruction of s_t. The three networks are trained jointly to minimise the Mean Square Error between ŝ_t and s_t.
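The following is a minimal numpy sketch of the Gaussian down-sampling baseline described above; the function name, the choice l = 10 and the kernel-width heuristic are ours for illustration, not values from the paper.

import numpy as np

def gaussian_downsample(frames, l=10, sigma=None):
    # frames: (T, n) matrix of acoustic features for one sequence.
    # Returns a fixed-size vector of length l * n for any T.
    T, n = frames.shape
    if sigma is None:
        sigma = T / (2.0 * l)            # assumed width heuristic
    centers = np.linspace(0, T - 1, l)   # l equidistant sample points
    t = np.arange(T)
    parts = []
    for c in centers:
        w = np.exp(-((t - c) ** 2) / (2 * sigma ** 2))
        w /= w.sum()                     # normalised Gaussian weights
        parts.append(w @ frames)         # weighted average frame, shape (n,)
    return np.concatenate(parts)         # shape (l * n,)

# Any input length T maps to the same embedding size:
e_short = gaussian_downsample(np.random.randn(23, 40))
e_long = gaussian_downsample(np.random.randn(187, 40))
assert e_short.shape == e_long.shape == (400,)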
We consider two popular embedding models. They are also encoder-decoders but they use additional side information. One is trained according to the Siamese objective [7, 26], the other according to a correspondence auto-encoder (CAE) objective [12]. Both models assume a set of pairs of sequences from the training corpus. Positive pairs are assumed to have the same transcription; negative pairs, different transcriptions. Let p_t = (s_t, s_t', y), where (s_t, s_t') is a pair of sequences of lengths T and T'. A binary value y indicates the positive or negative nature of the pair. We will see below how to find such pairs.

The CAE objective uses only positive pairs. The auto-encoder is asked to encode s_t into e_t and decode it into ŝ_t. The speaker encoder network is used as for the RNNAE. To satisfy the CAE objective, ŝ_t has to minimise the Mean Square Error with s_t'. This forces the auto-encoder to learn a common representation for s_t and s_t'.

The Siamese objective does not need the decoder network. It encodes both s_t and s_t' and forces the encoder to learn a similar or different representation depending on whether the pair is positive or negative:

L_s(e_t, e_t', y) = y cos(e_t, e_t') − (1 − y) max(0, cos(e_t, e_t') − γ)

where cos is the cosine similarity and γ is a margin. The latter accounts for negative pairs whose transcriptions have phonemes in common; such pairs should not have embeddings 'too' far away from each other. The CAE and Siamese objectives can also be combined into a CAE-Siamese loss by a weighted sum of their respective loss functions [27].

Finding positive pairs of speech sequences is an area of research called unsupervised term discovery (UTD) [16-18, 28]. Such UTD systems can be based on DTW alignment [16] or involve a k-Nearest-Neighbours search [28]. We opted for the latter, as it is both scalable and among the state-of-the-art methods. It encodes exhaustively all possible speech sequences with an embedding model and uses optimised k-NN search [29] to retrieve acoustically similar pairs of speech sequences (see the details in [28]). In our experiments, we used the pairs retrieved by k-NN on GD-PLP encoded sequences to train our self-supervised models (CAE, Siamese, CAE-Siamese).

As a supervised alternative, it is possible to sample 'gold' pairs, i.e. pairs of elements that have the exact same transcription. These 'gold' pairs are given to the CAE, Siamese and CAE-Siamese to train supervised models. These supervised models indicate how good the self-supervised models could be if we enhanced the UTD system.
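Below is a minimal numpy sketch of the Siamese objective exactly as printed above; the helper names and the margin value γ = 0.5 are ours. Read as written, the quantity rewards similar embeddings for positive pairs and penalises negative pairs whose similarity exceeds the margin, so training would maximise it (equivalently, minimise its negative).

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def siamese_objective(e_t, e_tp, y, gamma=0.5):
    # y = 1 for a positive pair (same transcription), 0 for a
    # negative pair. The margin gamma tolerates negative pairs
    # sharing phonemes, as long as cos stays below gamma.
    c = cosine(e_t, e_tp)
    return y * c - (1 - y) * max(0.0, c - gamma)

rng = np.random.default_rng(0)
a, b = rng.standard_normal(64), rng.standard_normal(64)
print(siamese_objective(a, a, y=1))  # positive pair, cos = 1 -> 1.0
print(siamese_objective(a, b, y=0))  # negative pair, penalised only above gamma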
3. Evaluation metrics and frequency estimation
The intrinsic quality of an acoustic speech embedding can be measured using two types of discrimination tasks: the MAP (also called same-different) task [22] and the ABX task [23]. Let us consider a set of n acoustic speech embeddings ((e_1, t_1), (e_2, t_2), ..., (e_n, t_n)), where the e_i are the embeddings and the t_i their transcriptions.

The ABX task creates all possible triplets (e_a, e_b, e_x) such that t_a = t_x and t_b ≠ t_x. The model is asked to predict 1 or 0 to indicate whether e_x is of type t_a or t_b. Such triplets are instances of the phonetic contrast between t_a and t_b. Formally, for a given triplet, the task is to predict:

y(e_x, e_a, e_b) = 1 if d(e_a, e_x) ≤ d(e_b, e_x), else 0

The error rate on this classification task is the ABX score. It is first averaged by type of contrast (all triplets having the same t_a and t_b), then averaged over all contrasts.

The MAP task forms a list of all possible pairs of embeddings (e_a, e_x). The model is asked to predict 1 or 0 to indicate whether e_x and e_a have the same type, i.e. the same transcription, or not. Formally, for a given pair, the model predicts:

y(e_a, e_x, θ) = 1 if d(e_a, e_x) ≤ θ, else 0

The precision and recall on this classification task are computed for various values of θ. The final score of the MAP task is obtained by integrating over the precision-recall curve.

Frequency estimation metric

Here, we introduce the novel task of frequency estimation: the assignment, to each speech sequence, of a positive real value that correlates with how frequent the transcription of this sequence is in a given reference corpus. This estimation could be correct only up to a scaling coefficient; finding exact count estimates is a harder task, not tackled in this paper. To evaluate the quality of frequency estimates, we use the correlation R between estimated and true frequencies. We compute this number in log space, to take into account the power-law distribution of frequencies in natural languages [30]. This coefficient is between 0 and 1 and tells what percentage of variance in the true frequencies can be explained by the estimated frequencies.

k-NN and density estimation

We propose to estimate frequencies using density estimation, also called the Parzen-Rosenblatt window method [31]. Let N be the number of speech sequence embeddings. First, these N embeddings are indexed into a k-NN graph, noted G, where all distances between embeddings are computed. Then, for each embedding, we search for the k closest embeddings in G. Formally, given an embedding e_t from the k-NN graph G, we compute its k distances to its k closest neighbours (d_{n_1}, ..., d_{n_k}). The frequency estimate is a density estimation function κ of the k-NN graph G that has three parameters: a Gaussian kernel β, the number of neighbours k, and the embedding e_t:

κ_G(e_t, β, k) = Σ_{i=1}^{k} exp(−β d_{n_i})

This density estimation yields a real number in [1, k], which we take as our frequency estimate. We set k to the maximal frequency that should be predicted, computed from the transcription of our training corpus (the Buckeye, see Section 4). Then, we must tune β, the dilation of the space of a given embedding model. For each model, we choose β such that it maximises the variance of the estimated log frequencies, thereby covering the whole spectrum of possible log frequencies, in our case [0, log(k)], which is beneficial for power-law types of distribution. Note that the β kernel cannot be too large (resp. small), as it would predict only low (resp. high) values. A short sketch of this estimator follows.
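Below is a minimal sketch of the density estimator and of the log-space correlation used to evaluate it, with FAISS [29] for the k-NN search; the function names and the illustrative values of k and β are ours (in practice β is first tuned per model as described above).

import numpy as np
import faiss  # optimised k-NN search library [29]

def knn_frequency_estimates(indexed, queries, k, beta):
    # Index the N embeddings, then retrieve for each query its k
    # nearest neighbours (the k-NN graph G of Section 3).
    index = faiss.IndexFlatL2(indexed.shape[1])
    index.add(indexed.astype(np.float32))
    sq_dist, _ = index.search(queries.astype(np.float32), k)
    dist = np.sqrt(np.maximum(sq_dist, 0.0))  # IndexFlatL2 returns squared L2
    # kappa_G(e_t, beta, k) = sum_i exp(-beta * d_{n_i})
    return np.exp(-beta * dist).sum(axis=1)

def log_space_R(true_freq, est_freq):
    # Correlation R between true and estimated frequencies, computed
    # in log space to respect the power-law distribution [30].
    return float(np.corrcoef(np.log(true_freq), np.log(est_freq))[0, 1])

emb = np.random.randn(10000, 64).astype(np.float32)  # stand-in embeddings
est = knn_frequency_estimates(emb, emb, k=100, beta=1.0)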
We compared density estimation with an alternative method: the clustering of speech embeddings. Kamper et al. [32] did a thorough benchmark of clustering methods on the task of clustering speech embeddings. Across all their metrics, the model that performs best is Hierarchical-K-means (HC-K-means), an improved version of K-means with a higher computational cost; in particular, HC-K-means performs better than GMMs. HC-K-means is not scalable to our data sets, so we extracted 1% of the Buckeye corpus in order to compare it with our method; a similar corpus size is used in [32].

We applied k-NN, as well as K-means and HC-K-means from the python library scikit-learn [33], to three of our models on this subset. For K-means and HC-K-means, we used the hyper-parameters that gave the best scores in [32], namely k-means++ initialisation and the average linkage function for HC-K-means. On our subset, the ground-truth number of clusters is K = 33000. Yet, we did a grid-search on the value of K that maximises the R score for frequency estimation, and found that K-means and HC-K-means perform better for K = 20000. It shows that these algorithms are not tuned to handle data distributed according to Zipf's law: K-means is subject to the so-called 'uniform effect' and tends to find clusters of uniform sizes [34]. Table 1 shows that even after optimising the number of clusters K, the k-NN method outperforms K-means and HC-K-means. A sketch of the clustering-based estimate follows Table 1.

Table 1: Frequency estimations (R) using K-means, HC-K-means and k-NN density estimation on a subset of the Buckeye corpus.

Models/methods       K-means   HC-K-means   k-NN
GD-1hot              0.67      0.73
RNNAE Mel-F          0.30      0.35
CAE-Siamese Mel-F    0.26      0.37
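For contrast, here is a minimal sketch of the clustering-based alternative, where the size of the cluster an embedding falls into serves as its frequency estimate; the toy data and K = 50 are illustrative (not the K = 20000 used above), and plain K-means stands in for HC-K-means for brevity.

import numpy as np
from sklearn.cluster import KMeans  # scikit-learn [33]

emb = np.random.randn(2000, 64)               # stand-in embeddings
km = KMeans(n_clusters=50, init="k-means++", n_init=10).fit(emb)
sizes = np.bincount(km.labels_, minlength=50)  # tokens per cluster
est_freq = sizes[km.labels_]                   # frequency estimate per token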
4. Experiments
Five data sets were at our disposal from the ZeroSpeech challenge: Conversational English (a sample of the Buckeye corpus [35]), English, French, Xitsonga and Mandarin [2, 3]. These are multi-speaker, non-overlapping (i.e. one speaker per file) recordings of speech. All silences were removed using voice activity detection and corrected manually.

Each corpus was split into all possible segmentations to produce random speech sequences, as described in [28]. Random speech sequences span a bounded range of durations: sequences below the lower bound may contain less than one phoneme or ill-pronounced phonemes, so we removed very short sequences to avoid issues that are out of the scope of this study.

The Buckeye sample corpus contains 12 speakers and 5 hours of speech. The French and English corpora being much larger, we reduced their numbers of speech sequences and speakers to the size of the Buckeye. Mandarin and Xitsonga are smaller data sets and were left untouched.

Our encoder-decoder network is a specific use of a three-layer bi-directional LSTM, as described by Holzenberger et al. [6], with hyper-parameters selected to minimise the ABX error on the Buckeye corpus. The speaker embedding network is a single fully connected layer with fifteen neurons. Our UTD system [28] uses the embeddings of the GD-PLP model. A set of speech pairs is returned, sorted by cosine similarity. We selected the pairs whose cosine similarity lies above a threshold that seemed optimal on the Buckeye corpus according to the ABX metric. In comparison, we trained our supervised models with 'gold' pairs, i.e. pairs with the exact same transcription.

Each corpus was randomly split into train (90%), dev (5%) and test (5%) sets. Neural networks were trained on the train set, early stopping was done using the development set, and metrics were computed on the test set. Specifically, we trained each model on the five training sets using the Buckeye's hyper-parameters. MAP and ABX were computed on the test sets. Frequency estimation was computed by indexing the five training sets and building k-NN graphs. For each element of a given test set, we searched for neighbours and estimated frequencies using the k-NN graphs. We used the FAISS library [29], which provides an optimised k-NN implementation.

The results of the two metrics and the downstream task are shown in Figure 1, and the following broad trends can be observed.

[Figure 1: Value of metrics and the downstream task across models and corpora. The 'average' column is the average score over all corpora.]

• Supervised models yield substantially lower performance than the ground-truth 1-hot encodings, on all metrics and all languages. These supervised models have a margin for improvement, as they do not learn optimal embeddings despite having access to ground-truth labels.
• Supervised models outperform their corresponding self-supervised models, in almost all metrics and for all languages. It means that self-supervision also has a margin for improvement, given better pairs from the UTD systems.
• Among self-supervised and supervised models, the CAE-Siamese Mel-F takes the pole position. This model seems able to combine the advantages of both training objectives, a result already claimed by [27].
• (Self-)supervised neural networks trained on low-level acoustic features (Mel-F) perform better than, or equally well as, those trained on high-level acoustic features (PLP). This shows that neural networks can learn their own high-level acoustic features from low-level information.
• Self-supervised models are expected to outperform unsupervised models because they use side information. Yet many configurations do not show this consistently. Only the Buckeye data set seems consistent, but this data set is the one on which pairs were selected through a grid-search to minimise the ABX error. This may be due to the variable quality of the pairs found by UTD; better UTD is therefore needed to help self-supervised models.
• Unsupervised models are supposed to be better than hand-crafted models because they can adjust by learning from the data set. Yet, this is not consistently found: hand-crafted models are worse than unsupervised models for ABX and frequency estimation, but not for MAP.
• In detail, which model is best in a particular language depends on the metric.
In Table 2, we quantified the possibility of observing the discrepancies that we have just discussed. We computed the correlation R across the three 'average' columns. Cross-correlation scores range from R = 0.34 to 0.53; the top-line model is not included when computing these scores.

Table 2: Correlation R across the 'average' columns of MAP, ABX and frequency estimation.

R                Frequency est.   MAP    ABX
Frequency est.   1.0              0.34   0.53
MAP              0.34             1.0    0.45
ABX              0.53             0.45   1.0

These correlations are low enough to permit sizeable discrepancies across metrics and the downstream task. One of our models, the RNNAE Mel-F, epitomises the problem: it is comparatively bad according to the MAP but good according to ABX and frequency estimation. It means that MAP and ABX reveal different aspects of the reliability of embedding models. Therefore, only large progress according to one metric assures progress according to another metric. This shows the limit of intrinsic evaluation of speech embeddings: moderate variations on an intrinsic metric cannot guarantee progress on a given downstream task.

ABX and MAP scores are averages over multiple phonetic contrasts. These contrasts could be clustered based on their phonetic frequencies, average lengths or numbers of phonemes in common. Such fine-grained analyses can sometimes help understand divergences across metrics. However, we have been unable to find a categorisation of results that makes sense of Figure 1 as a whole. There are currently no fully reliable metrics to assess the intrinsic quality of speech embeddings.
5. Conclusion
We quantified the correlation across two intrinsic metrics (MAP and ABX) and a novel downstream task: frequency estimation. Although MAP and ABX agree on general categories (like supervised versus unsupervised embeddings), we also found large discrepancies when it comes to selecting a particular model, highlighting the limits of these intrinsic quality metrics. However convenient intrinsic metrics may be, they only show partial views of the overall reliability of a model. We showed, using frequency estimation, that variations on intrinsic quality metrics should not be taken as certain progress on downstream tasks. More attention should be brought to downstream tasks, which have the merit of answering practical problems.
6. Acknowledgements
We thank Matthijs Douze for useful comments on density estimation. This work was funded in part by the Agence Nationale pour la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute), CIFAR, and a research gift by Facebook.

7. References

[1] A. Jansen, E. Dupoux, S. Goldwater, M. Johnson, S. Khudanpur, K. Church, N. Feldman, H. Hermansky, F. Metze, R. Rose, M. Seltzer, P. Clark, I. McGraw, B. Varadarajan, E. Bennett, B. Borschinger, J. Chiu, E. Dunbar, A. Fourtassi, D. Harwath, C. Lee, K. Levin, A. Norouzian, V. Peddinti, R. Richardson, T. Schatz, and S. Thomas, "A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition," in ICASSP, 2013, pp. 8111-8115.
[2] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao Kam, X. Anguera, A. Jansen, and E. Dupoux, "The zero resource speech challenge 2015," in INTERSPEECH, 2015.
[3] E. Dunbar, X. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, "The zero resource speech challenge 2017," CoRR, vol. abs/1712.04313, 2017. [Online]. Available: http://arxiv.org/abs/1712.04313
[4] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. Black, L. Besacier, S. Sakti, and E. Dupoux, "The zero resource speech challenge 2019: TTS without T," in INTERSPEECH, 2019.
[5] H. Kamper, "Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models," CoRR, vol. abs/1811.00403, 2018. [Online]. Available: http://arxiv.org/abs/1811.00403
[6] N. Holzenberger, M. Du, J. Karadayi, R. Riad, and E. Dupoux, "Learning word embeddings: Unsupervised methods for fixed-size representations of variable-length speech segments," in INTERSPEECH, 2018, pp. 2683-2687.
[7] R. Riad, C. Dancette, J. Karadayi, N. Zeghidour, T. Schatz, and E. Dupoux, "Sampling strategies in siamese networks for unsupervised speech representation learning," CoRR, vol. abs/1804.11297, 2018. [Online]. Available: http://arxiv.org/abs/1804.11297
[8] A. L. Maas, S. D. Miller, T. M. O'Neil, and A. Y. Ng, "Word-level acoustic modeling with convolutional vector regression."
[9] K. Levin, K. Henry, A. Jansen, and K. Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in ASRU, 2013, pp. 410-415.
[10] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-y. Lee, and L.-S. Lee, "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in INTERSPEECH, 2016.
[11] Y. Chung, W. Weng, S. Tong, and J. R. Glass, "Unsupervised cross-modal alignment of speech and text embedding spaces," CoRR, vol. abs/1805.07467, 2018. [Online]. Available: http://arxiv.org/abs/1805.07467
[12] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, "A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge," in INTERSPEECH, 2015.
[13] T. J. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in ASRU, 2009, pp. 421-426.
[14] K. Levin, A. Jansen, and B. Van Durme, "Segmental acoustic indexing for zero resource keyword search," in ICASSP, 2015, pp. 5828-5832.
[15] Y. Wang, H. Lee, and L. Lee, "Segmental audio word2vec: Representing utterances as sequences of vectors with applications in spoken term detection," CoRR, vol. abs/1808.02228, 2018. [Online]. Available: http://arxiv.org/abs/1808.02228
[16] A. S. Park and J. R. Glass, "Unsupervised pattern discovery in speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186-197, 2008.
[17] C.-y. Lee, T. J. O'Donnell, and J. Glass, "Unsupervised lexicon discovery from acoustic input," Transactions of the Association for Computational Linguistics, 2015.
[19] S. Goldwater, T. L. Griffiths, and M. Johnson, "A Bayesian framework for word segmentation: Exploring the effects of context," Cognition, vol. 112, pp. 21-54, 2009.
[20] H. Kamper, A. Jansen, and S. Goldwater, "A segmental framework for fully-unsupervised large-vocabulary speech recognition," Computer Speech and Language, 2017.
[21] K. Kawakami, C. Dyer, and P. Blunsom, "Unsupervised word discovery with segmental neural language models," CoRR, vol. abs/1811.09353, 2018. [Online]. Available: http://arxiv.org/abs/1811.09353
[22] M. A. Carlin, S. Thomas, A. Jansen, and H. Hermansky, "Rapid evaluation of speech representations for spoken term discovery," in INTERSPEECH, 2011.
[23] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, "Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline," in INTERSPEECH, 2013, pp. 1-5.
[24] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 357-366, 1980.
[25] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990.
[26] R. Thiollière, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, "A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling," in INTERSPEECH, 2015.
[27] P. Last, H. A. Engelbrecht, and H. Kamper, "Unsupervised feature learning for speech using correspondence and siamese networks," IEEE Signal Processing Letters, vol. 27, pp. 421-425, 2020.
[28] A. Thual, C. Dancette, J. Karadayi, J. Benjumea, and E. Dupoux, "A k-nearest neighbours approach to unsupervised spoken term discovery," in SLT, 2018, pp. 491-497.
[29] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity search with GPUs," arXiv preprint arXiv:1702.08734, 2017.
[30] G. K. Zipf, The Psycho-Biology of Language, 1935.
[31] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, no. 3, pp. 1065-1076, 1962. [Online]. Available: https://doi.org/10.1214/aoms/1177704472
[32] H. Kamper, A. Jansen, S. King, and S. Goldwater, "Unsupervised lexical clustering of speech segments using fixed-dimensional acoustic embeddings," in SLT, 2014, pp. 100-105.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[34] J. Wu, The Uniform Effect of K-means Clustering, 2012, pp. 17-35.
[35] M. A. Pitt, K. Johnson, E. Hume, S. Kiesling, and W. Raymond, "The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability," Speech Communication, 2005.