Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
Felix Kreuk, Joseph Keshet, Yossi Adi
Bar-Ilan University, Facebook AI Research
[email protected]
Abstract
We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries. As such, the proposed model is trained in a fully unsupervised manner, with no manual annotations in the form of target boundaries or phonetic transcriptions. We compare the proposed approach to several unsupervised baselines using both the TIMIT and Buckeye corpora. Results suggest that our approach surpasses the baseline models and reaches state-of-the-art performance on both data sets. Furthermore, we experimented with expanding the training set with additional examples from the Librispeech corpus. We evaluated the resulting model on distributions and languages that were not seen during the training phase (English, Hebrew, and German) and showed that utilizing additional untranscribed data is beneficial for model performance. Our implementation is available at: https://github.com/felixkreuk/UnsupSeg.

Index Terms: Unsupervised Phoneme Segmentation, Self-Supervised Learning, Noise-Contrastive Estimation
1. Introduction
Phoneme segmentation, or phoneme boundary detection, is an important precursor task for many speech and audio applications such as Automatic Speech Recognition (ASR) [1, 2, 3], speaker diarization [4], keyword spotting [5], and speech science [6, 7].

The task of phoneme boundary detection has been explored under both supervised and unsupervised settings [8, 9, 10, 11]. Under the supervised setting, two schemes have been considered: text-independent speech segmentation, and phoneme-to-speech alignment, also known as forced alignment, which is a text-dependent task. In the former setup, the model is provided with target boundaries, while in the latter setup, the model is provided with additional information in the form of a set of pronounced or presumed phonemes. In both schemes, the goal is to learn a function that maps the speech utterance to the target boundaries as accurately as possible. However, creating annotated data of phoneme boundaries is a strenuous process, often requiring domain expertise, especially in low-resource languages [12]. As a consequence, unsupervised methods, and Self-Supervised Learning (SSL) methods in particular, are highly desirable and even essential.

Figure 1: An illustration of our model and SSL training scheme. The solid line represents a reference frame z, the dashed line represents its positive pair, and the dotted lines represent negative distractor frames randomly sampled from the signal.

In unsupervised phoneme boundary detection, also called blind segmentation [13, 10], the model is trained to find phoneme boundaries using the audio signal only. In the self-supervised setting, the unlabeled input is used to define an auxiliary task that can generate labeled pseudo training data, which can then be used to train the model using supervised techniques. SSL has proven to be effective in natural language processing [14, 15] and vision [16], and has recently been shown to generate useful representations for speech processing [17, 18]. Most of the SSL work in the domain of speech processing and recognition has focused on extracting acoustic representations for the task of ASR [17, 18]. However, it remains unclear how effective SSL methods are when applied to other speech processing applications.

In this work, we explore the use of SSL for phoneme boundary detection. Specifically, we suggest learning a feature representation from the raw waveform to identify spectral changes and detect phoneme boundaries accurately. We optimize a Convolutional Neural Network (CNN) using the Noise-Contrastive Estimation principle [19] to distinguish between pairs of adjacent frames and pairs of random distractor frames. The proposed model is depicted in Figure 1. During inference, a peak-detection algorithm is applied over the model outputs to produce the final segment boundaries.

We evaluate our method on the TIMIT [20] and Buckeye [21] datasets. Results suggest that the proposed approach is more accurate than other state-of-the-art unsupervised segmentation methods. We conducted further experiments with a larger amount of untranscribed data taken from the Librispeech corpus. Such an approach proved beneficial for better overall performance on unseen languages.

Our contributions:
• We demonstrate the efficiency of SSL, in terms of model performance, for learning effective representations for unsupervised phoneme boundary detection.
• We provide state-of-the-art results for the task of unsupervised phoneme segmentation on several datasets.
• We provide empirical evidence that leveraging more unlabeled data leads to better overall performance on unseen languages.

The paper is organized as follows: In Section 3 we formally set the notation and definitions used throughout the paper, as well as the proposed model. Section 4 provides empirical results and analysis. In Section 2 we refer to the relevant prior work. We conclude the paper with a discussion in Section 5.
2. Related work
The task of phoneme boundary detection has been explored in various settings. Under the supervised setting, the most common approach is forced alignment. In this setup, previous work mainly involved hidden Markov models (HMMs) or structured prediction algorithms on handcrafted input features [22, 23]. In the text-independent setup, most previous work reduced the task of phoneme segmentation to a binary classification at each time-step [24, 9]. More recently, [8] suggested using an RNN coupled with structured loss parameters.

Under the unsupervised setting, the speech utterance is provided by itself, with no boundaries as supervision. Traditionally, signal processing methods were used to detect spectral changes over time [25, 26, 27, 13]; such areas of change were presumed to be the boundaries of speech units. Recently, Michel et al. [10] suggested training a next-frame prediction model using an HMM or RNN. Regions of high prediction error were identified using peak detection and flagged as phoneme boundaries. More recently, Wang et al. [28] suggested training an RNN autoencoder and tracking the norm of various intermediate gate values (the forget-gate for LSTMs and the update-gate for GRUs). To find phoneme boundaries, similar peak detection techniques were applied to the gate norms over time.

In the field of self-supervised learning, van den Oord et al. [17] and Schneider et al. [18] suggested training a convolutional neural network to distinguish true future samples from random distractor samples using a probabilistic contrastive loss. Also called Noise-Contrastive Estimation, this approach exploits unlabeled data to learn a representation in an unsupervised manner. The resulting representations proved useful for a variety of downstream supervised speech tasks such as ASR and speaker identification.
3. Model
Following the recent success of contrastive self-supervised learning [16, 17, 18], we propose a training scheme for learning useful representations for unsupervised phoneme boundary detection.

We denote the domain of audio samples by $\mathcal{X} \subset \mathbb{R}$. The representation of a raw speech signal is therefore a sequence of samples $x = (x_1, \ldots, x_T)$, where $x_t \in \mathcal{X}$ for all $1 \le t \le T$. The length of the input signal varies for different inputs, thus the number of input samples in the sequence, $T$, is not fixed. We denote by $\mathcal{X}^*$ the set of all finite-length sequences over $\mathcal{X}$.

Denote by $z = (z_1, \ldots, z_L)$ a sequence of spectral representations sampled at a low frequency. Each element in the sequence is an $N$-dimensional real vector, $z_i \in \mathcal{Z} \subseteq \mathbb{R}^N$ for $1 \le i \le L$. Every element $z_i$ corresponds to a 10 ms frame of audio with a processing window of 30 ms. Let $\mathcal{Z}^*$ denote all finite-length sequences over $\mathcal{Z}$.

We learn an encoding function $f : \mathcal{X}^* \to \mathcal{Z}^*$ from the domain of audio sequences to the domain of spectral representations. The function $f$ is optimized to distinguish between pairs of adjacent frames in the sequence $z$ and pairs of randomly sampled distractor frames from $z$. Denote by $D(z_i)$ the set of frames non-adjacent to $z_i$,
$$D(z_i) = \{ z_j : |i - j| > 1,\; z_j \in z \}. \quad (1)$$
In practice we use $K$ randomly selected frames from $D(z_i)$, denoted $D_K(z_i) \subset D(z_i)$. The loss for frame $z_i$ is defined as
$$\hat{L}(z_i, D_K(z_i)) = -\log \frac{e^{\mathrm{sim}(z_i, z_{i+1})}}{\sum_{z_j \in \{z_{i+1}\} \cup D_K(z_i)} e^{\mathrm{sim}(z_i, z_j)}}, \quad (2)$$
where $\mathrm{sim}(u, v) = u^\top v / \|u\|\,\|v\|$ denotes the cosine similarity between two vectors $u$ and $v$. Overall, given a training set of $m$ examples $S = \{x_i\}_{i=1}^{m}$, we minimize the following objective function:
$$L = \sum_{x \in S} \; \sum_{z_i \in f(x)} \hat{L}(z_i, D_K(z_i)). \quad (3)$$

During inference, we receive a new utterance $x$ and apply the encoding function to get $z = f(x)$. We set the score for a boundary at time $i$ to be the dissimilarity between the $i$-th frame and the $(i+1)$-th frame for $i = 1, \ldots, L-1$. That is,
$$\mathrm{score}(z_i) = 1 - \mathrm{sim}(z_i, z_{i+1}). \quad (4)$$
Intuitively, $\mathrm{score}(z_i)$ can be interpreted as the model's confidence that the next frame $z_{i+1}$ belongs to a different segment than the current frame $z_i$. Thus, times with high dissimilarity values are associated with segment changes and are considered candidates for segment boundaries. We apply a peak detection algorithm over the dissimilarity values $\mathrm{score}(z)$ to get the final segmentation: the frames for which the score exceeds a peak prominence of $\delta$ are predicted as boundaries. The optimal value of $\delta$ is tuned in a cross-validation procedure.

Figure 2 presents an example utterance from TIMIT. The power spectrum of the utterance is presented in (a), the score function in (b), and the corresponding learned representation $z$ in (c).
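To make Eqs. (2)–(4) concrete, the following is a minimal PyTorch sketch of the per-utterance loss and of the inference-time scoring and peak picking. It is an illustration under stated assumptions, not the authors' exact implementation: the per-frame Python loop (a real implementation would vectorize it) and the default values of K and delta are ours.

```python
import torch
import torch.nn.functional as F
from scipy.signal import find_peaks


def nce_loss(z, K=5):
    """Per-utterance loss of Eqs. (2)-(3).

    z: (L, N) tensor of frame representations z = f(x).
    For each frame z_i the positive is z_{i+1}; the negatives are K
    frames sampled from D(z_i) = {z_j : |i - j| > 1}.
    """
    z = F.normalize(z, dim=-1)  # after normalization, dot product == cosine sim
    L = z.shape[0]
    losses = []
    for i in range(L - 1):
        pos = (z[i] * z[i + 1]).sum()  # sim(z_i, z_{i+1})
        cand = torch.tensor([j for j in range(L) if abs(i - j) > 1])
        idx = cand[torch.randperm(len(cand))[:K]]  # K random distractors
        neg = z[idx] @ z[i]  # sim(z_i, z_j) for each distractor
        logits = torch.cat([pos.view(1), neg])
        losses.append(-torch.log_softmax(logits, dim=0)[0])  # Eq. (2)
    return torch.stack(losses).sum()  # summed over frames, as in Eq. (3)


def boundaries(z, delta=0.3):
    """Inference: Eq. (4) scoring followed by peak detection."""
    z = F.normalize(z, dim=-1)
    score = 1.0 - (z[:-1] * z[1:]).sum(dim=-1)  # dissimilarity per time step
    peaks, _ = find_peaks(score.detach().cpu().numpy(), prominence=delta)
    return peaks  # frame indices; multiply by the 10 ms shift to get times
```

In practice, delta would be tuned on the validation set as described above.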
4. Experiments
In this section, we provide a detailed description of the experiments. We start by presenting the experimental setup, then outline the evaluation method, and conclude the section with experimental results and analysis.
The function f was implemented as a convolutional neural network, constructed of five blocks of 1-D strided convolutions, each followed by Batch Normalization and a Leaky ReLU [30] non-linear activation function. The network f has kernel sizes of (10, 8, 4, 4, 4), strides of (5, 4, 2, 2, 2), and 256 channels per layer; at 16 kHz, these strides produce one frame per 10 ms, matching the frame rate defined in Section 3. Finally, the output is linearly projected by a fully-connected layer. Overall, the model is similar to the one proposed by [17, 18]; however, unlike the aforementioned prior work, the proposed model does not utilize a context network. Our experiments with such a network led to inferior performance, and this component was therefore omitted from the final model architecture.
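For illustration, a sketch of such an encoder in PyTorch follows. The layer widths, kernels, and strides mirror the description above; the output dimension of the final projection is not specified in the text, so out_dim below is an assumed placeholder.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Five Conv1d -> BatchNorm -> LeakyReLU blocks over raw audio.

    With kernels (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2), the total
    stride is 160 samples (one frame per 10 ms at 16 kHz) and the
    receptive field is 465 samples (roughly 30 ms), matching Section 3.
    """

    def __init__(self, channels=256, out_dim=64):  # out_dim is an assumption
        super().__init__()
        kernels, strides = (10, 8, 4, 4, 4), (5, 4, 2, 2, 2)
        blocks, in_ch = [], 1
        for k, s in zip(kernels, strides):
            blocks += [nn.Conv1d(in_ch, channels, k, stride=s),
                       nn.BatchNorm1d(channels),
                       nn.LeakyReLU()]
            in_ch = channels
        self.conv = nn.Sequential(*blocks)
        self.proj = nn.Linear(channels, out_dim)  # final linear projection

    def forward(self, x):  # x: (batch, samples) raw waveform
        h = self.conv(x.unsqueeze(1))  # -> (batch, channels, frames)
        return self.proj(h.transpose(1, 2))  # -> (batch, frames, out_dim)


# z = Encoder()(torch.randn(8, 16000))  # 1 s of audio -> 98 frames each
```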
Table 1: Comparison of phoneme segmentation models using the TIMIT and Buckeye data sets. Precision and recall are calculated with a tolerance value of 20 ms. Results marked with * are reported using our own optimization.

Setting        Model                TIMIT                               Buckeye
                                    Precision  Recall  F1     R-val     Precision  Recall  F1     R-val
Unsupervised   Hoang et al. [29]    -          -       78.20  81.10     -          -       -      -
               Michel et al. [10]   74.80      81.90   78.20  80.10     69.34*     *       *      *
               Wang et al. [28]     -          -       -      83.16     69.61*     *       *      *
               Ours                 83.89      83.55   83.71  86.02     75.78      76.86   76.31  79.69
Supervised     King et al. [24]     87.00      84.80   85.90  87.80     -          -       -      -
               Franke et al. [9]    91.10      88.10   89.60  90.80     87.80      83.30   85.50  87.17
               Kreuk et al. [8]     94.03      90.46   92.22  92.79     85.40      89.12   87.23  88.76
Figure 2: An illustration of the prediction produced by our model: (a) the original spectrogram; (b) our model's output at each time step, where red dashed lines represent the ground-truth segmentation; (c) the learned representation z.

We optimized the model using a batch size of 8 examples and a learning rate of 1e-4 for 50 epochs, following an early-stopping criterion computed over the validation set. All reported results are averaged over a set of 3 runs using cross-validation with different random seed values. To construct D_K(z_i) we experimented with several values of K, but did not observe significant differences in performance.

We evaluated our model on both the TIMIT and Buckeye corpora. For the TIMIT corpus, we used the standard train/test split, where we randomly sampled 10% of the training set for validation. For Buckeye, we split the corpus at the speaker level into training, validation, and test sets with a ratio of 80/10/10. Similarly to [8], we split long sequences into smaller ones by cutting during noises, silences, and un-transcribed segments. Overall, each sequence started and ended with a maximum of 20 ms of non-speech.

Following previous work on phoneme boundary detection [10, 28], we evaluated the performance of the proposed models and baseline models using precision (P), recall (R), and F1-score with a tolerance level of 20 ms. (All experiments were conducted at Bar-Ilan University.)
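For completeness, a sketch of tolerance-based scoring follows. The greedy one-to-one matching of predicted to reference boundaries is an assumption on our part, since the exact matching procedure is not spelled out in the text.

```python
def boundary_prf(pred, gold, tol=0.02):
    """Precision/recall/F1 with a +/-20 ms tolerance (times in seconds).

    Each reference boundary can be matched by at most one prediction
    (greedy nearest-neighbor matching).
    """
    unmatched, hits = list(gold), 0
    for p in pred:
        match = min(unmatched, key=lambda g: abs(g - p), default=None)
        if match is not None and abs(match - p) <= tol:
            hits += 1
            unmatched.remove(match)
    precision = hits / max(len(pred), 1)
    recall = hits / max(len(gold), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```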
A drawback of the F1-score for boundary detection is its sensitivity to over-segmentation: a naive segmentation model that outputs a boundary every 40 ms may yield a high F1-score by achieving high recall at the cost of low precision. The authors in [31] proposed a more robust complementary metric, denoted the R-value:
$$\text{R-value} = 1 - \frac{|r_1| + |r_2|}{2}, \qquad r_1 = \sqrt{(1-R)^2 + OS^2}, \qquad r_2 = \frac{-OS + R - 1}{\sqrt{2}}, \quad (5)$$
where $OS$ is an over-segmentation measure, defined as $OS = R/P - 1$. Overall, performance is reported in terms of precision, recall, F1-score, and R-value.

In Table 1 we compare the proposed model against several unsupervised phoneme segmentation baselines: Hoang et al. [29], Michel et al. [10], and Wang et al. [28]. We also report results for state-of-the-art supervised algorithms in order to gauge the gap between unsupervised and supervised methods. As the unsupervised baselines did not report results on the Buckeye data set, and no pre-trained models are available, we optimized these models locally. For a fair comparison, we verified that the performance of the reproduced models is comparable to that originally reported on TIMIT. These results are marked with *.

Results suggest that the proposed model is superior to the baseline models over all metrics on both corpora. Notice that on the TIMIT benchmark the proposed model achieves results comparable to a supervised method based on a kernel SVM [24]. Additionally, as opposed to the reported unsupervised baselines, which are built using recurrent neural networks, our model is mainly composed of convolutional operations and hence can be parallelized over the temporal axis.
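Eq. (5) is easy to misread in flattened form, so here is a direct transcription as a small Python helper (precision and recall as fractions in [0, 1]; multiply the result by 100 to match the tables):

```python
import math


def r_value(precision, recall):
    """R-value of Eq. (5); precision and recall are fractions in [0, 1]."""
    os = recall / precision - 1.0  # over-segmentation measure OS
    r1 = math.sqrt((1.0 - recall) ** 2 + os ** 2)
    r2 = (-os + recall - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0


# r_value(0.8389, 0.8355) -> ~0.861, matching "Ours" on TIMIT in Table 1
# up to rounding of the published precision and recall.
```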
By not relying on manual annotations, SSL methods allow leveraging large unlabeled corpora for additional training data. In this sub-section we explore the effect of expanding the training set with additional examples from the Librispeech corpus [32]. We evaluate the model under the following schemes: (i) the training and test distributions match; (ii) the test distribution is different from the training distribution, but both are from the same language; and (iii) the test and training distributions are from different languages.
Figure 3: Precision, Recall, F1, and R-value as a function of data added from Librispeech. All models were trained on English training data (sub-figures (a) and (b) on TIMIT, sub-figures (c) and (d) on Buckeye) and evaluated on the Hebrew ((a), (c)) and German ((b), (d)) data sets.
Table 2: Analysis of model performance on the TIMIT and Buckeye test sets before and after augmenting the training sets with examples from Librispeech.

Training set   Test set   P      R      F1     R-val
TIMIT          TIMIT      83.89  83.55  83.71  86.02
TIMIT+         TIMIT
Buckeye        Buckeye
Buckeye+       Buckeye
Table 3: Analysis of our approach when evaluating the model on a test set that originates from a different distribution than that of the training set.

Training set   Test set   P      R      F1     R-val
TIMIT          Buckeye    67.48  73.71  70.41  73.10
TIMIT+         Buckeye
Buckeye        TIMIT
Buckeye+       TIMIT
In the following experiments, we denote by TIMIT+ and Buckeye+ the augmented versions of TIMIT and Buckeye, respectively. To better match recording conditions, we chose different partitions of Librispeech to augment TIMIT and Buckeye: for TIMIT+ we used the "train-clean-100" partition, while for Buckeye+ we used the "train-other-500" partition.
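Because training uses no annotations, building TIMIT+ amounts to concatenating raw-waveform datasets. A sketch follows, where load_timit_waveforms and load_librispeech_waveforms are hypothetical helpers returning lists of 16 kHz waveform tensors:

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class WaveformDataset(Dataset):
    """Thin wrapper exposing a list of raw waveforms as a Dataset."""
    def __init__(self, waveforms):
        self.waveforms = waveforms

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, i):
        return self.waveforms[i]


# load_timit_waveforms / load_librispeech_waveforms are hypothetical
# helpers; no transcriptions or boundary labels are ever loaded.
timit_plus = ConcatDataset([
    WaveformDataset(load_timit_waveforms("train")),
    WaveformDataset(load_librispeech_waveforms("train-clean-100")),
])
loader = DataLoader(timit_plus, batch_size=8, shuffle=True)
```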
In-domain test set
Results are summarized in Table 2. Surprisingly, the models trained on the augmented training sets showed only minor improvements over the original models trained on the TIMIT and Buckeye data sets. To better understand the effect of more training data on model performance, we explore the use of out-of-domain test sets in the following paragraphs.
Out-of-domain test set
We repeated the experiment from the previous paragraph, this time with a cross-dataset evaluation: we optimized a model on TIMIT and tested it on Buckeye, and vice versa. Results are summarized in Table 3. It can be seen that when the training set and the test set originate from the same distribution (Table 2), adding more data leads to minor improvements in model performance. However, when they come from mismatched distributions, as seen in Table 3, adding more data leads to an improvement in performance. For the model trained on TIMIT+ (tested on Buckeye), the R-value improved by 3.43 points. For the Buckeye data set, we observed a smaller increase in performance.
Multi-lingual evaluation
Finally, we analyzed the effect of more training data in the multi-lingual setup. To that end, we evaluated the proposed models trained on TIMIT, TIMIT+, Buckeye, and Buckeye+ (all English data) using two data sets from unseen languages. Specifically, we used a Hebrew data set [33] and the PHONDAT German data set [34] as test sets. Figure 3 presents the Precision, Recall, F1, and R-value for both data sets, with and without additional training data from Librispeech.

Results suggest that utilizing additional unlabeled data yields an increase in performance on unseen languages. For example, when evaluated on the German PHONDAT data set, the TIMIT+ model improved from an R-value of 55.34 to an R-value of 75.58, while on the Hebrew data set the Buckeye+ model improved from an R-value of 79.25 to an R-value of 82.63. Notice that the improvement from TIMIT+ is an order of magnitude larger than the Buckeye+ improvement. One possible explanation is that TIMIT is significantly smaller than Buckeye, and hence benefits more from additional data. These results highlight the importance of additional diverse data in cases where there is a mismatch between the training and test set languages. Moreover, they suggest that the representations obtained by the proposed model are not tightly coupled with language-specific features.
5. Discussion and future work
In this work we empirically demonstrated the efficiency of self-supervised methods, in terms of model performance, for the task of unsupervised phoneme boundary detection. Our model reached state-of-the-art results on both the TIMIT and Buckeye data sets under the unsupervised setting, and showed promising results in terms of closing the gap between unsupervised and supervised methods. Moreover, we empirically demonstrated that using diverse datasets and leveraging more training data produced models with better overall performance on out-of-domain data from Hebrew and German.

For future work, we will explore the semi-supervised setting, in which a limited amount of manually annotated data is provided. Additionally, we will explore the use of the proposed method on low-resource languages and under "in-the-wild" conditions. Lastly, we would like to explore the viability of such unsupervised segmentation methods in an unsupervised ASR pipeline.

6. References

[1] F. Kubala, T. Anastasakos, H. Jin, L. Nguyen, and R. Schwartz, "Transcribing radio news," in Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP), vol. 2. IEEE, 1996, pp. 598–601.
[2] D. Rybach, C. Gollan, R. Schlüter, and H. Ney, "Audio segmentation for speech recognition using segment features," in Proc. ICASSP. IEEE, 2009, pp. 4197–4200.
[3] C.-K. Yeh, J. Chen, C. Yu, and D. Yu, "Unsupervised speech recognition via segmental empirical output distribution matching," arXiv preprint arXiv:1812.09323, 2018.
[4] M. H. Moattar and M. M. Homayounpour, "A review on speaker diarization systems and approaches," Speech Communication, vol. 54, no. 10, pp. 1065–1103, 2012.
[5] J. Keshet, D. Grangier, and S. Bengio, "Discriminative keyword spotting," Speech Communication, vol. 51, no. 4, pp. 317–329, 2009.
[6] Y. Adi, J. Keshet, E. Cibelli, E. Gustafson, C. Clopper, and M. Goldrick, "Automatic measurement of vowel duration via structured prediction," The Journal of the Acoustical Society of America, vol. 140, no. 6, pp. 4517–4527, 2016.
[7] Y. Adi, J. Keshet, and M. Goldrick, "Vowel duration measurement using deep neural networks," in Proc. MLSP. IEEE, 2015, pp. 1–6.
[8] F. Kreuk, Y. Sheena, J. Keshet, and Y. Adi, "Phoneme boundary detection using learnable segmental features," arXiv preprint arXiv:2002.04992, 2020.
[9] J. Franke, M. Mueller, F. Hamlaoui, S. Stueker, and A. Waibel, "Phoneme boundary detection using deep bidirectional LSTMs," in Speech Communication; 12. ITG Symposium. VDE, 2016, pp. 1–5.
[10] P. Michel, O. Räsänen, R. Thiolliere, and E. Dupoux, "Blind phoneme segmentation with temporal prediction errors," arXiv preprint arXiv:1608.00508, 2016.
[11] O. Räsänen, "Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level," in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 36, no. 36, 2014.
[12] M. Goldrick, J. Keshet, E. Gustafson, J. Heller, and J. Needle, "Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production," Cognition, vol. 149, pp. 31–39, 2016.
[13] O. Räsänen, U. K. Laine, and T. Altosaar, "Blind segmentation of speech using non-linear filtering methods," Speech Technologies, pp. 105–124, 2011.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[16] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709, 2020.
[17] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[18] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[19] M. Gutmann and A. Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
[20] J. S. Garofolo, "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[21] M. A. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier, "Buckeye corpus of conversational speech (2nd release)," Columbus, OH: Department of Psychology, Ohio State University, 2007.
[22] J. Keshet, S. Shalev-Shwartz, Y. Singer, and D. Chazan, "Phoneme alignment based on discriminative learning," in Ninth European Conference on Speech Communication and Technology, 2005.
[23] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Interspeech, 2017, pp. 498–502.
[24] S. King and M. Hasegawa-Johnson, "Accurate speech segmentation by mimicking human auditory processing," in Proc. ICASSP. IEEE, 2013, pp. 8096–8100.
[25] S. Dusan and L. Rabiner, "On the relation between maximum spectral transition positions and phone boundaries," in Ninth International Conference on Spoken Language Processing, 2006.
[26] Y. P. Estevan, V. Wan, and O. Scharenborg, "Finding maximum margin segments in speech," in Proc. ICASSP, vol. 4. IEEE, 2007, pp. IV-937.
[27] G. Almpanidis and C. Kotropoulos, "Phonemic segmentation using the generalised gamma distribution and small sample Bayesian information criterion," Speech Communication, vol. 50, no. 1, pp. 38–55, 2008.
[28] Y.-H. Wang, C.-T. Chung, and H.-y. Lee, "Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries," arXiv preprint arXiv:1703.07588, 2017.
[29] D.-T. Hoang and H.-C. Wang, "Blind phone segmentation based on spectral change detection using Legendre polynomial approximation," The Journal of the Acoustical Society of America, vol. 137, no. 2, pp. 797–805, 2015.
[30] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
[31] O. J. Räsänen, U. K. Laine, and T. Altosaar, "An improved speech segmentation quality measure: the R-value," in Tenth Annual Conference of the International Speech Communication Association, 2009.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP. IEEE, 2015, pp. 5206–5210.
[33] A. Ben-Shalom, J. Keshet, D. Modan, and A. Laufer, "Automatic tools for analyzing spoken Hebrew."
[34] H. G. Tillmann and B. Pompino-Marschall, "Theoretical principles concerning segmentation, labelling strategies and levels of categorical annotation for spoken language database systems."