GIPFA: Generating IPA Pronunciation from Audio
Xavier Marjou
[email protected]

Abstract
Transcribing spoken audio samples into the International Phonetic Alphabet (IPA) has long been reserved for experts. In this study, we instead examined the use of an Artificial Neural Network (ANN) model to automatically extract the IPA pronunciation of a word from its audio pronunciation, hence its name: Generating IPA Pronunciation From Audio (GIPFA). Based on the French Wikimedia dictionary, we trained our model, which then correctly predicted 75% of the IPA pronunciations tested. Interestingly, by studying inference errors, the model made it possible to highlight likely errors in the dataset as well as to identify the closest phonemes in French.

Keywords Audio · Transcription · Phonemes · Artificial Neural Network · Dataset
1 Introduction

Some dictionaries like Wiktionary offer both the possibility to listen to words spoken by real users and to read pronunciations in the form of the International Phonetic Alphabet (IPA). However, in the case of the French Wiktionary, the IPA transcripts are subject to a small percentage of errors. Several reasons can explain these errors. First, Wiktionary contributors may not be IPA experts; second, even IPA experts sometimes make careless mistakes; third, the audio may be inconsistent because it is generally recorded independently, without taking the IPA pronunciation into account, which can lead to important discrepancies; fourth, some sounds like /o/ and /O/ are very close to each other, and their realization can depend on the speaker.

This article examines whether such errors could be avoided by using a Natural Language Processing (NLP) tool to automatically extract the IPA pronunciation from the audio pronunciation. To this purpose, we made use of Automatic Speech Recognition (ASR), which has already been the subject of in-depth studies. In particular, many recent implementations have successfully used a deep Artificial Neural Network (ANN), as in [1] and [2], hence our choice to design a new ANN called Generating IPA Pronunciation From Audio (GIPFA). In order to train and test it, we also assembled a new experimental dataset based on samples from the French Wiktionary.

Despite a dataset containing an unknown percentage of erroneous samples, our GIPFA model achieved reasonable accuracy. Although it cannot replace IPA experts, it nevertheless proved particularly useful in identifying the biggest errors in the dataset.
2 Method

In order to predict the IPA pronunciation of a word, two main steps were necessary: identifying a relevant dataset and designing an ANN model capable of inferring an IPA pronunciation from an audio pronunciation.

Word       Audio filename                         IPA pronunciation
bonjour    LL-Q150 (fra)-LoquaxFR-bonjour.wav     b˜OZuK

Table 1: Dataset
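Table 1 shows one sample of the dataset described below. As an illustration, the sketch below fetches the corresponding audio file from Wikimedia Commons, whose public file URLs follow an MD5-based path scheme; the helper name and the use of urllib are our own choices, not the paper's code.

```python
import hashlib
import urllib.parse
import urllib.request

def commons_url(filename: str) -> str:
    """Build the direct URL of a Wikimedia Commons media file.

    Commons stores files under a path derived from the MD5 hash of the
    filename, with spaces replaced by underscores.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return ("https://upload.wikimedia.org/wikipedia/commons/"
            f"{digest[0]}/{digest[:2]}/{urllib.parse.quote(name)}")

# Example with the sample of Table 1.
url = commons_url("LL-Q150 (fra)-LoquaxFR-bonjour.wav")
urllib.request.urlretrieve(url, "bonjour.wav")  # download the audio file
```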
2.1 Dataset

Our dataset came from a Wikimedia dump containing all pages and articles of the French Wiktionary. In this dump, each page generally contains three essential features: one word along with n main IPA pronunciations and m examples of audio pronunciations recorded by several speakers.

• A word is a text string containing Unicode characters. The word terminology has to be taken in the broad sense, as a Wiktionary word encompasses common names, proper names, abbreviations, numbers and even sayings. Although our ANN did not use it, we kept the word in our dataset for debugging purposes, in order to be able to find back the Wiktionary page containing the pronunciations.

• An audio pronunciation refers to an audio file, generally recorded in Waveform Audio File (WAV) format, containing the pronounced word. Wiktionary pages can contain one or more audio pronunciations for the same word. When an audio file is generated with the LinguaLibre (LL) software, it benefits from three useful features: the audio file is under a Creative Commons sharing license; the file can be fetched from Wikimedia Commons based on its audio filename; and the audio filename contains a label representing a user name, which can be used to identify the audio files generated by each user.

• An IPA pronunciation is a text string containing IPA symbols. For learning purposes, each audio pronunciation of a word should ideally be associated with a single IPA pronunciation transcribing this precise audio content; a ranking of the most common pronunciations might also be calculated and indicated in the page describing the word. However, most words have a single IPA pronunciation (i.e. n = 1) even when multiple audio pronunciations are available. Although some words have multiple IPA pronunciations (e.g. coût), a Wiktionary page rarely indicates which of these pronunciations corresponds to which audio file.

For our purpose, we restricted our dataset to samples containing:

• Words of the French Wiktionary;
• French words, given that each Wiktionary describes words of several languages;
• Words with a single IPA pronunciation, given that multiple IPA pronunciations per audio sample introduce ambiguities;
• IPA pronunciations containing only symbols belonging to the traditional French phonemes (i.e. 'i', 'e', 'E', 'a', 'A', 'O', 'o', 'u', 'y', 'ø', 'œ', '@', '˜E', '˜A', '˜O', '˜œ', 'j', 'w', 'ɥ', 'p', 'k', 't', 'b', 'd', 'g', 'f', 's', 'S', 'v', 'z', 'Z', 'l', 'K', 'm', 'n', 'ñ', 'N');
• IPA pronunciations below a maximum length in phonemes, in order to keep our ANN model reasonable in size with regard to our resources;
• Audio files recorded with LL, in order to easily fetch the audio files.

We also discarded 9 symbols that appear as optional in the IPA pronunciations of the French Wiktionary ('>', '.', '‿', '<', ''' and ':', '(', ')', '-').

The resulting dataset contained samples recorded by many different speakers. As depicted in Table 1, each sample contained three features: a word, an audio filename and an IPA pronunciation.

In addition, we pre-processed the WAV files to have a fixed length, and then converted them into the Mel-Frequency Cepstral Coefficients (MFCC) format so that they could serve as direct inputs to our model. Although processing audio files in WAV format would be possible, as in [3], it requires significant RAM, hence our choice to transpose them into MFCC format, as usually done in many studies such as [4] and [5].
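As an illustration of this pre-processing, the sketch below pads or trims a WAV file to a fixed duration and converts it to MFCC features. The use of librosa, the 16 kHz sampling rate and the 2-second duration are our own assumptions for illustration; only the 40 MFCC coefficients come from the paper (Table 2).

```python
import librosa
import numpy as np

SAMPLE_RATE = 16_000  # assumption: the actual rate is not specified in the paper
DURATION_S = 2.0      # assumption: the paper fixes a length, but its value was not recoverable
N_MFCC = 40           # from Table 2 (mfcc_coefficients)

def wav_to_mfcc(path: str) -> np.ndarray:
    """Load a WAV file, force it to a fixed length, and return its MFCC matrix."""
    signal, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    target_len = int(SAMPLE_RATE * DURATION_S)
    if len(signal) < target_len:      # pad short recordings with silence
        signal = np.pad(signal, (0, target_len - len(signal)))
    else:                             # trim long recordings
        signal = signal[:target_len]
    # Shape: (n_mfcc, n_frames); transpose if the model expects time-major input.
    return librosa.feature.mfcc(y=signal, sr=SAMPLE_RATE, n_mfcc=N_MFCC)

mfcc = wav_to_mfcc("bonjour.wav")
print(mfcc.shape)  # e.g. (40, 63) for 2 s of audio at the default hop length
```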
https://dumps.wikimedia.org/frwiktionary/20200501/
https://lingualibre.org
https://creativecommons.org/licenses/by-sa/4.0/
https://commons.wikimedia.org/
https://fr.wiktionary.org/

Figure 1: The GIPFA ANN model used for transcribing audio samples into IPA samples (Conv1D → ReLU → Conv1D → ReLU → LSTM → LSTM → Linear → IPA data).
2.2 ANN Model

We modeled our GIPFA ANN as depicted in Figure 1. It contains typical components found in many ANN models used for ASR. However, given that we only had to translate a single word per sample, we did not use any Transformer component [7]. Each audio input sample (MFCC data) first traversed a stack of two 1D convolution (Conv1D) layers to extract the shape of the MFCC data, followed by two Long Short-Term Memory (LSTM) layers [8] to extract temporal sequences, and finally by a linear layer in order to allow a Connectionist Temporal Classification (CTC) loss calculation [9]. We did not allow the succession of two identical phonemes because this is rare in French words. In addition, we used an AdamW optimizer [10] with a learning rate of 1 × 10⁻⁴.

2.3 Hyperparameters

We used Ray Tune [11] for fine-tuning our hyperparameters with respect to accuracy results. It led us to identify a set of best values from a larger set of experimented values, as summarized in Table 2. The resulting model contained several million trainable parameters. Slight variations in the best values did not lead to significant improvement. Although it is believed that a wider network might have led to better results [12], we limited the size of our model due to our limited computing resources.

Hyperparameter       Tested values    Best value
mfcc_coefficients    40               40
conv1d_activ         none, relu       relu
conv1d_layers        …                …
conv1d_units         …, 128           128
conv1d_bn            False, True      True
lstm_layers          …                …
lstm_units           …, 512           512
lstm_dropout         …                …
lstm_bidir           False, True      True
lstm_bn              False, True      True
optimizer            Adam, AdamW      AdamW
lr                   1e-3, 1e-4       1e-4

Table 2: GIPFA hyperparameter values
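To make Figure 1 and Table 2 concrete, below is a minimal PyTorch sketch of ours for the architecture, together with one CTC training step on random tensors. The kernel size, the dropout value and the omission of the batch-normalization layers (which Table 2 enables) are simplifications; the layer widths, the optimizer and the learning rate follow Table 2.

```python
import torch
import torch.nn as nn

N_PHONEMES = 37             # French phoneme inventory used in the paper
N_CLASSES = N_PHONEMES + 1  # +1 for the CTC blank symbol

class Gipfa(nn.Module):
    """Sketch of the Figure 1 architecture: Conv1D x2 -> LSTM x2 -> Linear."""

    def __init__(self, n_mfcc: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 128, kernel_size=3, padding=1),  # 128 units from Table 2
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(
            input_size=128, hidden_size=512, num_layers=2,      # 512 units from Table 2
            batch_first=True, bidirectional=True, dropout=0.1,  # dropout value is an assumption
        )
        self.linear = nn.Linear(2 * 512, N_CLASSES)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, time) -> log-probabilities (time, batch, classes) for CTC
        x = self.conv(mfcc)
        x, _ = self.lstm(x.transpose(1, 2))
        return self.linear(x).log_softmax(dim=-1).transpose(0, 1)

model = Gipfa()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # AdamW and 1e-4 from Table 2
ctc_loss = nn.CTCLoss(blank=0)

# One hypothetical training step on random tensors, for shape checking only.
mfcc = torch.randn(8, 40, 63)                   # batch of 8 fixed-length clips
targets = torch.randint(1, N_CLASSES, (8, 12))  # padded phoneme-index sequences
log_probs = model(mfcc)                         # (63, 8, 38)
input_lengths = torch.full((8,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((8,), 12, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
```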
2.4 Training and Testing

For the training step, we used 79 326 samples distributed over fixed-size training and evaluation batches. During a pre-processing step, all audio samples were standardized with the mean and standard deviation pre-observed on the dataset. Before each run, the data samples were randomly shuffled. Each training run took approximately 10 epochs of 3 minutes each on a single GPU (GeForce RTX 2080, 8 GB).

For the testing step, we used unseen samples to evaluate the performance of the GIPFA ANN.

2.5 Accuracy
Since solving the translation problem requires correct inference of the entire IPA pronunciation, we simply set, for each tested sample, an accuracy of 1 when our model predicted an IPA pronunciation equal to the tested target IPA pronunciation, and 0 otherwise. After each training run, we then calculated the average accuracy across all tested samples (i.e. an average accuracy between 0 and 1). We performed 11 runs (each with one training step and one test step) to allow reasonable confidence in the average accuracy results. We finally computed a mean accuracy and the associated standard deviation (std) over the 11 tests.

Since the dataset had not been studied before, there was unfortunately no baseline reference against which to challenge our results. To our knowledge, no study has examined the exactness and coherence of the audio files and IPA pronunciations of the French Wiktionary, meaning that the dataset may contain errors, making it difficult to assess whether a prediction error comes from the dataset or from the ANN. In order to obtain more in-depth information on errors, we therefore also calculated three other metrics on the samples in the dataset:

• At the word level:
  – Edit distance error: the Levenshtein distance [6] between the predicted IPA pronunciation and the target IPA pronunciation, in order to estimate how far the prediction was from the target.

• At the phoneme level:
  – Average phoneme accuracy: the percentage of correct translations for each phoneme;
  – Error pair percentage: since each of the 37 target phonemes can be incorrectly translated as one of the other 36 phonemes, the results can contain up to 37 × 36 categories of error pairs. To assess the representativeness of each pair, we calculated its number of occurrences divided by the total number of phonemic errors.

The code is available on GitHub (https://github.com/marxav/gipfa).

3 Results

In this section, we describe two different results: first, the accuracy of the model; then a more detailed observation of errors at the phoneme level and at the word level.
3.1 Pronunciation Accuracy

Training samples    Tested samples    Pronunciation accuracy (mean)    Pronunciation accuracy (std)
79 326                                75%

Table 3: Pronunciation accuracy

Table 3 presents the accuracy results, which were consistent across the 11 runs: our GIPFA ANN model successfully predicted around 75% of the IPA pronunciations of the tested audio samples. Correctly inferred pronunciations had a mean length of 7.51 phonemes, whereas incorrectly inferred pronunciations had a mean length of 8.65 phonemes, indicating a slightly higher probability of error as the length of the IPA pronunciation increases.
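As a companion to these measurements, the sketch below shows how a predicted pronunciation can be obtained from the model output by greedy CTC decoding, and how the exact-match accuracy above and the Levenshtein distance used later in Section 3.2.2 can be computed. This is a minimal sketch of ours, not the paper's evaluation code; the phoneme encoding and variable names are assumptions.

```python
import torch

PHONEMES = ["i", "e", "E", "a", "A", "O", "o", "u", "y", "ø", "œ", "@",
            "˜E", "˜A", "˜O", "˜œ", "j", "w", "ɥ", "p", "k", "t", "b", "d",
            "g", "f", "s", "S", "v", "z", "Z", "l", "K", "m", "n", "ñ", "N"]
BLANK = 0  # CTC blank index; phoneme i is encoded as i + 1

def greedy_ctc_decode(log_probs: torch.Tensor) -> list[str]:
    """Collapse repeats and drop blanks from a (time, classes) log-prob matrix."""
    best = log_probs.argmax(dim=-1).tolist()
    decoded, prev = [], BLANK
    for idx in best:
        if idx != BLANK and idx != prev:  # CTC rule: merge repeats, skip blanks
            decoded.append(PHONEMES[idx - 1])
        prev = idx
    return decoded

def levenshtein(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

# Exact-match accuracy over a batch of (prediction, target) pairs.
pairs = [(["b", "˜O", "Z", "u", "K"], ["b", "˜O", "Z", "u", "K"])]
accuracy = sum(p == t for p, t in pairs) / len(pairs)
# High Levenshtein distances flag samples whose audio and IPA may disagree.
distances = [levenshtein(p, t) for p, t in pairs]
```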
3.2 Error Analysis

Performing inferences on the samples of the dataset allowed us to better understand the reasons for the errors.

3.2.1 Phoneme Accuracy

Table 4 reports the translation accuracy of each phoneme. One phoneme (/A/) had poor accuracy, five phonemes (/o/, /N/, /˜œ/, /ñ/ and /œ/) had moderate accuracy, while the remaining thirty-one phonemes had high accuracy.

Target     Correct        Incorrect      Average
phoneme    translations   translations   accuracy
A          392            605            0.39
o                         2485
N          40             17             0.70
˜œ         241            89             0.73
ñ          697            110            0.86
œ
E          15 859         1472           0.92
@
g
ø
O          18 655         1074           0.95
e          30 018         1608           0.95
w
v
u
˜E
j          12 567         547            0.96
b          12 753         434            0.97
n          13 165         472            0.97
p          14 845         464            0.97
l          23 181         684            0.97
˜A         13 704         226            0.98
f
y
z
i          34 772         664            0.98
d          15 975         323            0.98
k          23 159         503            0.98
S
a          44 575         707            0.98
m          17 334         313            0.98
K          47 221         799            0.98
Z
t          29 691         713            0.98
˜O
s          30 018         400            0.99

Table 4: Average accuracy of each phoneme

To better observe the details, we also plotted these phoneme translation errors in a confusion matrix, shown in Figure 2. Each row of the matrix represents a target phoneme, while each column represents the distribution of predicted phonemes. For instance, the target phoneme /E/ was occasionally predicted as /e/ or, more rarely, as /a/. Notable outliers were four large values outside the diagonal: a large share of /A/ was poorly predicted as /a/, of /o/ as /O/, of /˜œ/ as /˜E/, and of /N/ as /g/. It turned out that, like humans, the ANN had difficulties differentiating near elementary sounds.

Figure 2: Confusion matrix

Table 5 represents the proportion of the errors associated with each phoneme pair compared with the total errors over all pairs of phonemes. Interestingly, only three pairs of phonemes accounted for a large share of all errors: (/o/, /O/), (/e/, /E/) and (/a/, /A/).

Target phoneme    Predicted phoneme
o                 O
e                 E
E                 e
A                 a
O                 o
t                 d
E                 a
a                 A

Table 5: Most encountered error pairs, ranked by their share of all phonemic errors
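The confusion matrix and the error-pair ranking above can be derived from aligned (target, predicted) phoneme pairs. The sketch below is ours, assuming such pairs have already been extracted from the edit-distance alignment; it is not the paper's code.

```python
from collections import Counter

# Hypothetical list of (target, predicted) phoneme pairs gathered over the
# test runs, e.g. produced by aligning predictions with targets.
pairs = [("o", "O"), ("o", "o"), ("E", "e"), ("a", "A"), ("o", "O")]

confusion = Counter(pairs)  # rows: target phoneme, columns: predicted phoneme
errors = Counter(p for p in pairs if p[0] != p[1])
total_errors = sum(errors.values())

# Per-phoneme average accuracy (Table 4): correct / (correct + incorrect).
targets = {t for t, _ in pairs}
accuracy = {
    t: confusion[(t, t)] / sum(n for (tt, _), n in confusion.items() if tt == t)
    for t in targets
}

# Error-pair percentage (Table 5): share of each (target, predicted) error
# among all phonemic errors.
for (t, p), n in errors.most_common(8):
    print(f"{t} -> {p}: {100 * n / total_errors:.1f}% of all errors")
```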
3.2.2 Levenshtein Distance

Table 6: Mean and standard deviation of the Levenshtein distance over the computed samples

Table 6 reports a small mean Levenshtein distance, which gives assurance that there is strong consistency between the audio content and the IPA pronunciation for the samples of the dataset studied.

Word                      IPA Target           IPA Prediction         Levenshtein distance
1337                      /lit/                /mitas˜AtK˜AmzOt/      13
agent innervant           /aZ˜AinEKv˜A/        /go/                   11
brut de décoffrage        /bKytd@dekOfKaZ/     /sbOKdedtOK/           10
Michel                    /miSEl/              /st˜Ed@s˜AmSEl/        10
phalange proximale        /fal˜AZpKOksimal/    /fal˜AZ/               9
analyse calorimétrique    /analOgSimik/        /analiskalOKimetik/    9
àtha                      /at˜On ˜œblavi/      /ata/                  9
Wikitionnaire             /gazaefEd@sfEK/      /gOZifisølEK/          9
arrondir par défaut       /aK˜OdiKpaKdefo/     /aK˜AdiK/              8
Luxembourg                /lyks˜AbuK/          /yseKzOnb/             8

Table 7: Top-10 pronunciations with the highest Levenshtein distance

However, Table 7 focuses on the most extreme outliers by reporting the 10 samples with the highest Levenshtein distance. Upon investigation, it was found that all these samples contained either an error in the audio sample (e.g. a wrong word spoken, or no word spoken at all) or an error in the target IPA pronunciation, which means that all these errors were in the dataset itself. These results therefore suggest that data samples whose pronunciations have a high Levenshtein distance probably contain an error. Additional work would be required to identify the best threshold distance for flagging possible errors in the dataset.

4 Conclusion

Previous work has documented the effectiveness of ANN models for ASR. However, most studies have focused on the direct translation of audio samples into words. In this study, we focused instead on the translation of audio samples into phonemes. We first proposed an ANN predicting with 75% accuracy the French pronunciations of the French Wiktionary. Since, to our knowledge, no existing work has been done on this specific task and dataset, there was no basis for comparison, nor any assurance as to the accuracy and consistency of the data.

We have shown that the translation of certain phonemes was more problematic, since some phonemes are close elementary sounds (/o/ and /O/, /E/ and /e/, /A/ and /a/) and are thus difficult to distinguish. Future work may consider carefully checking the audio samples and IPA pronunciations containing these close phonemes, which would in turn enhance the efficiency of the ANN. In addition, future work could also involve synthesizing audio examples and using them as additional samples to reinforce the training data.

However, we have also shown that the Levenshtein distance between the GIPFA prediction and the target (as it exists in the dataset and therefore in the Wiktionary) can highlight the most suspect samples in the dataset. These results therefore suggest that our GIPFA ANN would be a valuable tool to help verify the consistency of the Wiktionary regarding pronunciation. Integrating it into a tool like LL could therefore be useful in order to suggest an IPA transcription. It could even be used to suggest an IPA transcription associated with each recorded audio sample, since having one IPA transcription per audio file should further improve the performance of the ANN.

Finally, we believe this method should be applicable to other languages, provided that a sufficient number of training samples is available.

Acknowledgements
We thank all Wiktionary and LinguaLibre contributors for their contributions, without which there would be no wonderful free dictionary and no free dataset either.

References

[1] Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191, 2020.
[2] Amit Das, Jinyu Li, Guoli Ye, Rui Zhao, and Yifan Gong. Advancing acoustic-to-word CTC model with attention and mixed-units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):1880–1892, 2019.
[3] Tara Sainath, Ron J. Weiss, Kevin Wilson, Andrew W. Senior, and Oriol Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, 2015.
[4] Noelia Alcaraz Meseguer. Speech analysis for automatic speech recognition. Master's thesis, Institutt for elektronikk og telekommunikasjon, 2009.
[5] Md Mahadi Hasan Nahid, Bishwajit Purkaystha, and Md Saiful Islam. Bengali speech recognition: A double layered LSTM-RNN approach. In 20th International Conference of Computer and Information Technology (ICCIT), pages 1–6. IEEE, 2017.
[6] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Reports of the USSR Academy of Sciences, 163(4):845–848, 1965.
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[9] Alex Graves. Connectionist temporal classification. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 61–93. Springer, 2012.
[10] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2017.
[11] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.
[12] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.