A Generative Model of a Pronunciation Lexicon for Hindi
Pramod Pandey, Somnath Roy
Centre for Linguistics, Jawaharlal Nehru University, Delhi
[email protected], [email protected]
Abstract
Voice browser applications in Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems crucially depend on a pronunciation lexicon. The present paper describes the model of a pronunciation lexicon of Hindi developed to automatically generate the output forms of Hindi at two levels, the <phoneme> and the <PS> (PS, short for Prosodic Structure). The latter level involves both syllable division and stress placement. The paper describes the tool developed for generating the two-level outputs of lexica in Hindi.
Index Terms: Hindi PLS, Prosodic Structure, Stress
1. Introduction
Hindi demonstrates considerable regularity in mapping the orthographic representation to pronunciation and is thus very suitable for automatic generation of pronunciation by a programme seeded with hand-written rules. In this paper we describe the model developed to automatically generate output forms at two levels, Phonemic and Prosodic Structure (PS), from the input forms in the Devnagari script in which Hindi is written.
The paper is organized as follows. Section 2 presents the salient points of a pronunciation lexicon relevant to the present work. Section 3 describes the pronunciation lexicon specifications for Hindi. Section 4 briefly describes the Devanagari script and the Hindi writing system. Section 5 gives an account of the rule-based generative model for the generation of Hindi pronunciation lexica being discussed in the present paper. Section 6 reports the results of the evaluation of the programmes and Section 7 presents the conclusions.
2. Pronunciation Lexicon
In order for the information on pronunciation in a given language to be easily accessible for voice browser applications, the World Wide Web Consortium (W3C) Voice Browser Activity Lexicon Standards has recently proposed an advanced set of standards for pronunciation lexicons in the publication entitled Pronunciation Lexicon Specification (PLS) Version 1.0. The specifications of the features of a pronunciation lexicon proposed in the W3C proposals are exhaustive [1]. They can be summed up as follows:
(1)
• PLS is intended to be the standard format of the documents referenced by the <lexicon> element of SSML. The <lexicon> element references an external document that is loaded by the SSML document. There can be more than one PLS document.
• A pronunciation lexicon is referenced by an SRGS grammar in an ASR processor. It can allow multiple pronunciations of a word in the grammar to represent different speaking styles.
• A pronunciation is specified in terms of the "alphabet" attribute of a PLS document. The main valid value of the "alphabet" attribute is "ipa"; vendor-defined strings of the form "x-organization" or "x-organization-alphabet" are also allowed, as for example "x-SAMPA".
• As a PLS processor must support "ipa" as the value of the "alphabet" attribute, it must support the Unicode representations of the phonetic characters of IPA. These include a set of vowel and consonant symbols, a syllable delimiter, diacritics, and symbols for the prosodic features of stress, tone and intonation.
• A word in the PLS markup language is described by the following elements. The details of the PLS markup language can be found in [1].
  – The first element is <lexicon>. This element has several attributes, namely (i) version, (ii) xml:base, (iii) xmlns, (iv) xml:lang and (v) alphabet.
  – The second element is <meta> or <metadata>. The meta element is optional and is specified by attributes such as name, http-equiv and content.
  – Other necessary elements besides <lexicon> are <lexeme>, <grapheme> and <phoneme>.
  – Other optional elements are <alias> and <example>.
Footnotes:
A tool for letter-to-sound conversion for Hindi under GNU license is available at the open source link "https://sourceforge.net/projects/pls-for-indic-languages/" and can be downloaded from this link for Windows and Linux machines.
The term 'phonemic' is being used here in the broad sense in which it has been used in W3C recommendations.
The term 'Letter-to-Sound' (LTS) rules or 'Grapheme-to-Phoneme' (G2P) rules refers to rules that change orthography into pronunciation. For the suitability of the term to the Devnagari script, in which Hindi is written, a more specific term, 'Akshara-to-Sound' (ATS) rules, in place of LTS or G2P rules, has recently been recommended and has come to be in use [2, 3]. Devnagari is a prominent representative of alpha-syllabic scripts derived from Brahmi and used for major Indic languages as well as languages outside India (e.g. Tibetan, Hankul, etc.) [4].
The linguistic level of sounds to which the Devnagari script in Hindi relates is both phonemic and phonetic.
The main features of A/LTS rules are discussed below. For languages with irregular relations between the orthographic and pronunciation levels, for instance English, LTS rules can be generated automatically following a data-driven approach. For languages with regular relations between the orthographic and pronunciation levels, e.g. Hindi and Bangla, A/LTS rules can be written by hand.
Some speech synthesis systems such as FESTIVAL [5] have provision for both data-driven and hand-written rules. The basic form of a hand-written rule is the following (see http://festvox.org/bsv/x1429.html):
(2) ( LC [ alpha ] RC => beta )
where LC is "left context on zero or more input symbols" and RC is "right context on zero or more input symbols".
Writing rules by hand can be very difficult, as it may require a full grasp of the factors involved in determining the pronunciations of words and longer utterances. The following points have to be taken into consideration in writing hand-written A/LTS rules.
(3)
• A/LTS rules may refer to a word boundary but cannot refer to previous or following words.
• Attention must be given to avoiding conflicting rules.
• The following factors lend complexity to A/LTS rules [6]:
  – POS
  – Morphological boundary
  – Stress marks
  – Exceptions due to historical change
  – Borrowings with different A/LTS rules than regular forms
  – The length of LC
  – The length of RC
The A/LTS rules for Hindi, as we shall see below, take these factors into account and give a fairly efficient programme for converting written Hindi words into IPA representations.
For an overview of earlier studies on Letter-to-Sound rules in general as well as for Indic languages, see [3]. It should be pointed out here, however, that attempts at Letter-to-Sound rules for Hindi have been few and far between and have focused on individual aspects, but not all aspects, of the pronunciation of words. For example, [7, 8] present comprehensive accounts of the Schwa Deletion process, [9] presents A/LTS rules for segments in Urdu, and [10] proposes the use of the syllable for text-processing in a TTS system for Indian languages. In the model of PLS described here, all the aspects of segmental and prosodic features are included in a single program for a pronunciation lexicon of Hindi.
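The rule format in (2) can be illustrated with a minimal sketch. It assumes that contexts are matched as exact symbol sequences, which is much simpler than FESTIVAL's own pattern syntax. The function name `apply_rules`, the `#` boundary symbol and the sample rule are assumptions introduced here for illustration; they are not part of FESTIVAL or of the tool described in this paper.

```python
from typing import List, Sequence, Tuple

# A rule is (LC, alpha, RC, beta): rewrite alpha as beta between LC and RC.
Rule = Tuple[Sequence[str], Sequence[str], Sequence[str], Sequence[str]]

BOUNDARY = "#"  # hypothetical word-boundary symbol


def apply_rules(symbols: Sequence[str], rules: Sequence[Rule]) -> List[str]:
    """Scan left to right; at each position apply the first matching rule.

    Contexts are checked against the input sequence, not against the
    output produced so far.
    """
    padded = [BOUNDARY, *symbols, BOUNDARY]
    end = len(padded) - 1  # index of the trailing boundary
    out: List[str] = []
    i = 1  # skip the leading boundary
    while i < end:
        for lc, alpha, rc, beta in rules:
            n = len(alpha)
            if n == 0 or i + n > end:
                continue
            if list(padded[i:i + n]) != list(alpha):
                continue
            if i - len(lc) < 0 or list(padded[i - len(lc):i]) != list(lc):
                continue
            if list(padded[i + n:i + n + len(rc)]) != list(rc):
                continue
            out.extend(beta)
            i += n
            break
        else:
            # No rule matched: copy the symbol unchanged.
            out.append(padded[i])
            i += 1
    return out


# Illustrative rule: delete a schwa immediately before the word boundary.
FINAL_SCHWA_DELETION: Rule = ([], ["ə"], [BOUNDARY], [])

result = apply_rules(["k", "ə", "m", "ə", "l", "ə"], [FINAL_SCHWA_DELETION])
```

With this single rule, only the schwa adjacent to the word boundary is removed; word-internal schwas are left for later rules to handle.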
3. Pronunciation Lexicon Specifications for Hindi
The features included for the pronunciation lexica of Hindi are the following:
(4)
• Letter-to-sound conversion using IPA.
• Marking of syllable structures and labeling them as prominent and non-prominent for further TTS and ASR applications.
• Preparation of the systematic variations in the pronunciations of words in Standard Hindi in two lexica, for the Standard Formal and Standard Colloquial varieties.
• The surface pronunciations of words with stress marks.
The features mentioned above have been taken into account with multiple uses in mind: for learners of Hindi, for TTS and ASR applications, and for users of screen-reading software in Hindi, among others. Two levels of pronunciation are produced: the level of segmental IPA symbols with prosodic structure in terms of syllable division and prominence (<PS>), and the level of segmental IPA symbols with stress marks (<phoneme>), as illustrated in Figures 1 & 2 in the form of screenshots of the input-output generation of two words.
Figure 1: Screenshots of the input-output generation of two words
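As an illustration of how a <phoneme>-level entry might be serialized, the following sketch builds a small PLS 1.0 document with Python's standard `xml.etree.ElementTree`. The function names and the sample entry are assumptions made for illustration and do not reproduce the output format of the tool described here. The sketch emits only the standard <lexicon>, <lexeme>, <grapheme> and <phoneme> elements; it does not attempt to represent the <PS> level, for which PLS 1.0 defines no element.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_pls(entries, lang="hi"):
    """Build a <lexicon> element from (grapheme, [ipa, ...]) pairs."""
    # Serialize the PLS namespace as the default namespace.
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {
            "version": "1.0",
            "alphabet": "ipa",
            f"{{{XML_NS}}}lang": lang,
        },
    )
    for grapheme, pronunciations in entries:
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        # Several <phoneme> children can carry alternative pronunciations.
        for ipa in pronunciations:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ipa
    return lexicon


def to_xml(lexicon):
    """Return the lexicon as a Unicode XML string."""
    return ET.tostring(lexicon, encoding="unicode")


# Hypothetical sample entry (grapheme and IPA string chosen for illustration).
sample = build_pls([("कमल", ["kəməl"])])
xml_text = to_xml(sample)
```

Listing several IPA strings for one grapheme is one way the Standard Formal and Standard Colloquial variants could sit side by side in a single lexeme, although the paper keeps them in two separate lexica.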
4. Hindi Writing and Devnagari Script
The official script of Hindi is Devnagari or Devanagari. It belongs to a class of writing systems known as alpha-syllabic, that is, a combination of alphabetic and syllabic systems. Writing systems are alphabetic if their units stand for phonemic/phonetic vowels and consonants (e.g. English, German). They are syllabic if each orthographic unit in them represents a syllable (e.g. Hankul for Korean) [3]. Hindi writing encapsulates the linguistic awareness of its speakers at multiple levels, phonetic, phonological [4, 6, 11] and morphological [12, 13], which lends it the character of a mix of surface and deep orthographic systems, rather than a simple surface orthographic system (see [14] for a discussion of the types of orthographic systems).
The main features of an alpha-syllabic script are the following:
(5)
• The consonant letter has an inherent vowel, which is normally the mid central vowel /ə/ or its lower counterpart / /.
• The inherent vowel can be deleted and the consequent form is represented with a subscript diacritic known as halant. The inherent vowel is assumed to give way to another vowel when added with a diacritic known as matra (mora).
• When not preceded by a consonant, a vowel occurs alone with a full syllabic and vocalic value, e.g. आई /aːiː/ 'come-PST-FEM-SING'.
• When preceded by a consonant, the vowel is represented as a diacritic (a subscript, a superscript or a lateral script alongside a superscript), and is known as a matra, as shown in Table 1:

| Vowel Grapheme | IPA Symbol | Unicode Value | With Onset Consonants | Consonant shapes formed (e.g. with the consonant letter k) |
|---|---|---|---|---|
| अ | ə | U+0905 | Nil | क् + अ = क |
| आ | aː | U+0906 | ा | क् + आ = का |
| इ | i | U+0907 | ि | क् + इ = कि |
| ई | iː | U+0908 | ी | क् + ई = की |

Table 1: Examples of four Hindi vowels in isolation and following onsets

• Two or three consonant clusters form ligatures, with one of the consonants (usually the last, except when it is /r/) occurring full but the others occurring half (e.g. त्स /ts/, स्प्ल /spl/) or occasionally forming a new grapheme, e.g. त्र (from त् + र) /t+r/.
• In addition, the superscript "ँ", called chandrabindu, is used for nasal vowels. Another superscript, "ं", a bindu (translated as point), has two values: (a) a homorganic nasal consonant, and (b) a nasal vowel. The use of the bindu for both nasal consonants and nasal vowels has lent indeterminacy to the orthography-phonology conversion. The orthography requires lexical knowledge of its use.
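The akshara conventions listed in (5) can be sketched as a first-pass grapheme-to-IPA mapping. This is a minimal illustration under stated assumptions: the letter inventory is a small hypothetical subset, conjuncts are handled only through an explicit halant in the Unicode string, and the nasal diacritics are not covered. It is not the implementation of the tool described in this paper.

```python
# Hypothetical subset of the Devanagari inventory, for illustration only.
CONSONANTS = {
    "क": "k", "म": "m", "ल": "l", "त": "t",
    "स": "s", "र": "r", "न": "n", "य": "j",
}
INDEPENDENT_VOWELS = {
    "\u0905": "ə",   # a
    "\u0906": "aː",  # aa
    "\u0907": "i",   # i
    "\u0908": "iː",  # ii
}
MATRAS = {
    "\u093e": "aː",  # vowel sign aa
    "\u093f": "i",   # vowel sign i (stored after the consonant in Unicode)
    "\u0940": "iː",  # vowel sign ii
}
HALANT = "\u094d"
INHERENT_VOWEL = "ə"


def akshara_to_ipa(word: str) -> str:
    """Map a Devanagari string to IPA before any deletion rules apply."""
    out = []
    for idx, ch in enumerate(word):
        nxt = word[idx + 1] if idx + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # The inherent vowel surfaces unless a matra or halant follows.
            if nxt not in MATRAS and nxt != HALANT:
                out.append(INHERENT_VOWEL)
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch == HALANT:
            continue  # the halant only suppresses the inherent vowel
        else:
            raise ValueError(f"character not in the sketch inventory: {ch!r}")
    return "".join(out)
```

The output keeps every inherent vowel, including the word-final one, so that later rules such as schwa deletion can operate on a uniform underlying form.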
5. Essential background to the Model
There are two related lexica for two styles of Standard Hindi being described here: Standard Formal Hindi (SFH) and Standard Colloquial Hindi (SCH). They are, however, in a subset relation. SCH is a superset of SFH. SCH contains mainly two points of difference in segmental realization of words from SFH, as stated in (6) below.
(6)
• SFH maintains the vowel length distinction between /i iː/ and /u uː/ throughout the word; SCH neutralizes it word-finally, e.g.

| SFH | SCH | Gloss |
|---|---|---|
| əˈtitʰi | ˈətitʰiː | guest |
| səˈmaːdʰi | səˈmaːdʰiː | trance/mausoleum |

• Word-internal [ə] is in general deleted under certain circumstances in SCH (discussed at length in [15, 8, 7]), but is retained in SFH, e.g.

| SFH | SCH | Gloss |
|---|---|---|
| ˈkəməlaː | ˈkəmlaː | 'a name' |
| ˈkitəniː | ˈkitniː | 'how much-FEM' |

The central focus of the present programme was the generation rather than the listing of the entries of lexemes in the lexicon. In order to achieve this end a well-ordered set of rules was formulated in three separate steps.
The first set of rules was of correspondence between graphemes and their IPA equivalents. All the graphemic shapes were taken into account, as single consonant and vowel letters and as CV, CCV and CCCV aksharas, e.g. अ /ə/, की /kiː/, क्या /kjaː/, स्त्री /striː/.
The second set of rules concerned the changes in the grapheme-phoneme correspondence on account of segmental processes. These included (a) final vowel lengthening, whereby all short vowel graphemes had to correspond to their long vowel counterparts, e.g. अतिथि /ətitʰi/ → /ətitʰiː/, and (b) word-final schwa deletion, e.g. कमल /kəmələ/ → /kəməl/.
The third set of rules consisted of prosodic structure rules. These were mainly rules that involved division of words into syllables and labeling of syllables. The labeling was in two stages. In the first stage, labeling was based on three degrees of weight [16, 17, 18], namely Light (CV), Heavy (CVV/CVC) and Superheavy (CVCC/CVVC), which were labeled as σw, σs and σs', respectively. The second stage, the relabeling stage, was consequent on the following phenomena: (a) word-internal schwa deletion [19], which targeted /ə/ in σw syllables, e.g. /kə.mə.laː/ → /kəm.laː/; (b) demotion or downgrading of σs syllables into σw syllables on account of stress clash, e.g. /σs aː σs' saːn/ → /σw aː σs' saːn/ 'easy'; and (c) promotion or upgrading of σw syllables to σs in disyllabic words with Light first syllables, as in /ˈkəlaː/. There is one more metrical feature that was incorporated in the prosodic structure of words, namely Extrametricality [18] of final heavy syllables. The final extrametrical <σs> syllables were ignored for stress. All other σs and σs' were converted into stress at the <phoneme> level. The significance of Extrametricality of the final heavy syllable is that in the speech synthesis of longer utterances, the word-final heavy syllable can be optionally de-extrametricalized and stressed, e.g. [ˈkitəniː] > [ˈkitniː] 'how much-FEM' (focused).
An earlier programme of PLS took into account the phenomenon of consonant gemination as part of the second set of rules. Consonants in Hindi are geminated or lengthened when preceding /r, l, ʋ, j/. The rule can be stated as (C + r/l/ʋ/j → C1C + r/l/ʋ/j), e.g. /mætriː/ → [mættriː] 'friendship' and /aːʋəʃjək/ → [aːʋəʃʃjək] 'necessary'. This feature is underspecified in the orthography [11]. It has been left out of the final version of the model on the grounds of its suitability to TTS and ASR systems. We find that the duration of geminates /tt ll/, etc. is not double that of singlets, but less. For example, it was found that the duration of singlet /t/ and /l/ was 0.96 and 0.91 respectively, and of the geminates 1.10 and 1.06 ms (that is, only about 1.15 to 1.16 times the singlet duration), in doublets such as /pətaː/ 'address' and /pəttaː/ 'leaf', and /bəlaː/ 'trouble' and /bəllaː/ 'bat (N)'. In such a case the duration of the geminate consonants can be roughly predicted by rule for the given contexts. Besides, the programme was simpler without the rule of gemination.
The full derivations of four forms, कसरत /kəsərətə/ 'exercise', मचलती /mətʃələtiː/ 'prancing', प्रकृति /prəkriti/ 'nature' and सालाना /saːlaːnaː/ 'yearly', by applying the A/LTS rules of Hindi are given in Table 2. Gr stands for Grapheme.
For efficient automatic generation of prosodic structure and IPA symbols from the orthographic forms of Hindi words by means of hand-written rules, it was necessary to include in the model an Exceptions component, which is fed with the output of given words that are not in conformity with the rest of the outputs (see [20] for a similar suggestion for Tamil). These are the following in the main:
(7)
a. Native vocabulary
i. Limited sets of forms with some prefixes and compounds without internal word-boundaries. For example, /su-səməj/ 'good time' and /ku-səməj/ 'bad time', in which the schwa in the first syllable will be wrongly deleted, giving *[ˈsusməj] and *[ˈkusməj] in place of [ˈsusəməj] and [ˈkusəməj].
ii. Non-honorific Imperative forms of verbs ending in the suffix /-aː/. For example, /diˈkʰaː/ 'show-CAUS+IMP+NH' versus /ˈdikʰaː/ 'was seen'.
iii. Complex words with an O-O correspondence relation with simple words. For example, the plural forms of the feminine nouns /mətʃʰəliː/ 'fish' and /titəliː/ 'butterfly' are /mətʃʰəlijãː/ 'fish-PL' and /titəlijãː/ 'butterfly-PL'. The output forms with stress following the regular A/LTS rules would be *[məˈtʃʰəlijãː] and *[tiˈtəlijãː]. The preferred forms are [ˈmətʃʰlijãː] and [ˈtitlijãː] instead. The latter are placed in the Exceptions component.
iv. The use of the diacritics bindu and chandrabindu in Hindi orthography for two phenomena, vowel nasalization and nasal consonant. Although they are in complementary distribution in general, with the chandrabindu standing for a nasal vowel and the bindu standing for a nasal consonant, as in फँसा /pʰə̃saː/ 'trapped (PASS)' and फंदा /pʰəndaː/ 'noose', in actual use there is a lot of free variation.
b. Borrowed English vocabulary: words borrowed from English containing short vowels but a non-distinctive diacritic (matra) for the vowels [e eː], e.g. कॉलेज 'college'. The rule-based form here would be *[ˈkɔleːdʒ]. The regular pronounced form is [ˈkɔledʒ].
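The first labeling stage of the prosodic structure rules, which assigns Light, Heavy and Superheavy weights, can be sketched as follows. The sketch assumes that syllables are already divided and given as lists of IPA segments, that VV stands for a long vowel marked with "ː", and that anything heavier than CVCC/CVVC is also treated as Superheavy. The vowel inventory and the function names are hypothetical, and the relabeling, extrametricality and exception steps are not modelled.

```python
from typing import List, Sequence

# Hypothetical vowel inventory for the sketch.
SHORT_VOWELS = {"ə", "a", "i", "u", "e", "o", "ɛ", "ɔ"}
LENGTH_MARK = "ː"
LABELS = {1: "σw", 2: "σs"}  # anything heavier is labeled σs'


def syllable_weight(syllable: Sequence[str]) -> str:
    """Label one syllable as Light (σw), Heavy (σs) or Superheavy (σs')."""
    for pos, seg in enumerate(syllable):
        if seg.rstrip(LENGTH_MARK) in SHORT_VOWELS:
            break
    else:
        raise ValueError(f"no vowel found in syllable {list(syllable)!r}")
    # A short vowel counts as one unit, a long vowel as two,
    # and every consonant after the vowel adds one more.
    units = 2 if syllable[pos].endswith(LENGTH_MARK) else 1
    units += len(syllable) - pos - 1
    return LABELS.get(units, "σs'")


def label_word(syllables: Sequence[Sequence[str]]) -> List[str]:
    """Apply the first-stage weight labeling to every syllable of a word."""
    return [syllable_weight(s) for s in syllables]


kasrat = label_word([["k", "ə"], ["s", "ə"], ["r", "ə", "t"]])
aasaan = label_word([["aː"], ["s", "aː", "n"]])
```

Counting units in the rhyme keeps the three-way distinction in one place, so that a later relabeling stage only needs to rewrite labels rather than recompute weights.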
6. Evaluation
The PLS programme for Standard Colloquial Hindi was taken for evaluation, as it contained more features for evaluation than the programme for Standard Formal Hindi, as discussed in Section 5. The data used for evaluation came from a BBC word corpus containing 28731 words. An expert in the pronunciation of the words first annotated the words, and the output of the machine was then compared against the annotated list. The following factors were taken into account in evaluating the model:
a. Only the regular features of pronunciation were considered.
b. A transcription was considered correct if it contained one of the main features of akshara-to-sound correspondence in the expected context, as described.
c. In evaluating the model for the regular features of pronunciation, the factors relating to exceptions had to be excluded.
Out of the total of 28731 words tested, the number of words with the diacritics bindu and chandrabindu for the realization of nasal consonants and nasal vowels was 3705, and the number with deletable schwas was 14460. For stress marking as well as for the akshara-to-phoneme correspondence, the entire corpus of 28731 words was considered relevant. For the marking of stress, although only polysyllabic words were relevant, we had to ensure that monosyllabic words were not marked redundantly for stress. Thus the entire corpus of 28731 words was relevant for stress. The result of the analysis of errors is presented in Table 3.
7. Conclusion
Developing a programme for generating a pronunciation lexicon for standard Hindi involved coming to terms with several issues described above. The programme began with generating initial grapheme-to-phoneme correspondences. However, in an attempt to automatically generate the final output at the <phoneme> level, it was found that the mixed nature of the Hindi orthography, with both surface and deep correspondences, made it necessary to automatically generate the internal word-prosodic structure on which the main aspects of segmental realization of forms depended. The present model of a pronunciation lexicon for Hindi can be easily applied to other akshara-based orthographic systems with mixed-level orthographies. In addition, the contributions extend beyond the scope of a pronunciation lexicon and have general applicability for automating syllable-based processes and stress patterns in South Asian languages.
| Rule Applied | Gr-1: कसरत, Output-1 | Gr-2: मचलती, Output-2 | Gr-3: प्रकृति, Output-3 | Gr-4: सालाना, Output-4 |
|---|---|---|---|---|
| ATS Rules-1,2,3: Akshara-IPA Correspondence rules | /kəsərətə/ | /mətʃələtiː/ | /prəkriti/ | /saːlaːnaː/ |
| ATS Rule-4: Final Schwa Deletion | [kəsərət] | DNA | DNA | DNA |
| ATS Rule-5: Final Vowel Lengthening | DNA | DNA | [prəkritiː] | DNA |
| ATS Rule-6: Syllabification | [σ kə σ sə σ rət] | [σ mə σ tʃə σ lə σ tiː] | [σ prə σ kri σ tiː] | [σ saː σ laː σ naː] |
| ATS Rule-7: Syllable Labeling by Weight | [σw kə σw sə σs rət] | [σw mə σw tʃə σw lə σs tiː] | [σw prə σw kri σs tiː] | [σs saː σs laː σs naː] |
| ATS Rule-8: Syllable Extrametricality | [σw kə σw sə <σs rət>] | [σw mə σw tʃə σw lə σs tiː] | [σw prə σw kri <σs tiː>] | [σs saː σs laː <σs naː>] |
| ATS Rule-9: Syllable Relabeling, Downgrading | DNA | DNA | DNA | [σw saː σs laː <σs naː>] |
| ATS Rule-10: Syllable Relabeling, Upgrading | [σs kə σw sə <σs rət>] | [σw mə σs tʃə σw lə σs tiː] | [σs prə σw kri <σs tiː>] | DNA |
| ATS Rule-11: Internal Schwa Deletion | [σs kə σw s <σs rət>] | [σw mə σs tʃə σw l σs tiː] | DNA | DNA |
| ATS Rule-12: Resyllabification | [σs kəs <σs rət>] | [σw mə σs tʃəl σs tiː] | DNA | DNA |
| ATS Rule 14-15: Leveling and Stress Mark Assignment | [ˈkəsrət] | [məˈtʃəltiː] | [ˈprəkritiː] | [saːˈlaːnaː] |

Table 2: Rule Application and output

Table 3 (column headings: The Phenomena Tested | Performance of the proposed system)

8. References
Writing Systems Research, vol. 6, no. 1, pp. 105–119, 2014.
[3] P. Pandey, "Akshara-to-sound rules for Hindi," Writing Systems Research, vol. 6, no. 1, pp. 54–72, 2014.
[4] P. Patel, "Akshara as a linguistic unit in Brāhmī scripts," The Indic scripts: Paleographic and linguistic perspectives, pp. 167–215, 2007.
[5] A. W. Black and P. A. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," 1997.
[6] D. H. Klatt and D. W. Shipman, "Letter-to-phoneme rules: A semi-automatic discovery procedure," The Journal of the Acoustical Society of America, vol. 72, no. S1, pp. S48–S48, 1982.
[7] M. Choudhury and A. Basu, "A rule-based schwa deletion algorithm for Hindi," in Proc. International Conference On Knowledge-Based Computer Systems, 2002, pp. 343–353.
[8] B. Narasimhan, R. Sproat, and G. Kiraz, "Schwa-deletion in Hindi text-to-speech synthesis," International Journal of Speech Technology, vol. 7, no. 4, pp. 319–333, 2004.
[9] S. Hussain, "Letter-to-sound conversion for Urdu text-to-speech system," in Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. Association for Computational Linguistics, 2004, pp. 74–79.
[10] S. Kishore, R. Kumar, and R. Sangal, "A data driven synthesis approach for Indian languages using syllable as basic unit," in Proceedings of Intl. Conf. on NLP (ICON), 2002, pp. 311–316.
[11] P. Pandey, "Phonology–orthography interface in Devanāgarī for Hindi," Written Language & Literacy, vol. 10, no. 2, pp. 139–156, 2007.
[12] R. Frost, "A universal approach to modeling visual word recognition and reading: Not only possible, but also inevitable," Behavioral and Brain Sciences, vol. 35, no. 05, pp. 310–329, 2012.
[13] C. Rao, S. Soni, and N. C. Singh, "The case of the neglected alphasyllabary: orthographic processing in Devanagari," Behavioral and Brain Sciences, vol. 35, no. 05, pp. 302–303, 2012.
[14] R. W. Sproat and L. Qin, A computational theory of writing systems. MIT Press, 2000.
[15] M. Ohala, Aspects of Hindi phonology. Motilal Banarsidass Publishers, 1983, vol. 2.
[16] A. R. Kelkar, Studies in Hindi-Urdu. 1, Introduction and word phonology, 1968.
[17] P. K. Pandey, "Hindi schwa deletion," Lingua, vol. 82, no. 4, pp. 277–311, 1990.
[18] B. Hayes, Metrical stress theory: Principles and case studies. University of Chicago Press, 1995.
[19] P. K. Pandey, "Word accentuation in Hindi," Lingua, vol. 77, no. 1, pp. 37–73, 1989.
[20] A. Ramakrishnan and M. Laxmi Narayana, "Grapheme to phoneme conversion for Tamil speech synthesis," in