Multilingual and Multimode Phone Recognition System for Indian Languages
Kumud Tripathi, M. Kiran Reddy and K. Sreenivasa Rao
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India
Abstract
The aim of this paper is to develop a flexible framework capable of automatically recognizing the phonetic units present in a speech utterance of any language spoken in any mode. In this study, we consider two modes of speech, conversation and read, in four Indian languages, namely, Telugu, Kannada, Odia, and Bengali. The proposed approach consists of two stages: (1) automatic speech mode classification (SMC) and (2) automatic phonetic recognition using a mode-specific multilingual phone recognition system (MPRS). In this work, vocal tract and excitation source features are considered for the SMC task. SMC systems are developed using multilayer perceptrons (MLPs). Further, vocal tract, excitation source, and tandem features are used to build the deep neural network (DNN)-based MPRSs. The performance of the proposed approach is compared with mode-dependent MPRSs. Experimental results show that the proposed approach, which combines SMC and MPRS into a single system, outperforms the baseline mode-dependent MPRSs.
Keywords:
Multimode; multilingual; phone recognition; mode classification.
1. Introduction
The objective of a phone recognition system (PRS) is to convert a speech signal into a sequence of phones. A PRS can be developed using data from a single language [1, 2] or from multiple languages [3, 4, 5].
A PRS trained with data from more than one language is known as a multilingual PRS (MPRS). In the literature, there are very few studies on multilingual phone recognition. Schultz et al. proposed multilingual acoustic models for read speech recognition [3, 4], where large vocabulary speech recognition systems were investigated for 15 languages. In [6], a unified approach for the development of a hidden Markov model (HMM) based multilingual speech recognizer is proposed. The study considered two acoustically similar languages, Tamil and Hindi, along with an acoustically very different language, American English; the Bhattacharyya distance measure is used to group the acoustically similar phones across the considered languages. In [7], a syllable-based multilingual speech recognizer is developed using three Indian languages: Telugu, Tamil, and Hindi. Here, vowel onset points are used as anchor points to derive syllable-like units, and similar consonant-vowel units across the considered languages are merged to train the multilingual recognizer. Mohan et al. [8] developed a small-vocabulary multilingual speech recognizer using two linguistically similar Indian languages, Marathi and Hindi; multilingual speech data collected over mobile telephones are used to train a subspace Gaussian mixture model, and speaker variations are handled by a cross-corpus acoustic normalization technique. In [5], a deep neural network (DNN)-based MPRS is developed using MFCC and tandem features for read speech. The study considered speech utterances from four Indian languages, namely, Telugu, Kannada, Odia, and Bengali; acoustically similar units from the various languages are grouped using the International Phonetic Alphabet (IPA) transcription for training the MPRS.

In general, speech can be broadly classified into two modes, namely, read and conversation modes [9, 10, 11]. Read mode is a formal mode of speech where a person speaks in a constrained environment, for example, news broadcasts on television. On the other hand, conversation speech is spontaneous, informal, unstructured, and unorganized. Generally, an MPRS is trained and tested using data from the same speech mode (read or conversation) and is then viewed as a mode-dependent MPRS. The performance of a mode-dependent MPRS is affected when the input utterance belongs to a different mode of speech, because of the mismatch in the acoustic characteristics of speech signals across modes. Indian TV news channels such as News 18, Zee News, etc. represent a realistic scenario where read and conversation modes occur together: the read speech is delivered by a single speaker such as the newsreader, whereas the conversation speech is delivered by a group of people discussing a particular topic. Whenever data is captured in this scenario, the two modes of speech are mixed, which reduces the accuracy of phone recognition. To improve phone recognition accuracy on multimode speech signals, a framework is required that can decode both modes (read and conversation) accurately. It is also observed on the video-sharing website YouTube that speech transcriptions are auto-generated for English news channels, whereas no transcription is produced for news channels in Indian languages.
So, the current study is motivated by the recognition of speech for Indian languages in read and conversation modes.

There exists no previous work on an MPRS that accepts speech utterances from multiple modes at its input and generates a sequence of phonetic units at its output. In this work, we develop a framework for automatically recognizing the phonetic units present in a speech utterance of any language spoken in any mode. The proposed approach combines a speech mode classification (SMC) system and MPRSs into a single framework. The SMC system is used as a front end to recognize the mode of the input speech utterance, which is then given to the corresponding mode-specific MPRS to decode the phonetic units. Previous studies have explored the characteristics of various speech modes, such as neutral (read), soft, loud, whispered, and shouted. Hansen et al. [12] analyzed the characteristics of neutral mode and compared it with loud, soft, and stressed modes. Rostolland presented the phonetic structure and acoustic features of shouted mode in [13] and [14], respectively. Zhang et al. [15] studied and compared the vocal characteristics of neutral, whispered, soft, shouted, and loud speech modes. However, no previous work has explored the conversation speech mode. An utterance in conversation mode may include other speech modes such as neutral, soft, loud, whispered, and shouted. Hence, it is important to study the acoustic characteristics of conversation mode, as well as the acoustic differences between the conversation and read modes, for recognizing speech in real-life applications. Therefore, read and conversation modes are considered in this study for developing the speech mode classification systems. As the multilayer perceptron (MLP) is a standard classifier for various speech applications [16, 17, 18], the SMC systems in this work are developed using MLPs. The vocal tract and excitation source information have been investigated for developing the SMC system. While Mel-frequency cepstral coefficients (MFCCs) represent the vocal tract information, the suprasegmental-level epoch strength contour (ESC) and pitch contour (PC) represent the excitation source information. Further, the scores of the vocal tract and excitation source systems are combined to improve the performance of the SMC system. The mode-dependent MPRSs are developed using DNNs and are trained using a combination of MFCCs, tandem features, and excitation source features. In this work, the development of the SMC systems and MPRSs is carried out using data from four Indian languages: Telugu, Kannada, Odia, and Bengali. The significance of the proposed framework is shown by comparing its performance with two mode-specific MPRSs.

The rest of the paper is organized as follows. Section 2 provides details about the speech corpora used in this work. Feature extraction techniques are described in Section 3. The significance of excitation source features for speech mode classification is discussed in Section 4. The development of the MPRS is elaborated in Section 5. The proposed speech mode classification model is discussed in Section 6. The evaluations of the proposed SMC and MPRS are carried out in Sections 7 and 8, respectively. Finally, conclusions and future work are presented in Section 9.
2. Speech corpus
In this work, speech corpora of four Indian languages, Telugu, Kannada, Odia, and Bengali, are considered for developing both the speech mode classification models and the MPRSs. The speech corpora were collected as part of a consortium project titled
"Prosodically guided phonetic engine for searching speech databases in Indian languages", supported by DIT, Govt. of India. The corpora contain speech data in read and conversation modes. In this study, read speech is collected from news reading, and the conversational speech is collected from news interviews. Complete details of the speech corpora are given in [19, 20].
Figure 1: Pitch contours of Bengali utterances in read and conversation modes are shown in 1(a) and 1(b), and pitch contours of Odia utterances in read and conversation modes are shown in 1(c) and 1(d), respectively. F1, F2, F3, and F4 are distinct female speakers, and M1, M2, M3, and M4 are distinct male speakers.

The speech signals are sampled at a rate of 16 kHz with 16 bits per sample. The duration of each wave file varies between 6 and 8 s. For all the speech signals, a phonetically and prosodically rich transcription is derived using the International Phonetic Alphabet (IPA) chart; an IPA transcription provides one symbol for every unique sound unit independent of the language. As a pre-processing step for SMC, we removed silence regions from the speech signals, and each speech file is chopped to a fixed duration of 5 s; a sketch of this preprocessing step is given after Table 1. Table 1 details the numbers of male and female speakers used for training and testing. The first column contains the names of the languages, and the corresponding mode is listed in the second column. The next two columns show the numbers of male and female speakers used for training the models, and the last two columns show the numbers of male and female speakers used for testing. Altogether, there are 12 distinct speakers per mode of a language for training and 6 distinct speakers per mode of a language for testing. Each training speaker has spoken 150 utterances, whereas each testing speaker has spoken 50 utterances. Note that the speakers considered for training and testing are disjoint and randomly selected.
Table 1: The number of male and female speakers (for training and testing) from read and conversation modes of Bengali, Odia, Telugu, and Kannada languages. (Abbreviations: SM: Speech Mode, M: Male, F: Female)
Language      SM     Training (M)  Training (F)  Testing (M)  Testing (F)
Bengali (B)   Conv   6             6             3            3
              Read   6             6             4            2
Odia (O)      Conv   5             7             3            3
              Read   6             6             3            3
Telugu (T)    Conv   6             6             2            4
              Read   6             6             3            3
Kannada (K)   Conv   6             6             3            3
              Read   5             7             3            3
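The silence removal and fixed-duration chopping described above can be sketched as follows. This is an illustrative sketch that assumes librosa is used for energy-based silence detection; the paper does not name the tool, and the top_db threshold and the zero-padding of short files are assumptions. Only the 16 kHz sampling rate and the 5 s target duration come from the text.

# Illustrative preprocessing sketch for SMC (librosa assumed; not the authors' exact tooling).
import numpy as np
import librosa

def preprocess_for_smc(wav_path, sr=16000, target_dur=5.0, top_db=30):
    """Remove silence regions and chop the utterance to a fixed 5 s duration."""
    y, _ = librosa.load(wav_path, sr=sr)
    # Keep only non-silent intervals (energy-based detection; threshold is a guess).
    intervals = librosa.effects.split(y, top_db=top_db)
    voiced = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y
    # Chop (or zero-pad) to exactly 5 seconds, as done before SMC training.
    n_target = int(target_dur * sr)
    if len(voiced) >= n_target:
        return voiced[:n_target]
    return np.pad(voiced, (0, n_target - len(voiced)))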
3. Feature Extraction
This section describes the methods for extracting the features used for speech mode classification and multilingual phone recognition. The vocal tract and excitation source features explored for developing the speech mode classification model are discussed in Section 3.1. The vocal tract, excitation source, and tandem features used for training the multilingual PRS are explained in Section 3.2.
In this work, SMC models are developed using vocal tract and excitation source features. The 13-dimensional MFCCs along with ∆ and ∆∆ coefficients are used to capture the vocal tract information; the ∆ and ∆∆ coefficients correspond to the first- and second-order derivatives of the MFCCs, respectively. The vocal tract parameters are extracted at the frame level using a frame size of 25 ms, a frame shift of 10 ms, and a Hamming window.

The excitation source features describe the variations in the vibration of the vocal folds while producing voiced segments of speech. In this work, the speech signal is parametrized over 100 ms frames at the suprasegmental level to capture the mode-specific excitation source information. The pitch contour (PC) and epoch strength contour (ESC) represent the suprasegmental-level excitation source information. Pitch represents the vibration frequency of the vocal folds during speech production; speakers use pitch to signal the salience of words, for example, a higher pitch implies that a word is more important than the other words in an utterance. In this work, the pitch and epoch strength contours are computed using the zero frequency filtering (ZFF) approach [21]. The ZFF method estimates the pitch and epoch strength by locating epochs in the speech signal. The interval between two successive epochs gives the pitch period t, and its reciprocal gives the pitch (p = 1/t) [22]. The epoch strength corresponds to the slope around the positive zero crossings at the epoch locations in the ZFF signal; the slope is estimated as the difference between successive samples of the ZFF signal around the zero crossings [21].

The vocal tract and source features are extracted at the frame level. Excitation source features processed at the frame level show similar variation across the considered speech modes; however, their sentence-level patterns differ clearly between modes and are largely independent of the speaker. This observation can be validated by examining the pitch patterns shown in Figure 1. Hence, it is appropriate to process the excitation source information at the sentence level and the vocal tract information at the frame level. Therefore, in this work, the mode-discriminative characteristics of the vocal tract and source features are studied separately. The model developed using MFCCs is trained and tested at the frame level, and majority voting [23] is then applied to take the decision at the sentence level. In the excitation source model, training and testing are done at the sentence level. Further, the scores generated at the sentence level by the vocal tract and excitation source models are combined using a weighted score fusion technique to enhance the overall performance of the SMC model.

In this work, the multilingual phone recognition system is trained using vocal tract, excitation source, and tandem features. To capture the vocal tract information for the MPRS, 13-dimensional MFCCs along with ∆ and ∆∆ coefficients are extracted using a frame size of 25 ms with a shift of 10 ms.

Table 2: Correlation coefficients across the speech modes (SM) of multiple languages using pitch contour (PC) and epoch strength contour (ESC).
ID   Language   SM   PC (WM)   PC (BM)   ESC (WM)   ESC (BM)

Mel power differences of subband spectra (MPDSS) represent excitation source information derived by passing the spectrum through a Mel-filter bank. For each filter bank, PDSS coefficients are computed, and all extracted PDSS coefficients together constitute the MPDSS coefficients. In this study, 25-dimensional MPDSS coefficients are chosen for training the MPRSs.

Residual Mel-frequency cepstral coefficients (RMFCCs) are generated when the residual signal is used for deriving MFCC features. The framewise RMFCCs are computed from the 10th-order LP residual; the procedure is identical to the estimation of speech MFCCs except that the input is the residual signal instead of the speech signal. In this work, 13-dimensional RMFCCs along with ∆ and ∆∆ coefficients are computed using a frame size of 25 ms with a 10 ms frame shift for developing the MPRSs.

Tandem features are the phone posteriors generated after training a classifier on spectral features. Tandem features are also known as discriminative features, as they are produced by a discriminative classification model. In this work, we follow the same procedure as discussed in [5] for extracting the tandem features: the spectral MFCC features are used to train a discriminative deep neural network (DNN) classifier, from which the phone posteriors are extracted. The dimension of the feature vector is 44, which is the total number of unique phones present in the considered speech corpus (discussed in Section 2).
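As a concrete illustration of the frame-level features described above, the following sketch extracts 13-dimensional MFCCs with ∆ and ∆∆ coefficients (25 ms frames, 10 ms shift) and RMFCCs from the 10th-order LP residual. It assumes librosa and scipy; the authors' exact extraction tools are not stated, so this is a sketch rather than their implementation.

# Sketch of MFCC+delta+delta-delta and RMFCC extraction (librosa/scipy assumed).
import numpy as np
import librosa
from scipy.signal import lfilter

def mfcc_with_deltas(y, sr=16000, n_mfcc=13):
    """13 MFCCs per 25 ms frame with 10 ms shift, plus delta and delta-delta."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])  # 39 x n_frames

def rmfcc_with_deltas(y, sr=16000, lp_order=10):
    """RMFCCs: MFCCs computed on the 10th-order LP residual of the speech signal."""
    a = librosa.lpc(y, order=lp_order)      # prediction-error filter A(z)
    residual = lfilter(a, [1.0], y)         # inverse filtering gives the LP residual
    return mfcc_with_deltas(residual, sr=sr)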
4. Significance of excitation source features for mode classification
The importance of the pitch and epoch strength contours for the mode classification task is illustrated by their respective correlation coefficients (CCs) [24], computed within and between modes across the languages, in Table 2. Correlation coefficients measure the strength of the relationship between speech signals. Assume R and Q are two finite-duration speech signals. The CC between R and Q can be calculated as

C(R, Q) = \frac{1}{L} \sum_{i=1}^{L} \frac{R_i - \mu_R}{\sigma_R} \cdot \frac{Q_i - \mu_Q}{\sigma_Q}    (1)

where \mu_R and \sigma_R are the mean and standard deviation of R, \mu_Q and \sigma_Q are the mean and standard deviation of Q, and L is the number of samples in each signal. The value of the correlation coefficient is maximum when the signals are similar and minimum when the signals are orthogonal; if the signals are unrelated, the coefficient is close to zero.

For each language, data from four speakers (two speakers per mode) is considered for computing the correlation coefficients. Note that the speakers are distinct for each language and each speaker has uttered 50 distinct utterances. To normalize speaker variability across languages, mean subtraction is applied to the feature vectors of all languages. The correlation coefficients among modes of monolingual speech signals are shown under IDs 1-4, whereas those among modes of multilingual speech signals are shown under ID 5. The values in the fourth and fifth columns of Table 2 denote the CCs within the modes (WM) and between the modes (BM) using the pitch contour; similarly, the values in the sixth and seventh columns denote the CCs within and between the modes using the epoch strength contour. For computing the average CC within a mode of a language, 50 distinct utterances spoken by speaker 1 of that mode are correlated with 50 distinct utterances spoken by speaker 2 of the same mode; the average CC between modes is computed using 50 distinct utterances from each mode of a language. In ID 1 of Table 2, the first value of the fourth and sixth columns gives the average CC within the conversation (abbreviated as Conv) mode of the Bengali language using PC and ESC, respectively, while the first value of the fifth and seventh columns gives the average CC of the conversation mode with respect to the read mode of Bengali using PC and ESC, respectively. A low average CC value indicates high dissimilarity, and a high average CC value indicates high similarity between pitch or epoch strength contours. From the table, it can be seen that the average CC values within a mode of any language are much higher than the CC values with respect to other modes, which shows that the pitch and epoch strength contours have significant mode discrimination capability.

In Table 2, ID 5 (K-T-B-O) shows the correlation coefficients within and between modes across the languages. In ID 5, the last value of the fourth and sixth columns gives the average CC within the read mode of multilingual speech using PC and ESC, respectively, while the last value of the fifth and seventh columns gives the average CC of the read mode with respect to the conversation mode using PC and ESC, respectively. The average CC of read mode with respect to conversation mode (0.12) is lower than the CC within the read mode (0.35) using PC, and similar trends are observed for the conversation mode. Hence, both the PCs and ESCs are highly similar within modes and highly dissimilar between modes, across the languages.
This analysis confirms that the pitch and epoch strength contours contain mode-discriminating information.
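A minimal numpy sketch of the correlation coefficient in Eq. (1) is given below, applied here to two contours. The contours are assumed to have been brought to a common length L, which the text does not spell out; the contour variables in the usage comment are hypothetical.

# Correlation coefficient of Eq. (1) between two contours of equal length L.
import numpy as np

def correlation_coefficient(r, q):
    """C(R, Q) = (1/L) * sum_i ((R_i - mu_R)/sigma_R) * ((Q_i - mu_Q)/sigma_Q)."""
    r, q = np.asarray(r, float), np.asarray(q, float)
    assert r.shape == q.shape, "contours assumed to be resampled to the same length"
    r_norm = (r - r.mean()) / r.std()
    q_norm = (q - q.mean()) / q.std()
    return float(np.mean(r_norm * q_norm))

# Example: within-mode contours correlate strongly, between-mode contours weakly.
# pc_read_1, pc_read_2, pc_conv_1 would be sentence-level pitch contours.
# wm = correlation_coefficient(pc_read_1, pc_read_2)   # within-mode CC
# bm = correlation_coefficient(pc_read_1, pc_conv_1)   # between-mode CC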
5. Development of Multilingual Phone Recognition System
A PRS trained with data from multiple languages is called a multilingual PRS. The open-source Kaldi speech recognition toolkit [25] is used for building the speech recognizers. The important components in building a phone recognition system with Kaldi are feature extraction, acoustic and language modeling, and decoding of the phone sequence. The vocal tract, excitation source, and tandem features (discussed in Section 3.2) are used for developing the multilingual phone recognition systems. DNNs are trained as in Zhang et al. [26] for acoustic modeling. For language modeling (LM), the prior probability of a phone sequence (the LM score) is estimated by learning the relation between phones from the training data; the language model generates a more accurate score when prior information about the speech task is available [27]. In this work, the effect of the LM on the performance of the MPRS is also analyzed. The speech corpus described in Section 2 is used for training the MPRSs.

The objective of an MPRS is to find the phone sequence \hat{H} whose likelihood given a sequence of feature vectors A is maximum. Eq. (2) shows the mathematical formulation for estimating the decoded phone sequence:

\hat{H} = \arg\max_H P(H | A)    (2)

The term P(H | A) can be expanded by applying Bayes' rule as

P(H | A) = \frac{P(A | H) P(H)}{P(A)}    (3)

Acoustic modeling is used for calculating the likelihood P(A | H), whereas the prior probability of the phone sequence, P(H), is computed using language modeling. P(A) is the prior probability of the feature vectors; it can be dropped from Eq. (2) as it is independent of the acoustic and linguistic scores. In the current work, a bi-gram language model [28] is used for determining P(H). The final decoded sequence can then be computed as

\hat{H} = \arg\max_H [\log P(A | H) + \alpha \log P(H)]    (4)

where \alpha is the LM scaling factor, applied to balance the acoustic and language model scores. A toy illustration of this decoding rule is given after Figure 2.

The block diagram of the baseline MPRS is shown in Figure 2. The MPRS explored in [5] is considered as our baseline MPRS; it uses MFCC and tandem features and is developed using four Indian languages, namely, Telugu, Kannada, Odia, and Bengali. In [5], the authors developed an MPRS for read-mode speech, which we refer to as the read-baseline MPRS. In this work, we additionally develop a separate MPRS for conversation mode, named the conversation-baseline MPRS. The conversation-baseline and read-baseline MPRSs are developed in the same manner (as discussed in [5]); the read-baseline MPRS is trained on read mode, and the conversation-baseline MPRS on conversation mode, of speech from the Telugu, Kannada, Odia, and Bengali languages.

Table 3 shows the recognition accuracies of the read-baseline MPRS and conversation-baseline MPRS for both read and conversation modes of speech. The read-baseline and conversation-baseline MPRSs are mode-dependent systems.

Table 3: Phone recognition accuracy (%) of the read-baseline MPRS and conversation-baseline MPRS for read and conversation modes of speech.
Model                 Read (%)   Conversation (%)
Read-Baseline MPRS    64.45      33.50
Conv-Baseline MPRS    30.30      64.23
The baseline MPRSs are trained using mode-dependent data from the four languages Telugu, Kannada, Odia, and Bengali, as discussed in Section 2. For evaluating the performance of these systems, test data is taken from both modes of the languages given in Section 2. From Table 3, we can see that the read-baseline MPRS gives better accuracy when tested with read mode (64.45%) than with conversation mode (33.50%). Similarly, the conversation-baseline MPRS performs better when tested with conversation mode (64.23%) than with read mode (30.30%). This indicates that the performance of mode-specific MPRSs degrades when there is a mismatch between the modes of the training and test utterances. As noted in Section 1, speech from Indian TV news channels may contain read and conversation modes simultaneously; in such realistic cases, the baseline systems will fail to recognize speech from different modes reliably. Hence, to achieve optimal performance, the speech signal of a particular mode needs to be processed by the corresponding mode-specific MPRS. One way to achieve this is by manually tagging the mode of each input utterance; however, this is cumbersome and not feasible in a real-time scenario. Hence, a system is required that can automatically identify the mode of the input speech, and it is even more advantageous if such a system is language independent. This necessity motivated us to develop a speech mode classification (SMC) system that can automatically detect the mode of input speech of any language. From the literature survey, we note that there is no prior work on multilingual speech mode classification. The following sections discuss the proposed approach, which integrates the SMC system and the MPRSs into a single framework.

Figure 2: Illustration of the baseline MPRS for multimode speech signals of multiple languages.
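To make the decoding rule of Eq. (4) concrete, the toy sketch below scores candidate phone sequences by combining an acoustic log-likelihood with a bigram LM log-probability under a scaling factor α. The candidate scores and bigram probabilities are made up purely for illustration; in practice Kaldi performs this search over a decoding graph rather than an explicit candidate list.

# Toy illustration of Eq. (4): H_hat = argmax_H [log P(A|H) + alpha * log P(H)].
import math

def bigram_logprob(phones, bigram):
    """log P(H) under a bigram LM; bigram maps (prev, cur) -> probability."""
    lp = 0.0
    for prev, cur in zip(["<s>"] + phones[:-1], phones):
        lp += math.log(bigram.get((prev, cur), 1e-6))  # floor unseen bigrams
    return lp

def decode(candidates, bigram, alpha=10.0):
    """candidates: list of (phone_sequence, acoustic_log_likelihood)."""
    return max(candidates,
               key=lambda c: c[1] + alpha * bigram_logprob(c[0], bigram))[0]

# Hypothetical example with two candidate sequences for one utterance.
bigram = {("<s>", "a"): 0.4, ("a", "m"): 0.3, ("a", "n"): 0.1,
          ("<s>", "e"): 0.2, ("e", "m"): 0.2}
candidates = [(["a", "m"], -120.5), (["e", "m"], -118.9)]
print(decode(candidates, bigram, alpha=10.0))   # -> ['a', 'm']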
The block diagram of the proposed two-stage phone recognition system is shown in Figure 3. The first stage performs speech mode classification, and the second stage performs phone recognition. At the first stage, the speech mode classifier automatically detects the mode of the input test speech, i.e., read mode or conversation mode. At the second stage, the phonetic units are decoded by processing the input speech signal through the corresponding mode-dependent MPRS. Each mode-dependent MPRS is trained using phonetic data of a particular mode from the four languages Telugu, Kannada, Odia, and Bengali (discussed in Section 2). The proposed MPRSs are developed using the same experimental setup as the baseline system [5]. The difference between the baseline and proposed MPRSs is that the baseline MPRSs use a combination of MFCCs and tandem features, whereas the proposed MPRSs use a combination of MFCCs, tandem features, and excitation source features. The proposed model can thus automatically detect the mode of the speech signal and produce the corresponding phonetic transcription. This framework is referred to as COMB-MPRS for the rest of the paper.

Figure 3: Block diagram of the proposed framework of the two-stage MPRS.
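The two-stage COMB-MPRS routing can be summarized by the following sketch. Here classify_mode, read_mprs, and conv_mprs are placeholder callables standing in for the MLP-based SMC model and the DNN-based mode-specific MPRSs described in the surrounding sections; they are not part of any published API.

# Sketch of the two-stage COMB-MPRS: the SMC front end routes each utterance
# to the MPRS trained on the detected mode (read or conversation).
def comb_mprs(utterance, classify_mode, read_mprs, conv_mprs):
    """classify_mode, read_mprs and conv_mprs are placeholder callables."""
    mode = classify_mode(utterance)           # stage 1: 'read' or 'conversation'
    decoder = read_mprs if mode == "read" else conv_mprs
    return mode, decoder(utterance)           # stage 2: decoded phone sequence

# Usage (hypothetical models):
# mode, phones = comb_mprs(wav, smc_model.predict_mode,
#                          read_model.decode, conv_model.decode)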
6. Development of Proposed Multilingual Speech Mode Classification Model
The objective of the multilingual speech mode classification (SMC) model is to identify whether a speech utterance, spoken in any of the considered Indian languages, is in read mode or conversation mode. The multilingual SMC model is a language-independent SMC model within the set of training languages: to handle a particular language, training data from that language has to be included, because speech mode characteristics vary differently among languages and the multilingual SMC model is biased towards the languages used for training it. In this study, the SMC model is developed using vocal tract and excitation source features. As shown in Figure 4, the SMC models are designed in three stages. The multilayer perceptron (MLP) is used for building the classification models; the main reason for using the MLP is that it can capture the nonlinear relations hidden in the multi-dimensional feature patterns of the input utterance, and it also generalizes well to unseen input vectors. To improve the mode classification performance, scores are combined at different stages, as shown in Figure 4. Score-level fusion is described in Section 6.5, and Sections 6.2-6.4 briefly introduce each stage.

Figure 4: Block diagram of the SMC model using vocal tract and excitation source features.
In this work, the MLP is used to map a given speech signal to a speech mode by computing invariant and discriminant representations through its nonlinear processing. The MLP is a feed-forward neural network [29] with one or more hidden layers between its input and output layers. Several parameters have to be tuned in an MLP, such as the number of hidden layers, the number of nodes in each layer, and the learning rate; finding these parameters is necessary for optimizing the MLP. According to [30], an MLP with a single hidden layer can approximate any nonlinear function with optimal accuracy; however, the required number of neurons in the hidden layer is not stated in [30]. Hence, in this work, we explore a three-layer multilayer perceptron. The three-layer MLP is initialized with a learning rate of 0.005. The tanh nonlinearity and the softmax activation function are used at the hidden and output layers, respectively. We explored various network structures for both the excitation source and vocal tract based speech mode classification tasks. For each model, the number of nodes at the input layer, p, is equal to the dimension of the input feature vectors. The number of nodes at the output layer equals the number of classes; since this is a two-class problem, r = 2 for all models. The number of units in the hidden layer, q, for each model is decided after experiments on large training datasets. The explored network structures for the excitation source and vocal tract based SMC models are specified in stage-1 and stage-2 (see Sections 6.2 and 6.3), respectively. Further, the standard back-propagation algorithm (stochastic gradient descent) [31] is used for training the MLP to minimize the root mean squared error between the actual and predicted outputs, and this error is also used as the objective measure for analyzing the classification of the speech modes.

The goal of stage-1 is to build separate SMC models for the excitation source features. The excitation source features are captured from the pitch and epoch strength contours of the speech signals, and each feature carries distinct mode-related information. Therefore, two different SMC models are developed using MLPs at stage-1: (i) the SMC model using the pitch contour, denoted PC-SMC, and (ii) the SMC model using the epoch strength contour, denoted ESC-SMC. The numbers of nodes at the input and hidden layers for the pitch contour based SMC model are p = 500 and q = 56, respectively; a similar network structure is used for the epoch strength contour based SMC model. These two MLP models are trained and tested at the sentence level. The training of the PC-SMC and ESC-SMC models is terminated after 200 epochs, as there is no substantial decrease in the error when the number of epochs is increased further.

The target of stage-2 is to design the SMC model using the full excitation source features and to examine the mode discrimination information present in the excitation source and vocal tract features. Hence, at this stage, two separate SMC models are developed using excitation source and vocal tract features. The excitation source based SMC model is developed by fusing the scores of the pitch contour and epoch strength contour based SMC models from stage-1 and is called the integrated Src-SMC model; the fusion of scores is performed on the test data. Further, the vocal tract based SMC model is designed using the MFCC+∆+∆∆ features and is named the VT-SMC model.
In the VT-SMC model, a 39-dimensional feature vector is extracted for each frame of an utterance. The MLP model developed using vocal tract features is trained and tested at the frame level; after testing, majority voting is applied to generate scores at the sentence level. The number of epochs required for training the VT-SMC model is 600, and the numbers of nodes at the input and hidden layers are p = 39 and q = 21, respectively.

The objective of stage-3 is to design a single integrated SMC model. Here, the integrated model is developed by combining the scores of the excitation source and vocal tract based SMC models of stage-2 and is denoted the Src-VT-SMC model. It is worth noting that the integrated model (Src-VT-SMC) outperforms the SMC models developed using stand-alone features at almost the same computation time.
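A minimal sketch of the three-layer MLP configuration described above is given below, using scikit-learn as a stand-in for the authors' implementation, which is not named. The hyper-parameters (p = 500 inputs, q = 56 hidden nodes, tanh hidden units, 0.005 learning rate, SGD, 200 epochs for the PC-SMC model) come from the text; the loss used by MLPClassifier differs from the RMSE objective quoted in the paper, so this is only an approximation.

# Sketch of the PC-SMC classifier with the hyper-parameters quoted in the text
# (scikit-learn assumed; the authors' actual toolkit is not specified).
from sklearn.neural_network import MLPClassifier

pc_smc = MLPClassifier(hidden_layer_sizes=(56,),   # q = 56 hidden nodes
                       activation="tanh",          # tanh hidden units
                       solver="sgd",               # stochastic gradient descent
                       learning_rate_init=0.005,   # learning rate from the text
                       max_iter=200)               # training stopped after 200 epochs

# X_train: (n_sentences, 500) sentence-level pitch contours; y_train: mode labels (0/1).
# pc_smc.fit(X_train, y_train)
# mode_posteriors = pc_smc.predict_proba(X_test)   # scores reused later for fusion

The VT-SMC model would analogously use 39-dimensional frame-level inputs, 21 hidden nodes, 600 epochs, and sentence-level majority voting over the frame decisions.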
The score is an estimate of the relationship between the feature vectors of the training and testing utterances; it is represented by the posterior probability of each class for a speech frame or utterance. The evidence generated by the classifiers depends on the input features. In order to reduce the influence of less effective features and increase the impact of more significant features, the scores can be weighted; this is known as weighted score-level fusion. An adaptive weighted combination scheme [32] has been used for combining the scores of the speech mode classification (SMC) models. In this work, the weights are derived by searching over a range of weights for each model present in the respective stage (see Figure 4) on the development set, and the final optimal weights are used for evaluation. The idea behind the weighted score fusion is to reduce the total error on the evaluation set. The fused score is calculated as

\hat{S}(x) = \sum_{i=1}^{c} w_i S_i(x)    (5)

where c is the number of models, S_i(x) is the score generated by the i-th model for a development-set utterance x, w_i is the weight of the i-th model, and \hat{S}(x) is the weighted fusion score. The following constraints are applied while searching for the suitable set of weights:

\sum_{i=1}^{c} w_i = 1,  w_i \geq 0    (6)

The weighting factor w_i is varied with a step size of 0.01. For the Bengali language, we experimented with 98 distinct sets of weighting factors for the two models explored at stage-1; similarly, 98 distinct sets of weighting factors are explored for the two models present at stage-2. Out of these candidate weight sets, the one giving the best average accuracy of the integrated model is selected. At stage-2, the best average performance of the integrated Src-SMC model is obtained at fixed weighting values for the pitch contour and epoch strength contour based SMC models, and at stage-3, the best average accuracy of the integrated Src-VT-SMC model is obtained at fixed weighting values for the excitation source and vocal tract based SMC models. Afterwards, similar experiments are performed for each language to obtain the best weighting values. From the experiments, it is observed that the same computed weighting values give the best performance for each monolingual and multilingual SMC model. Therefore, we can report that the computed weighting values are language independent and can be used as a universal set of weights for developing the SMC models.
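The adaptive weighted score fusion of Eqs. (5) and (6) and the 0.01-step weight search can be sketched as below for the two-model case; the score arrays and labels are hypothetical development-set quantities, not data from the paper.

# Sketch of weighted score fusion (Eqs. 5-6) with a 0.01-step grid search over w1
# for two models; w2 = 1 - w1, and both weights are kept strictly positive.
import numpy as np

def fuse(scores_1, scores_2, w1):
    """S_hat(x) = w1 * S_1(x) + (1 - w1) * S_2(x); scores are class posteriors."""
    return w1 * scores_1 + (1.0 - w1) * scores_2

def search_weight(scores_1, scores_2, labels):
    """Pick w1 in {0.01, ..., 0.98} maximizing accuracy on the development set."""
    best_w, best_acc = None, -1.0
    for w1 in np.arange(0.01, 0.99, 0.01):        # 98 candidate weight pairs
        pred = fuse(scores_1, scores_2, w1).argmax(axis=1)
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_w, best_acc = round(float(w1), 2), acc
    return best_w, best_acc

# scores_pc, scores_esc: (n_utterances, 2) posteriors from PC-SMC and ESC-SMC;
# labels: true mode indices. The chosen weight is then reused at evaluation time.
# w1, acc = search_weight(scores_pc, scores_esc, labels)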
7. Evaluation of SMC models
In this work, the significance of the excitation source and vocal tract features is explored for developing speech mode classification models using multilayer perceptrons. We consider read and conversation modes of speech from four Indian languages, Telugu, Kannada, Odia, and Bengali, and explore both monolingual and multilingual speech corpora for designing the SMC models. A model trained using speech signals from the two speech modes of an individual language is called a monolingual SMC model; likewise, the model developed using speech signals from the two modes of all four Indian languages is called the multilingual SMC (MSMC) model. The models developed on the four monolingual speech corpora are treated as baselines for our proposed MSMC model (denoted K-T-B-O). The framework shown in Figure 4 is used for developing both the monolingual and multilingual SMC models. Table 4 shows the average classification performance of the SMC models developed at stage-1, stage-2, and stage-3 using the monolingual and multilingual speech corpora.

Table 4: Performance of stage-1, stage-2, and stage-3 SMC models developed using monolingual and multilingual speech corpora.
                              Average Classification Performance (%)
                          Stage-1          Stage-2                            Stage-3
ID  Training     Testing  PC-SMC  ESC-SMC  Src-SMC  MFCC-SMC  MFCC+∆+∆∆-SMC  Src-SMC+MFCC-SMC  Src-SMC+MFCC+∆+∆∆-SMC
1   Bengali (B)  Bengali  51.12   64.37    72.24    83.05     88.67          91.12             93.16
2   Odia (O)     Odia     52.12   66.42    75.15    84.25     89.64          92.02             94.29
3   Telugu (T)   Telugu   46.45   64.44    72.21    79.32     84.78          88.14             91.67
4   Kannada (K)  Kannada  44.19   60.21    69.43    76.72     82.61          87.07             90.87
5   K-T-B-O      Bengali  51.06   61.45    70.36    79.63     85.12          89.62             91.95
6                Odia     51.23   63.27    71.52    80.24     86.35          90.05             92.59
7                Telugu   45.17   63.34    69.12    75.19     79.43          86.42             90.54
8                Kannada  42.23   58.56    66.75    73.56     78.51          85.19             89.43
9                Average
In the table, the classification accuracies of the monolingual SMC models are given under IDs 1-4, and the classification accuracies of the MSMC model for the individual languages are given under IDs 5-8; the average performance of the MSMC model across languages is given under ID 9. The second and third columns indicate the languages involved in training and testing the models, and columns 4-10 contain the classification accuracies of the stage-1, stage-2, and stage-3 SMC models. A detailed explanation of the performance of the SMC models at each stage is given in Sections 7.1-7.3.
At stage-1, two different models are developed for each monolingual and multilingual speech corpus using the two suprasegmental excitation source features. The fourth and fifth columns of Table 4 show the mode classification performance achieved using the suprasegmental features, the pitch contour (PC) and the epoch strength contour (ESC). From the results, we can see that the average classification accuracies obtained with the epoch strength contour are better than those achieved with the pitch contour, for both the baseline and the proposed multilingual models. A similar observation was made while analyzing the correlation coefficients from the point of view of mode discrimination. This indicates that the epoch strength contour has better mode discrimination power than the pitch contour.
The aim of stage-2 is to develop two different speech mode classification models by processing the full excitation source and the vocal tract features. The full excitation source based model at stage-2 is obtained by fusing the scores from the pitch and epoch strength contour based models and is termed Src-SMC. The vocal tract based model at stage-2 is designed using the 39-dimensional MFCC+∆+∆∆ features; for comparison, a 13-dimensional MFCC feature is also considered for developing an SMC model. In columns 7 and 8, the average classification accuracies achieved with the MFCC+∆+∆∆-SMC model are better than those obtained with the MFCC-SMC model for each monolingual and multilingual speech corpus. Therefore, only the MFCC+∆+∆∆ based vocal tract feature is considered for developing the SMC model at stage-2 in Figure 4. Further, the average classification performance of the SMC model developed using the full excitation source features is lower than that of the models developed using MFCC+∆+∆∆, which indicates that the vocal tract features carry significantly more mode-specific information than the excitation source features.

At stage-3, two integrated speech mode classification models are developed by combining scores from (i) the Src-SMC and MFCC-SMC models (labelled Int-1-SMC) and (ii) the Src-SMC and MFCC+∆+∆∆-SMC models (labelled Int-2-SMC). In Table 4, the average mode classification accuracies of the integrated models are displayed in the ninth and tenth columns. The performance of the integrated models is better than that of the models based on individual features; for example, the accuracy of the Int-2-SMC model is better than the accuracies of the Src-SMC and MFCC+∆+∆∆-SMC models. Among the two integrated models, the Int-2-SMC model performs better than the Int-1-SMC model. Note that the integrated models are developed using score fusion rather than feature fusion.

From the results in Table 4, it is observed that the multilingual SMC models (IDs 5-8) provide results similar to those of the monolingual SMC models (IDs 1-4). Thus, a single MSMC model can replace multiple monolingual SMC models for mode classification of speech samples belonging to any of the considered languages. Further, it can be noticed that the MSMC models perform equally well for all the considered languages and are not biased towards any language. The results show that the Int-2-SMC model has the best average classification accuracy across the languages; hence, this integrated SMC model can be used as a front end of the multilingual PRS for optimally recognizing the phonetic units present in an input speech signal of any mode spoken in any of the languages.

The improved performance of the integrated model (Int-2-SMC) can be better understood from Figure 5, which shows the mode classification decisions of the VT-SMC, Src-SMC, and Int-2-SMC models for 25 random utterances from the multilingual test dataset (discussed in Section 2); the selected utterances belong to the read mode. In Figure 5, a red circle represents a correctly detected mode and a black circle a falsely detected mode of an utterance for the model shown on the y-axis. The performance of the Int-2-SMC model depends on the classification accuracy of the VT-SMC and Src-SMC models: it can be seen from Figure 5 that if both models fail, then the Int-2-SMC model also fails to classify the mode accurately.
For example, the mode of utterance 15 is falsely detected by the VT-SMC and Src-SMC models as well as by the Int-2-SMC model. Further analysis shows that the Src-SMC model correctly recognizes some utterances that the VT-SMC model does not, and vice versa. For example, the modes of utterances 2 and 4 are recognized correctly by the Src-SMC model but not by the VT-SMC model, whereas the modes of utterances 5 and 6 are recognized correctly by the VT-SMC model but not by the Src-SMC model. In the integrated model (Int-2-SMC), however, the modes of utterances 2, 4, 5, and 6 are all recognized correctly. This shows that the vocal tract and source features contain complementary mode-specific information, which results in the better classification accuracy of the Int-2-SMC model.

Figure 5: Illustration of mode classification for a subset of evaluation utterances using the VT-SMC, Src-SMC, and Int-2-SMC models. A correctly detected mode of an utterance is marked with a red circle, and a falsely detected mode with a black circle.
In this section, we analyze the classification accuracies for the conversation and read modes of each monolingual and multilingual speech corpus. Here, we consider the best SMC model (the Int-2-SMC model) for reporting the performance on the individual modes of each language in Table 5. In the first column of Table 5, the mode-wise performance of the monolingual models is shown under IDs 1-4 and that of the multilingual models under IDs 5-9. The second and third columns of Table 5 list the languages involved in training and testing the models, and the fourth and fifth columns show the performance on conversation and read mode speech data. From the results, it can be observed that the classification performance for read mode is better than that for conversation mode across both the monolingual and multilingual models. In this work, read speech consists of TV news reading and conversation speech consists of TV news interviews. In read mode, a speaker utters in a very restricted environment, whereas in conversation mode speakers have no restrictions while speaking and the speech can sometimes exhibit read-mode characteristics. This could be the reason for the better performance on read mode compared to conversation mode. It is observed that interviews mostly start with a neutral expression, which may cause some conversation speech to be misclassified. On the other hand, speakers in read mode sometimes emphasize keywords, which results in significant variation in acoustic characteristics and may cause misclassification of read speech.

The pitch contour is examined to better understand the acoustic and linguistic variations between the read and conversation modes. Informally, we analyzed the pitch variation of 30 utterances per mode per language. It is observed that the pitch pattern varies differently in each mode of a language; however, the variation is consistent within a specific mode of a language irrespective of the speaker. To visualize this analysis, example pitch contours are plotted for read and conversation modes using 16 distinct utterances spoken by 8 distinct speakers (4 male and 4 female) in Figure 1, with 2 sentences selected from each speaker. Figures 1(a) and 1(b) show the pitch contours of 8 distinct Bengali utterances in read and conversation modes, respectively, while Figures 1(c) and 1(d) show the pitch contours of 8 distinct Odia utterances in read and conversation modes, respectively. It can be observed from Figures 1(a,c) and 1(b,d) that the pitch contour in conversation mode is more dynamic than in read mode, irrespective of language and speaker. The presence of expression in conversation mode causes significant pitch variation, whereas the absence of emotion in read speech results in an almost flat pitch contour. Moreover, some of the pitch contours in Figures 1(b,d) behave like read speech because of low expressiveness in those sentences, and some of the pitch contours in Figures 1(a,c) behave like conversation speech because keywords are highlighted in those utterances.
Hence, the read and conversation modes are mostly acoustically and linguistically distinct, except for some cases (discussed above) where the two modes have similar acoustic characteristics.

From Table 5, it can be observed that the multilingual SMC model (labeled K-T-B-O) performs similarly to the monolingual SMC models (labeled Bengali, Odia, Telugu, and Kannada) in classifying the speech modes. Hence, the proposed multilingual SMC model can reliably be used for determining the mode of input speech spoken in any of the Indian languages used for developing the model.
Table 5: Classification accuracy (%) of conversation and read modes for the monolingual and multilingual Int-2-SMC models.
ID   Training   Testing   Conversation (%)   Read (%)
     Average
8. Evaluation of phone recognition systems
In this section, we evaluate the performance of the baseline and the proposed MPRSs. The baseline and proposed MPRSs are trained and tested using data from the four Indian languages Telugu, Kannada, Odia, and Bengali, discussed in Section 2. In this work, two separate baseline MPRSs are developed, one each for the read and conversation modes, named the read-baseline and conv-baseline MPRSs, respectively; these baseline mode-dependent MPRSs are developed using MFCCs and tandem features. In addition, two separate proposed MPRSs are developed, one each for the read and conversation modes, named the read-proposed and conv-proposed MPRSs. A two-stage system named COMB-MPRS is proposed, which includes a speech mode classifier (with a maximum performance of 91.10%, as shown in Table 4) at the first stage, acting as a switch between the mode-dependent MPRSs (read-proposed MPRS and conv-proposed MPRS) present at the second stage (see Figure 3). The proposed mode-dependent MPRSs are developed using a combination of MFCCs, tandem features, and two excitation source features, MPDSS and RMFCC. The development of the SMC model is described in detail in Section 6.

For performance evaluation, the decoded transcription is compared with the original transcription, and Eq. (7) is used for computing the phone error rate E:

E = \frac{S + D + I}{N}    (7)

where N is the total number of phones in the original transcription, D is the number of deletion errors, S is the number of substitution errors, and I is the number of insertion errors in the decoded output. A sketch of this computation is given after Table 6.

Table 6: Recognition accuracy (%) of phones from read and conversation (Conv) modes of speech for the baseline and proposed MPRSs.
Feature                  MPRS             Read     Conversation   Average
MFCC+Tandem              Read-Baseline    64.45    33.50          48.97
(Baseline)               Conv-Baseline    30.30    64.23          47.26
MFCC+Tandem+             Read-proposed    68.15    38.31          53.23
RMFCC+MPDSS              Conv-proposed    34.85    66.53          50.69
                         COMB-MPRS        61.02    59.37          60.19
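The phone error rate of Eq. (7) is obtained from an edit-distance alignment between the reference and decoded phone sequences. Below is a minimal sketch of this computation using standard Levenshtein dynamic programming; it is not the authors' scoring script, and the example sequences are made up.

# Sketch of the phone error rate of Eq. (7): E = (S + D + I) / N, where the
# substitution/deletion/insertion counts come from a Levenshtein alignment.
def phone_error_rate(reference, decoded):
    n, m = len(reference), len(decoded)
    # dp[i][j] = minimum edit cost between reference[:i] and decoded[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                          # i deletions
    for j in range(m + 1):
        dp[0][j] = j                          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != decoded[j - 1])
            dp[i][j] = min(sub,               # substitution (or match)
                           dp[i - 1][j] + 1,  # deletion
                           dp[i][j - 1] + 1)  # insertion
    return dp[n][m] / float(n)                # (S + D + I) / N

# Example: 1 substitution + 1 deletion over 4 reference phones -> 0.5 error rate.
print(phone_error_rate(["a", "m", "m", "a"], ["a", "n", "a"]))

The recognition accuracy reported in Table 6 corresponds to (1 - E) expressed as a percentage.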
Table 6 shows the phone recognition accuracies of the baseline and proposed systems. The first column of Table 6 lists the features used for developing the MPRSs, and the second column gives the names of the recognition systems. In this work, the recognition accuracy of each system is computed at an overall and at a finer level: when a system is tested without knowledge of the speech mode, the overall accuracy is reported in the fifth column; when a system is tested separately on read and conversation mode speech, the finer-level accuracies are reported in the third and fourth columns. The first and third rows contain the phone recognition accuracies of the read-baseline and read-proposed MPRSs, which are trained with read-mode data. The second and fourth rows show the performance of the conv-baseline and conv-proposed MPRSs, which are trained using the conversation-mode speech corpora. The fifth row gives the accuracy of the COMB-MPRS, which includes the read-proposed and conv-proposed MPRSs; hence, both read and conversation speech corpora are used for training its mode-specific proposed MPRSs.

From Table 6, it is observed that the average performance of the proposed MPRSs is better than that of the baseline MPRSs. The addition of the excitation source features (RMFCC and MPDSS) contributes to the improvement in phone recognition accuracy of the proposed MPRSs, as can be seen by comparing the mode-dependent proposed MPRSs with the mode-dependent baseline MPRSs: overall improvements of 4.26% and 3.43% are achieved by the read-proposed MPRS over the read-baseline MPRS and by the conv-proposed MPRS over the conv-baseline MPRS, respectively. This shows that the excitation source features carry distinct phone-related information.

In Table 6, the average accuracies of the read-proposed and conv-proposed MPRSs correspond to test data processed through the systems without knowledge of the mode of the input speech. For analyzing the accuracy at the finer level, we compute the recognition accuracy for each mode separately using the read-proposed and conv-proposed MPRSs, since prior knowledge of the modes of the test data is available in this analysis. It can be observed that the read-proposed MPRS provides a significantly better result for read mode than the conv-proposed MPRS, and for conversation mode the better result is provided by the conv-proposed MPRS. Hence, the best performances for read and conversation modes that could be achieved by the COMB-MPRS in the ideal case (100% MSMC accuracy) are 68.15% and 66.53%, respectively, and the average of these, 67.34%, represents the overall accuracy attainable by the COMB-MPRS in the ideal case. However, the achieved performance of the COMB-MPRS with an MSMC model of maximum performance 91.10% is 60.19%, which is 7.15% lower than the ideal case; this is because the MSMC model does not accurately classify the modes of some utterances. Nevertheless, on average, the COMB-MPRS performs much better than the mode-dependent baseline and proposed MPRSs: overall improvements of 11.22%, 12.93%, 6.96%, and 9.5% are achieved by the COMB-MPRS over the read-baseline, conv-baseline, read-proposed, and conv-proposed MPRSs, respectively. The reason is that the input speech in the COMB-MPRS is processed through the corresponding mode-specific MPRS.
Here, the mode of the speech utterance is identified by the speech mode classification model placed at the front end, which shows the importance of the MSMC; incorporating the MSMC is therefore essential, and in the future its accuracy should be brought closer to the ideal case. From Table 6, it is also observed that the recognition accuracy achieved by the proposed system for read mode (61.02%) is better than that for conversation mode (59.37%). This is because the multilingual SMC model at the front end of the MPRS gives better classification performance for read mode (92.28%) than for conversation mode (89.97%), as shown in Table 4. The overall results indicate that combining the mode-specific MPRSs and the MSMC into a single framework can better decode the phonetic units present in a speech utterance of any language spoken in any mode.
9. Conclusion
In this work, a system has been proposed for automatically recognizing the phonetic units present in speech utterances from multiple languages spoken in multiple modes. The considered modes of speech are conversation and read modes in four Indian languages, namely, Telugu, Kannada, Odia, and Bengali. The proposed method operates in two stages. In the first stage, the mode of the input speech is identified using a multilingual SMC system (K-T-B-O) developed using vocal tract and suprasegmental-level excitation source features. The vocal tract information provides better SMC accuracy (82%) compared to the excitation source information (69%). Experimental analysis indicates that the vocal tract and excitation source features carry distinct mode-specific information; hence, the scores of the excitation source based models are further combined with the scores obtained from the vocal tract based model to improve the SMC accuracy (91%). In the second stage, the speech utterance is routed to the MPRS of the mode identified in the previous stage, and the phonetic units present in the speech signal are determined. The phone recognition evaluation shows that the proposed two-stage system significantly outperforms the baseline mode-dependent MPRSs. In the future, we intend to investigate other speech features to improve the MSMC accuracy, since an ideal MSMC system with 100% SMC accuracy is desired in the first stage. We will also explore articulatory features along with vocal tract information for further improving the performance of the MPRS. In this work, we have focused on Indian languages; in further studies, other languages will be explored for analyzing the significance of the proposed system.
References

[1] K. Manjunath, K. S. Rao, Source and system features for phone recognition, International Journal of Speech Technology 18 (2) (2015) 257-270. doi:10.1007/s10772-014-9266-0.
[2] R. Pradeep, K. S. Rao, Deep neural networks for Kannada phoneme recognition, in: Proceedings of Ninth International Conference on Contemporary Computing (IC3), JIIT, Noida, 2016, pp. 1-6. doi:10.1109/IC3.2016.7880202.
[3] S. Scanzio, P. Laface, L. Fissore, R. Gemello, F. Mana, On the use of a multilingual neural network front-end, in: Proceedings of Ninth Annual Conference of the International Speech Communication Association, Brisbane, Australia, 2008, pp. 2711-2714.
[4] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, D. Povey, et al., Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models, in: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, 2010, pp. 4334-4337. doi:10.1109/ICASSP.2010.5495646.
[5] K. Manjunath, D. B. Jayagopi, K. S. Rao, V. Ramasubramanian, Development and analysis of multilingual phone recognition systems using Indian languages, International Journal of Speech Technology 22 (1) (2019) 157-168. doi:10.1007/s10772-018-09589-z.
[6] C. S. Kumar, V. Mohandas, H. Li, Multilingual speech recognition: A unified approach, in: Proceedings of Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 2005, pp. 3357-3360.
[7] S. V. Gangashetty, C. C. Sekhar, B. Yegnanarayana, Spotting multilingual consonant-vowel units of speech using neural network models, in: Proceedings of International Conference on Nonlinear Analyses and Algorithms for Speech Processing, Berlin, Heidelberg, 2005, pp. 303-317. doi:10.1007/11613107_27.
[8] A. Mohan, R. Rose, S. H. Ghalehjegh, S. Umesh, Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain, Speech Communication 56 (2014) 167-180. doi:10.1016/j.specom.2013.07.005.
[9] A. Batliner, R. Kompe, A. Kießling, E. Nöth, H. Niemann, Can you tell apart spontaneous and read speech if you just look at prosody?, in: Speech Recognition and Coding, Springer, 1995, pp. 321-324. doi:10.1007/978-3-642-57745-1_47.
[10] E. Blaauw, Phonetic characteristics of spontaneous and read-aloud speech, in: Phonetics and Phonology of Speaking Styles, 1991.
[11] V. Dellwo, A. Leemann, M.-J. Kolly, The recognition of read and spontaneous speech in local vernacular: The case of Zurich German, Journal of Phonetics 48 (2015) 13-28. doi:10.1016/j.wocn.2014.10.011.
[12] J. H. Hansen, Analysis and compensation of stressed and noisy speech with application to robust automatic recognition, Signal Processing 17 (3) (1989) 282. doi:10.1016/0165-1684(89)90010-8.
[13] D. Rostolland, Phonetic structure of shouted voice, Acta Acustica united with Acustica 51 (2) (1982) 80-89.
[14] D. Rostolland, Acoustic features of shouted voice, Acta Acustica united with Acustica 50 (2) (1982) 118-125.
[15] C. Zhang, J. H. Hansen, Analysis and classification of speech mode: whispered through shouted, in: Eighth Annual Conference of the International Speech Communication Association, Antwerp, Belgium, 2007, pp. 2289-2292.
[16] G. Dede, M. H. Sazlı, Speech recognition with artificial neural networks, Digital Signal Processing 20 (3) (2010) 763-768. doi:10.1016/j.dsp.2009.10.004.
[17] O. Vinyals, S. V. Ravuri, Comparing multilayer perceptron to deep belief network tandem features for robust ASR, in: International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2011, pp. 4596-4599. doi:10.1109/icassp.2011.5947378.
[18] S. G. Koolagudi, K. S. Rao, Emotion recognition from speech: a review, International Journal of Speech Technology 15 (2) (2012) 99-117. doi:10.1007/s10772-011-9125-1.
[19] S. S. Kumar, K. S. Rao, D. Pati, Phonetic and prosodically rich transcribed speech corpus in Indian languages: Bengali and Odia, in: Proceedings of International Conference on Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 2013, pp. 1-5. doi:10.1109/icsda.2013.6709901.
[20] M. Shridhara, B. K. Banahatti, L. Narthan, V. Karjigi, R. Kumaraswamy, Development of Kannada speech corpus for prosodically guided phonetic search engine, in: Proceedings of International Conference on Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 2013, pp. 1-6. doi:10.1109/icsda.2013.6709875.
[21] K. S. R. Murty, B. Yegnanarayana, Epoch extraction from speech signals, IEEE Transactions on Audio, Speech, and Language Processing 16 (8) (2008) 1602-1613. doi:10.1109/tasl.2008.2004526.
[22] B. Yegnanarayana, K. S. R. Murty, Event-based instantaneous fundamental frequency estimation from speech signals, IEEE Transactions on Audio, Speech, and Language Processing 17 (4) (2009) 614-624. doi:10.1109/tasl.2008.2012194.
[23] L. Lam, S. Suen, Application of majority voting to pattern recognition: an analysis of its behavior and performance, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 27 (5) (1997) 553-568. doi:10.1109/3468.618255.
[24] J. Benesty, J. Chen, Y. Huang, I. Cohen, Pearson correlation coefficient, in: Noise Reduction in Speech Processing, Springer, 2009, pp. 1-4. doi:10.4135/9781412953948.n342.
[25] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., The Kaldi speech recognition toolkit, in: IEEE Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584, IEEE Signal Processing Society, 2011.
[26] X. Zhang, J. Trmal, D. Povey, S. Khudanpur, Improving deep neural network acoustic models using generalized maxout networks, in: International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Florence, Italy, 2014, pp. 215-219. doi:10.1109/icassp.2014.6853589.
[27] D. Yu, L. Deng, Deep neural network-hidden Markov model hybrid systems, in: Automatic Speech Recognition, Springer, 2015, pp. 99-116. doi:10.1007/978-1-4471-5779-3_6.
[28] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, J. C. Lai, Class-based n-gram models of natural language, Computational Linguistics 18 (4) (1992) 467-479.
[29] D. Svozil, V. Kvasnicka, J. Pospichal, Introduction to multi-layer feed-forward neural networks, Chemometrics and Intelligent Laboratory Systems 39 (1) (1997) 43-62. doi:10.1016/S0169-7439(97)00061-0.
[30] B. C. Csáji, Approximation with artificial neural networks, Faculty of Sciences, Eötvös Loránd University, Hungary 24 (2001) 48.
[31] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT'2010, Springer, 2010, pp. 177-186. doi:10.1007/978-3-7908-2604-3_16.
[32] V. R. Reddy, S. Maity, K. S. Rao, Identification of Indian languages using multi-level spectral and prosodic features, International Journal of Speech Technology 16 (4) (2013) 489-511. doi:10.1007/s10772-013-9198-0.