Publication


Featured research published by Nirmesh J. Shah.


International Conference Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation | 2013

A syllable-based framework for unit selection synthesis in 13 Indian languages

Hemant A. Patil; Tanvina B. Patel; Nirmesh J. Shah; Hardik B. Sailor; Raghava Krishnan; G. R. Kasthuri; T. Nagarajan; Lilly Christina; Naresh Kumar; Veera Raghavendra; S P Kishore; S. R. M. Prasanna; Nagaraj Adiga; Sanasam Ranbir Singh; Konjengbam Anand; Pranaw Kumar; Bira Chandra Singh; S L Binil Kumar; T G Bhadran; T. Sajini; Arup Saha; Tulika Basu; K. Sreenivasa Rao; N P Narendra; Anil Kumar Sao; Rakesh Kumar; Pranhari Talukdar; Purnendu Acharyaa; Somnath Chandra; Swaran Lata

In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages, so a unified framework is required for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As the quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be carefully collected and annotated, as the database has to be built from scratch. Various criteria have to be addressed while building the database, namely, speaker selection, pronunciation variation, optimal text selection, handling of out-of-vocabulary words, and so on. The various characteristics of the voice that affect speech synthesis quality are first analysed. Next, the design of the corpus of each of the Indian languages is tabulated. The collected data is labeled at the syllable level using a semi-automatic labeling tool. Text-to-speech synthesizers are built for all 13 languages, namely, Hindi, Tamil, Marathi, Bengali, Malayalam, Telugu, Kannada, Gujarati, Rajasthani, Assamese, Manipuri, Odia and Bodo, using the same common framework. The TTS systems are evaluated using the degradation Mean Opinion Score (DMOS) and Word Error Rate (WER). An average DMOS of ≈3.0 and an average WER of about 20 % are observed across all the languages.
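The WER figure used in the evaluation above is the standard word-level edit distance, normalized by the number of reference words. A minimal illustration (the function name and whitespace tokenization are mine, not the paper's):

```python
# Word Error Rate: (substitutions + deletions + insertions) / reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 20 % thus means roughly one word in five of the reference transcription was misheard by listeners transcribing the synthesized speech.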


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Effectiveness of PLP-based phonetic segmentation for speech synthesis

Nirmesh J. Shah; Bhavik Vachhani; Hardik B. Sailor; Hemant A. Patil

In this paper, we investigate the use of a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling. In the STM framework, we propose the use of several spectral features, such as the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLPCC) and RelAtive SpecTrAl (RASTA)-based PLPCC, in addition to Mel frequency cepstral coefficients (MFCC), for the phonetic segmentation task. Evaluating the effectiveness of these segmentation algorithms requires accurate manually labeled phoneme-level data, which is not available for low-resource languages such as Gujarati (one of the official languages of India). In order to measure the effectiveness of the various segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has been built. From the subjective and objective evaluations, it is observed that the Viterbi-based and the STM-with-PLPCC-based segmentation algorithms work better than the other algorithms.
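The STM is commonly defined as the mean squared regression slope of the cepstral trajectories (MFCC, PLPCC, etc.) over a short window, with segment boundaries hypothesized at its peaks. A minimal sketch under that assumption (the window half-width K and the function name are illustrative, not the paper's exact implementation):

```python
import numpy as np

def spectral_transition_measure(cepstra: np.ndarray, K: int = 2) -> np.ndarray:
    """STM per frame: mean squared least-squares slope of each cepstral
    coefficient trajectory over a (2K+1)-frame window.
    cepstra has shape (num_frames, num_coeffs)."""
    T, D = cepstra.shape
    k = np.arange(-K, K + 1)
    denom = np.sum(k ** 2)                      # normalizer of the regression slope
    stm = np.zeros(T)
    for m in range(K, T - K):
        window = cepstra[m - K:m + K + 1]       # (2K+1, D) local trajectory
        slopes = (k[:, None] * window).sum(axis=0) / denom  # slope per coefficient
        stm[m] = np.mean(slopes ** 2)           # high where the spectrum is changing
    return stm
```

Frames where the STM peaks mark rapid spectral change and are taken as candidate phone boundaries.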


International Conference on Asian Language Processing | 2013

A Novel Gaussian Filter-Based Automatic Labeling of Speech Data for TTS System in Gujarati Language

Swati Talesara; Hemant A. Patil; Tanvina B. Patel; Hardik B. Sailor; Nirmesh J. Shah

Text-to-speech (TTS) synthesizers have proved to be an aiding tool for many visually challenged people, enabling reading through hearing feedback. TTS synthesizers are available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer has been built in the Festival speech synthesis framework. The syllable is taken as the basic unit, as Indian languages are syllabic in nature. Building the unit selection-based Gujarati TTS system requires a large labeled Gujarati corpus, and the task of labeling is the most time-consuming and tedious, requiring large manual effort. Therefore, in this work, an attempt has been made to reduce this effort by automatically generating a labeled corpus at the syllable level. To that effect, a Gaussian-based segmentation method has been proposed for automatic segmentation of speech at the syllable level. It has been observed that the percentage correctness of the labeled data is around 80 % for both male and female voices, as compared to 70 % for group delay-based labeling. In addition, the system built on the proposed approach shows better intelligibility when evaluated by a visually challenged subject. The word error rate is reduced by 5 % for the Gaussian filter-based TTS system compared to the group delay-based TTS system, and a 5 % increase is observed in correctly synthesized words. The main focus of this work is to reduce the manual effort required in building a TTS system for Gujarati, which is primarily the effort spent labeling speech data.
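The abstract does not spell out the Gaussian filter-based segmentation; a plausible minimal sketch of the general idea (smoothing a frame-level short-time energy contour with a Gaussian kernel and taking local minima of the smoothed contour as candidate syllable boundaries; the function name and sigma are illustrative assumptions) is:

```python
import numpy as np

def syllable_boundaries(energy: np.ndarray, sigma: float = 3.0) -> list:
    """Smooth a frame-level short-time energy contour with a Gaussian
    kernel, then return indices of local minima of the smoothed contour
    as candidate syllable boundaries (syllable nuclei are energy peaks)."""
    t = np.arange(-int(4 * sigma), int(4 * sigma) + 1)
    kernel = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()                      # unit-gain smoothing kernel
    smooth = np.convolve(energy, kernel, mode="same")
    # a local minimum is lower than both of its neighbours
    return [i for i in range(1, len(smooth) - 1)
            if smooth[i] < smooth[i - 1] and smooth[i] < smooth[i + 1]]
```

The width sigma trades boundary precision against robustness to spurious energy dips.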


International Conference Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation | 2013

Algorithms for speech segmentation at syllable-level for text-to-speech synthesis system in Gujarati

Hemant A. Patil; Tanvina B. Patel; Swati Talesara; Nirmesh J. Shah; Hardik B. Sailor; Bhavik Vachhani; Janki Akhani; Bhargav Kanakiya; Yashesh Gaur; Vibha Prajapati

Text-to-speech (TTS) synthesizers have been an effective tool for many visually challenged people, enabling reading through hearing feedback. TTS synthesizers built with the Festival framework require a large speech corpus, and this corpus needs to be labeled, either at the phoneme level or at the syllable level. TTS systems are mostly available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer has been built. As Indian languages are syllabic in nature, the syllable is taken as the basic speech sound unit. Building the unit selection-based Gujarati TTS system requires a large labeled Gujarati corpus, and the task of labeling is manual, time-consuming and tedious. Therefore, in this work, an attempt has been made to reduce this effort by automatically generating a nearly accurate labeled speech corpus at the syllable level. To that effect, group delay-based, spectral transition measure (STM)-based and Gaussian filter-based segmentation methods are presented and their performances are compared. It has been observed that the percentage correctness of the labeled data is around 83 % for both male and female voices, as compared to 70 % for group delay-based labeling and 78 % for STM-based labeling. In addition, the systems built from the label files generated by the above methods were evaluated by a visually challenged subject. The word correctness rate increases by 5 % (3 %) and 10 % (12 %) for the Gaussian filter-based TTS system as compared to the group delay-based and STM-based systems built on the female (male) voice. Similarly, there is an overall reduction in the word error rate (WER) for the Gaussian-based approach of 8 % (2 %) and 6 % (-5 %) as compared to the group delay-based and STM-based systems built on the female (male) voice.


9th ISCA Speech Synthesis Workshop | 2016

Novel Pre-processing using Outlier Removal in Voice Conversion.

Sushant V. Rao; Nirmesh J. Shah; Hemant A. Patil

Voice conversion (VC) modifies a speech utterance spoken by a source speaker to make it sound as if a target speaker is speaking. Gaussian Mixture Model (GMM)-based VC is a state-of-the-art method: it finds the mapping function by modeling the joint density of the source and target speakers with a GMM and converts spectral features frame-wise. As in any real dataset, the spectral parameters contain a few points that are inconsistent with the rest of the data, called outliers. Until now, there has been very little literature on the effect of outliers in voice conversion. In this paper, we explore the effect of removing outliers as a pre-processing step in voice conversion. In order to remove these outliers, we use the score distance, computed from the scores estimated using Robust Principal Component Analysis (ROBPCA). The outliers are determined using a cut-off value based on the degrees of freedom in a chi-squared distribution. They are then removed from the training dataset, and a GMM is trained on the least outlying points. This pre-processing step can be applied to various methods. Experimental results indicate a clear improvement in both the objective (8 %) and the subjective (4 % for MOS and 5 % for XAB) results.
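A sketch of the described pre-processing, with classical PCA standing in for ROBPCA (the actual method uses robust estimates of the principal axes; the function name, component count and quantile are illustrative assumptions):

```python
import numpy as np
from scipy.stats import chi2

def score_distance_outliers(X: np.ndarray, k: int = 2, quantile: float = 0.975):
    """Project joint source-target spectral frames onto k principal
    components, compute each frame's score distance, and flag frames
    beyond a chi-squared cut-off as outliers."""
    Xc = X - X.mean(axis=0)
    # principal axes and component variances via SVD of the centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                       # (n, k) PCA scores
    lam = (s[:k] ** 2) / (len(X) - 1)            # variance of each score column
    sd = np.sqrt(np.sum(scores ** 2 / lam, axis=1))   # score distance per frame
    cutoff = np.sqrt(chi2.ppf(quantile, df=k))   # cut-off from chi2 with k d.o.f.
    return sd > cutoff                           # boolean outlier mask
```

Frames with `True` in the mask would be dropped before fitting the joint-density GMM.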


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

Novel Amplitude Scaling method for bilinear frequency Warping-based Voice Conversion

Nirmesh J. Shah; Hemant A. Patil

In Frequency Warping (FW)-based Voice Conversion (VC), the source spectrum is modified to match the frequency axis of the target spectrum, followed by Amplitude Scaling (AS) to compensate for the amplitude differences between the warped spectrum and the actual target spectrum. In this paper, we propose a novel AS technique which linearly transfers the amplitude of the frequency-warped spectrum using the knowledge of a Gaussian Mixture Model (GMM)-based converted spectrum, without adding any spurious peaks. The novelty of the proposed approach lies in avoiding the perceptual impression of a wrong formant location (caused by the perfect-match assumption between the warped spectrum and the actual target spectrum in the state-of-the-art AS method), which deteriorates converted voice quality. From subjective analysis, it is evident that the proposed system is preferred 33.81 % and 12.37 % more often than the GMM-based and state-of-the-art AS methods, respectively, for voice quality. Consistent with the quality-versus-identity trade-offs observed by other studies in the literature, speaker identity conversion was preferred 0.73 % more and 9.09 % less often than the GMM-based and state-of-the-art AS-based methods, respectively.
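The bilinear frequency warping underlying FW-based VC can be sketched as follows; the warping parameter alpha and the interpolation-based resampling are illustrative, not the paper's exact implementation:

```python
import numpy as np

def bilinear_warp(omega: np.ndarray, alpha: float) -> np.ndarray:
    """Bilinear (all-pass) warping of normalized frequencies in [0, pi];
    alpha > 0 stretches low frequencies, alpha = 0 is the identity."""
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def warp_spectrum(spectrum: np.ndarray, alpha: float) -> np.ndarray:
    """Resample a magnitude spectrum along the warped frequency axis.
    The inverse of a bilinear warp with alpha is the warp with -alpha,
    so each output bin reads the source spectrum at the inverse-warped
    frequency."""
    n = len(spectrum)
    omega = np.linspace(0.0, np.pi, n)
    return np.interp(bilinear_warp(omega, -alpha), omega, spectrum)
```

After warping, amplitude scaling corrects the residual level differences between the warped source spectrum and the target spectrum, which is the step the paper's proposed technique addresses.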


International Symposium on Chinese Spoken Language Processing | 2014

Effectiveness of fractal dimension for ASR in low resource language

Mohammadi Zaki; Nirmesh J. Shah; Hemant A. Patil

We propose to use multiscale fractal dimension (MFD) as components of feature vectors for automatic speech recognition (ASR), especially in low-resource languages. Speech, which is known to be a nonlinear process, can be efficiently represented by extracting nonlinear properties, such as the fractal dimension (FD), from a speech segment. During speech production, vortices (generated due to the presence of separated airflow) may travel along the vocal tract and excite vocal tract resonators at the epiglottis, velum, palate, teeth, lips, etc. By Kolmogorov's law, the gradient in energy levels between these vortices produces turbulence. This ruggedness, and in effect the embedded features of the different phoneme classes, can be captured by the invariance property of the FD. Furthermore, speech is a multifractal, which justifies the use of multiscale fractal dimension as feature components for speech. In this paper, we describe the multifractal nature of the speech signal and use this property for the automatic phonetic segmentation task. The results show a significant decrease in % EER (≈ 4.2 % from traditional MFCC baseline features and ≈ 2.5 % from MFCC appended with 1-D fractal dimension). The DET curves clearly show an improvement in performance with the new multiscale fractal dimension-based features for the low-resource language under consideration.
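As one concrete FD estimator for a 1-D signal, Higuchi's method is shown below purely as an illustration; the paper's multiscale FD features differ in detail, and the parameter kmax is an assumption:

```python
import numpy as np

def higuchi_fd(x: np.ndarray, kmax: int = 8) -> float:
    """Higuchi's fractal dimension estimate of a 1-D signal: the mean
    curve length L(k) of subsampled versions of the signal scales as
    k**(-D), so D is the negative slope of log L(k) vs log k."""
    N = len(x)
    ks = range(1, kmax + 1)
    L = []
    for k in ks:
        lk = []
        for m in range(k):                       # k phase-shifted subsequences
            idx = np.arange(m, N, k)
            # normalized curve length of the subsampled signal
            lk.append(np.abs(np.diff(x[idx])).sum()
                      * (N - 1) / ((len(idx) - 1) * k))
        L.append(np.mean(lk) / k)
    slope, _ = np.polyfit(np.log(list(ks)), np.log(L), 1)
    return -slope
```

A smooth signal gives D close to 1, while a rugged, noise-like segment gives D close to 2; it is this sensitivity to ruggedness that the MFD features exploit.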


International Symposium on Chinese Spoken Language Processing | 2014

Deterministic annealing EM algorithm for developing TTS system in Gujarati

Nirmesh J. Shah; Hemant A. Patil; Maulik C. Madhavi; Hardik B. Sailor; Tanvina B. Patel

The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from speech recognition to speech synthesis. In this work, we have applied the HMM-based Speech Synthesis System (HTS) method to the Gujarati language; adaptation and evaluation of HTS for Gujarati have been done here. The HTS system built using Gujarati data is evaluated in terms of naturalness and speech intelligibility. In addition, a conventional EM algorithm-based HTS and the recently proposed Deterministic Annealing EM (DAEM) algorithm-based HTS have been applied to Gujarati (a low-resource language) and compared. It has been found that HTS in Gujarati has very high intelligibility, and an AB test verified that the DAEM-based HTS was preferred over the EM-based HTS developed for Gujarati 70.5 % of the time.


European Signal Processing Conference (EUSIPCO) | 2015

Effectiveness of multiscale fractal dimension for improvement of frame classification rate

Mohammadi Zaki; Nirmesh J. Shah; Hemant A. Patil

We propose to use multiscale fractal dimension (FD)-based features for the phoneme classification task at the frame level. During speech production, turbulence is created, and hence vortices (generated due to the presence of separated airflow) may travel along the vocal tract and excite vocal tract resonators. This turbulence, and in effect the embedded features of the different phoneme classes, can be captured by the invariance property of the multiscale FD. To capture complementary information, feature-level fusion of the proposed feature with state-of-the-art Mel Frequency Cepstral Coefficients (MFCC) is attempted and found to be effective. In particular, single-hidden-layer neural nets were trained to compute the frame classification rate. The proposed feature reduced the error rate by over 1.6 % from MFCC features on the TIMIT database. This is supported by a significant reduction in % EER (i.e., 0.327 % to 4.795 %).


International Conference on Asian Language Processing | 2014

Influence of various asymmetrical contextual factors for TTS in a low resource language

Nirmesh J. Shah; Mohammadi Zaki; Hemant A. Patil

The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from speech recognition to speech synthesis. In this paper, we have applied the HMM-based Speech Synthesis (HTS) method to Gujarati (one of the official languages of India); adaptation and evaluation of HTS for Gujarati have been done here. In addition, to understand the influence of asymmetrical contextual factors on the quality of synthesized speech, we have conducted a series of experiments. The different HTS systems built for Gujarati speech using various asymmetrical contextual factors are evaluated in terms of naturalness and speech intelligibility. From the experimental results, it is evident that when more weight is given to the left phonemes in the asymmetrical contextual factors, HTS performance improves compared to the conventional symmetrical contextual factors, for both the triphone and the pentaphone case. Furthermore, we achieved the best performance for Gujarati HTS with left-left-left-centre-right (i.e., LLLCR) contextual factors.
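The asymmetrical contextual factors can be illustrated by how the context-dependent units are formed: LLLCR attaches three left phonemes and one right phoneme to each centre phone, versus the symmetric one-left-one-right triphone. A sketch (the padding symbol "sil" and the function are illustrative, not the paper's label format):

```python
def context_units(phones, left=3, right=1):
    """Build asymmetrical context-dependent units: each unit is
    (left context ..., centre phone, ... right context), padded with
    a silence symbol at the utterance edges. left=3, right=1 yields
    the LLLCR factors; left=right=1 yields ordinary triphones."""
    pad = ["sil"]
    seq = pad * left + list(phones) + pad * right
    return [tuple(seq[i - left:i]) + (seq[i],) + tuple(seq[i + 1:i + 1 + right])
            for i in range(left, left + len(phones))]
```

The experiments above amount to varying `left` and `right` and measuring the effect on naturalness and intelligibility of the resulting HTS voices.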

Collaboration


Dive into Nirmesh J. Shah's collaboration.

Top Co-Authors

Hemant A. Patil (Dhirubhai Ambani Institute of Information and Communication Technology)
Hardik B. Sailor (Dhirubhai Ambani Institute of Information and Communication Technology)
Tanvina B. Patel (Dhirubhai Ambani Institute of Information and Communication Technology)
Mohammadi Zaki (Dhirubhai Ambani Institute of Information and Communication Technology)
Maulik C. Madhavi (Dhirubhai Ambani Institute of Information and Communication Technology)
Swati Talesara (Dhirubhai Ambani Institute of Information and Communication Technology)
Anil Kumar Sao (Indian Institute of Technology Mandi)
Arup Saha (Centre for Development of Advanced Computing)
Bhargav Kanakiya (Dhirubhai Ambani Institute of Information and Communication Technology)