Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hardik B. Sailor is active.

Publication


Featured research published by Hardik B. Sailor.


International Conference Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation | 2013

A syllable-based framework for unit selection synthesis in 13 Indian languages

Hemant A. Patil; Tanvina B. Patel; Nirmesh J. Shah; Hardik B. Sailor; Raghava Krishnan; G. R. Kasthuri; T. Nagarajan; Lilly Christina; Naresh Kumar; Veera Raghavendra; S P Kishore; S. R. M. Prasanna; Nagaraj Adiga; Sanasam Ranbir Singh; Konjengbam Anand; Pranaw Kumar; Bira Chandra Singh; S L Binil Kumar; T G Bhadran; T. Sajini; Arup Saha; Tulika Basu; K. Sreenivasa Rao; N P Narendra; Anil Kumar Sao; Rakesh Kumar; Pranhari Talukdar; Purnendu Acharyaa; Somnath Chandra; Swaran Lata

In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages; a unified framework is therefore required for building TTS systems for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As the quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be carefully collected and annotated, as the database has to be built from scratch. Various criteria have to be addressed while building the database, namely speaker selection, pronunciation variation, optimal text selection, handling of out-of-vocabulary words, and so on. The various characteristics of the voice that affect speech synthesis quality are first analysed. Next, the design of the corpus for each of the Indian languages is tabulated. The collected data are labeled at the syllable level using a semi-automatic labeling tool. Text-to-speech synthesizers are built for all 13 languages, namely Hindi, Tamil, Marathi, Bengali, Malayalam, Telugu, Kannada, Gujarati, Rajasthani, Assamese, Manipuri, Odia, and Bodo, using the same common framework. The TTS systems are evaluated using degradation Mean Opinion Score (DMOS) and Word Error Rate (WER). An average DMOS score of ≈3.0 and an average WER of about 20% are observed across all the languages.
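As a rough illustration of the WER metric used above, here is a minimal sketch of the standard edit-distance-based word error rate (the function name and tokenization are illustrative, not taken from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance between the two word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("the cat sat", "the mat sat") == 1/3
```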


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Effectiveness of PLP-based phonetic segmentation for speech synthesis

Nirmesh J. Shah; Bhavik Vachhani; Hardik B. Sailor; Hemant A. Patil

In this paper, the use of a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling is attempted. In the STM framework, we propose the use of several spectral features, such as the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLPCC), and RelAtive SpecTrAl (RASTA)-based PLPCC, in addition to Mel frequency cepstral coefficients (MFCC), for the phonetic segmentation task. To evaluate the effectiveness of these segmentation algorithms, we require accurate manually labeled phoneme-level data, which is not available for low-resource languages such as Gujarati (one of the official languages of India). In order to measure the effectiveness of the various segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has been built. From the subjective and objective evaluations, it is observed that the Viterbi-based and the STM with PLPCC-based segmentation algorithms work better than the other algorithms.
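As a rough illustration of the spectral transition measure, the sketch below fits a least-squares slope to each cepstral-coefficient trajectory over a short window, takes the mean squared slope per frame as the STM, and treats peaks of the STM contour as candidate phone boundaries. The window length, threshold, and peak-picking rule here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def spectral_transition_measure(cepstra: np.ndarray, half_win: int = 3) -> np.ndarray:
    """cepstra: (num_frames, num_coeffs) MFCC/PLPCC/CFCC frames.
    Returns one STM value per frame (higher = stronger spectral transition)."""
    num_frames, _ = cepstra.shape
    t = np.arange(-half_win, half_win + 1)                 # regression window offsets
    denom = np.sum(t ** 2)
    stm = np.zeros(num_frames)
    for n in range(half_win, num_frames - half_win):
        window = cepstra[n - half_win:n + half_win + 1]    # (2*half_win+1, num_coeffs)
        slopes = t @ window / denom                        # least-squares slope per coefficient
        stm[n] = np.mean(slopes ** 2)
    return stm

def candidate_boundaries(stm: np.ndarray, threshold: float) -> np.ndarray:
    """Frames that are local maxima of the STM contour above a threshold."""
    peaks = (stm[1:-1] > stm[:-2]) & (stm[1:-1] > stm[2:]) & (stm[1:-1] > threshold)
    return np.where(peaks)[0] + 1
```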


International Conference on Asian Language Processing (IALP) | 2013

A Novel Gaussian Filter-Based Automatic Labeling of Speech Data for TTS System in Gujarati Language

Swati Talesara; Hemant A. Patil; Tanvina B. Patel; Hardik B. Sailor; Nirmesh J. Shah

A text-to-speech (TTS) synthesizer has proved to be an aiding tool for many visually challenged people, allowing them to read through hearing feedback. TTS synthesizers are available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer has been built. This TTS system has been built in the Festival speech synthesis framework. The syllable is taken as the basic unit in building the Gujarati TTS synthesizer, as Indian languages are syllabic in nature. Building the unit-selection-based Gujarati TTS system requires a large labeled Gujarati corpus. The task of labeling is the most time-consuming and tedious, and requires large manual effort. Therefore, in this work, an attempt has been made to reduce this effort by automatically generating a labeled corpus at the syllable level. To that effect, a Gaussian filter-based segmentation method has been proposed for automatic segmentation of speech at the syllable level. It has been observed that the percentage correctness of the labeled data is around 80% for both male and female voices, as compared to 70% for group delay-based labeling. In addition, the system built on the proposed approach shows better intelligibility when evaluated by a visually challenged subject. The word error rate is reduced by 5% for the Gaussian filter-based TTS system compared to the group delay-based TTS system, and a 5% increase is observed in correctly synthesized words. The main focus of this work is to reduce the manual effort required in building a TTS system for Gujarati, which is primarily the manual effort required in labeling speech data.
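A minimal sketch of the general idea behind Gaussian filter-based syllable segmentation is given below: the short-term energy contour is smoothed with a Gaussian window wide enough to suppress intra-syllable fluctuations, and the remaining energy valleys are taken as syllable boundaries. The filter width, frame size, and valley-picking rule are illustrative assumptions and may differ from the paper's design:

```python
import numpy as np

def syllable_boundaries(signal: np.ndarray, sr: int,
                        frame_ms: float = 10.0, sigma_frames: float = 5.0) -> np.ndarray:
    """Return frame indices of candidate syllable boundaries (hypothetical sketch)."""
    frame_len = int(sr * frame_ms / 1000)
    num_frames = len(signal) // frame_len
    frames = signal[:num_frames * frame_len].reshape(num_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)               # short-term energy contour

    # Gaussian smoothing: wide enough to keep roughly one energy hump per syllable
    half = int(4 * sigma_frames)
    t = np.arange(-half, half + 1)
    gauss = np.exp(-0.5 * (t / sigma_frames) ** 2)
    gauss /= gauss.sum()
    smooth = np.convolve(energy, gauss, mode="same")

    # valleys of the smoothed contour are taken as syllable boundaries
    valleys = (smooth[1:-1] < smooth[:-2]) & (smooth[1:-1] < smooth[2:])
    return np.where(valleys)[0] + 1
```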


International Conference Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation | 2013

Algorithms for speech segmentation at syllable-level for text-to-speech synthesis system in Gujarati

Hemant A. Patil; Tanvina B. Patel; Swati Talesara; Nirmesh J. Shah; Hardik B. Sailor; Bhavik Vachhani; Janki Akhani; Bhargav Kanakiya; Yashesh Gaur; Vibha Prajapati

A text-to-speech (TTS) synthesizer has been an effective tool for many visually challenged people, allowing them to read through hearing feedback. TTS synthesizers built with the Festival framework require a large speech corpus. This corpus needs to be labeled, either at the phoneme level or at the syllable level. TTS systems are mostly available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer has been built. As Indian languages are syllabic in nature, the syllable is taken as the basic speech sound unit. Building the unit-selection-based Gujarati TTS system requires a large labeled Gujarati corpus. The task of labeling is manual, time-consuming, and tedious. Therefore, in this work, an attempt has been made to reduce this effort by automatically generating a nearly accurate labeled speech corpus at the syllable level. To that effect, group delay-based, spectral transition measure (STM)-based, and Gaussian filter-based segmentation methods are presented and their performances are compared. It has been observed that the percentage correctness of the labeled data is around 83% for the Gaussian filter-based method for both male and female voices, as compared to 70% for group delay-based labeling and 78% for STM-based labeling. In addition, the systems built from the labeled files generated by the above methods were evaluated by a visually challenged subject. The word correctness rate increased by 5% (3%) and 10% (12%) for the Gaussian filter-based TTS system as compared to the group delay-based and STM-based systems built on the female (male) voice. Similarly, there is an overall reduction in the word error rate (WER) for the Gaussian-based approach of 8% (2%) and 6% (-5%) as compared to the group delay-based and STM-based systems built on the female (male) voice.


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

Unsupervised Deep Auditory Model Using Stack of Convolutional RBMs for Speech Recognition.

Hardik B. Sailor; Hemant A. Patil

Recently, we have proposed an unsupervised filterbank learning model based on the Convolutional RBM (ConvRBM). This model is able to learn auditory-like subband filters using speech signals as input. In this paper, we propose a two-layer Unsupervised Deep Auditory Model (UDAM) built by stacking two ConvRBMs. The first-layer ConvRBM learns a filterbank from speech signals and hence represents early auditory processing. The hidden units' responses of the first layer are pooled into a short-time spectral representation, which is used to train another ConvRBM in a greedy layer-wise manner. The second-layer ConvRBM, trained on this spectral representation, learns Temporal Receptive Fields (TRF), which represent temporal properties of the auditory cortex in the human brain. To show the effectiveness of the proposed UDAM, speech recognition experiments were conducted on the TIMIT and AURORA 4 databases. We show that features extracted from the second layer, when added to the filterbank features of the first layer, perform better than the first-layer features alone (and their delta features as well). For both databases, our proposed two-layer deep auditory features improve speech recognition performance over Mel filterbank features. Further improvements can be achieved by system-level combination of both UDAM features and Mel filterbank features.
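A minimal sketch of the two-layer forward pass described above, using NumPy with randomly initialized stand-in filters (in the actual model these weights come from unsupervised ConvRBM training; the filter lengths, pooling, and nonlinearity details below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer 1: subband filters applied to the waveform (stand-ins for learned ConvRBM filters)
num_subbands, filt_len = 40, 128
W1 = rng.standard_normal((num_subbands, filt_len)) * 0.01

# Layer 2: temporal receptive fields over the pooled spectral representation
num_trf, trf_len = 20, 9
W2 = rng.standard_normal((num_trf, num_subbands, trf_len)) * 0.01

def udam_features(waveform: np.ndarray, hop: int = 160):
    # Layer 1: convolve, rectify, then pool each subband over short windows
    responses = np.stack([np.maximum(np.convolve(waveform, w, mode="valid"), 0.0)
                          for w in W1])                       # (subbands, samples)
    num_frames = responses.shape[1] // hop
    pooled = responses[:, :num_frames * hop].reshape(num_subbands, num_frames, hop)
    spectral = np.log1p(pooled.mean(axis=2))                  # short-time spectral representation

    # Layer 2: full-height convolution over time -> temporal receptive field responses
    frames_out = spectral.shape[1] - trf_len + 1
    layer2 = np.zeros((num_trf, frames_out))
    for k in range(num_trf):
        for t in range(frames_out):
            layer2[k, t] = max(np.sum(W2[k] * spectral[:, t:t + trf_len]), 0.0)
    return spectral, layer2

spec, trf = udam_features(rng.standard_normal(16000))  # one second of fake audio at 16 kHz
```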


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Novel Unsupervised Auditory Filterbank Learning Using Convolutional RBM for Speech Recognition

Hardik B. Sailor; Hemant A. Patil

To learn auditory filterbanks, we have recently proposed an unsupervised learning model based on a convolutional restricted Boltzmann machine (RBM) with rectified linear units. In this paper, the theory and training algorithm of our proposed model and a detailed analysis of the learned filterbank are presented. Learning the model with different databases shows that it is able to learn cochlear-like impulse responses that are localized in the frequency domain. An auditory-like scale obtained from filterbanks learned from clean and noisy datasets resembles the Mel scale, which is known to mimic perceptually relevant aspects of speech. We have experimented with both cepstral (denoted as ConvRBM-CC) and filterbank features (denoted as ConvRBM-BANK). On a large-vocabulary continuous speech recognition task, we achieved a relative improvement of 7.21-17.8% in word error rate (WER) compared to Mel frequency cepstral coefficient (MFCC) features and 1.35-6.82% compared to Mel filterbank (FBANK) features. On the AURORA 4 multicondition training database, a relative improvement in WER of 4.8-13.65% was achieved using a hybrid Deep Neural Network-Hidden Markov Model (DNN-HMM) system with ConvRBM-CC features. Using ConvRBM-BANK features, we achieve an absolute reduction of 1.25-3.85% in WER on the AURORA 4 test sets compared to FBANK features. A context-dependent DNN-HMM system further improves performance, with a relative improvement of 3.6-4.6% on average for the bigram 5k and trigram 5k language models. Hence, our proposed learned filterbank performs better than traditional MFCC and Mel filterbank features for both clean and multicondition automatic speech recognition (ASR) tasks. A system combination of ConvRBM-BANK and FBANK features further improves performance in all ASR tasks. Cross-domain experiments, where subband filters trained on one database are used for the ASR task of another database, show that the model learns generalized representations of speech signals.
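A minimal sketch of one contrastive-divergence (CD-1) update for a 1-D convolutional RBM with rectified linear hidden units, roughly in the spirit of the model described above. The noisy-ReLU sampling, Gaussian visible units, and single learning-rate update below are simplifying assumptions; the paper's exact training recipe (filter lengths, pooling, annealing, etc.) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def cd1_update(v: np.ndarray, W: np.ndarray, b: np.ndarray, lr: float = 1e-4):
    """One CD-1 step. v: (N,) waveform segment; W: (K, M) subband filters; b: (K,) hidden biases."""
    K, M = W.shape

    def hidden(vis):
        # pre-activation via correlation of the visible signal with each filter
        pre = np.stack([np.correlate(vis, W[k], mode="valid") + b[k] for k in range(K)])
        # noisy rectified-linear sampling (Nair & Hinton style approximation)
        sig = 1.0 / (1.0 + np.exp(-np.clip(pre, -30, 30)))
        return np.maximum(pre + rng.standard_normal(pre.shape) * np.sqrt(sig), 0.0)

    h_pos = hidden(v)
    # reconstruction of Gaussian visible units: sum of full convolutions with the filters
    v_neg = np.sum([np.convolve(h_pos[k], W[k], mode="full") for k in range(K)], axis=0)
    h_neg = hidden(v_neg)

    for k in range(K):
        pos_grad = np.correlate(v, h_pos[k], mode="valid")      # data-dependent statistics, shape (M,)
        neg_grad = np.correlate(v_neg, h_neg[k], mode="valid")  # model-dependent statistics, shape (M,)
        W[k] += lr * (pos_grad - neg_grad) / len(h_pos[k])
        b[k] += lr * (h_pos[k].mean() - h_neg[k].mean())
    return W, b
```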


International Symposium on Chinese Spoken Language Processing (ISCSLP) | 2014

Fusion of magnitude and phase-based features for objective evaluation of TTS voice

Hardik B. Sailor; Hemant A. Patil

This paper analyzes distance-based objective measures, which are the generally used objective measures for the evaluation of Text-to-Speech (TTS) systems. We discuss some aspects of evaluating the quality of synthesized speech. Some of the limitations and issues of subjective evaluation are discussed, and the importance of objective measures is presented. A traditional objective measure using the Dynamic Time Warping (DTW) distance is used in this work. We have used magnitude- and phase-based features as well as auditory features to check the effectiveness of objective measures for predicting the quality of a TTS voice. In particular, Mel Frequency Cepstral Coefficient (MFCC) features and phase-based Modified Group Delay Cepstral Coefficients (MGDCC) alone do not correlate well with subjective scores. However, feature-level fusion of MFCC and MGDCC gives better correlation than all other feature sets. With this fusion, we obtained correlation coefficients of -0.3 and -0.32 for the Blizzard Challenge 2010 and 2011 databases, respectively. The results also show the significance of phase-based features for objective measures when used along with magnitude-based features. The experimental results show that distance-based measures still do not work well on the Blizzard Challenge databases and that more general objective measures are needed for measuring the quality of a TTS voice.
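A minimal sketch of such a distance-based objective measure: frame sequences of natural and synthesized speech are aligned with DTW, the accumulated distance (normalized by path length) is taken as the objective score, and the scores are correlated with subjective ratings. The plain Euclidean local cost and the (n + m) normalization are illustrative assumptions; the paper also fuses MFCC with MGDCC at the feature level before computing the distance:

```python
import numpy as np

def dtw_distance(ref_feats: np.ndarray, syn_feats: np.ndarray) -> float:
    """DTW-aligned average Euclidean distance between two (frames, dims) feature matrices."""
    n, m = len(ref_feats), len(syn_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref_feats[i - 1] - syn_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)   # rough path-length normalization

def correlation_with_mos(distances, mos_scores) -> float:
    """Pearson correlation between objective distances and subjective MOS scores."""
    return float(np.corrcoef(distances, mos_scores)[0, 1])
```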


Journal of the Acoustical Society of America | 2017

Auditory feature representation using convolutional restricted Boltzmann machine and Teager energy operator for speech recognition

Hardik B. Sailor; Hemant A. Patil

In this letter, the authors propose an auditory feature representation technique in which the filterbank is learned using an annealing-dropout convolutional restricted Boltzmann machine (ConvRBM) and noise-robust energy estimation is performed using the Teager energy operator (TEO). The TEO is applied to each subband of the ConvRBM filterbank and the outputs are then pooled to obtain short-term spectral features. Experiments on the AURORA 4 database show that the proposed features perform better than Mel filterbank features. Relative improvements of 2.59%-11.63% and 1.26%-6.87% in word error rate are achieved using time-delay neural network and bidirectional long short-term memory models, respectively.
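As a rough sketch of the energy estimation step, the discrete Teager energy operator applied to each subband signal is psi[n] = x[n]^2 - x[n-1]*x[n+1]; the framing, averaging, and log compression below are illustrative choices for the pooling, not necessarily the paper's exact settings:

```python
import numpy as np

def teager_energy(x: np.ndarray) -> np.ndarray:
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]           # simple edge handling
    return psi

def teo_subband_features(subbands: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """subbands: (num_subbands, num_samples) filterbank outputs.
    Returns (num_subbands, num_frames) log average Teager energy per frame."""
    num_frames = 1 + (subbands.shape[1] - frame_len) // hop
    feats = np.zeros((subbands.shape[0], num_frames))
    for k, band in enumerate(subbands):
        psi = np.abs(teager_energy(band))
        for t in range(num_frames):
            feats[k, t] = np.log(psi[t * hop:t * hop + frame_len].mean() + 1e-8)
    return feats
```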


European Signal Processing Conference (EUSIPCO) | 2016

Unsupervised learning of temporal receptive fields using convolutional RBM for ASR task

Hardik B. Sailor; Hemant A. Patil

There has been significant research attention on unsupervised representation learning for speech processing applications. In this paper, we investigate unsupervised representation learning using a Convolutional Restricted Boltzmann Machine (ConvRBM) with rectified units for the speech recognition task. A temporal modulation representation is learned using the log Mel-spectrogram as input to the ConvRBM. DNNs were trained separately on ConvRBM modulation features and on filterbank spectral features, and system combination was then used. With our proposed setup, ConvRBM features were applied to the speech recognition task on the TIMIT and WSJ0 databases. On the TIMIT database, we achieved a relative improvement of 5.93% in PER on the test set compared to filterbank features alone. For the WSJ0 database, we achieved relative improvements of 3.63-4.3% in WER on the test sets compared to filterbank features. Hence, a DNN trained on ConvRBM features with rectified units provides significant complementary information in terms of temporal modulation features.


International Conference on BioSignal Analysis, Processing and Systems (ICBAPS) | 2015

Spectro-temporal analysis of HIE and asthma infant cries using auditory spectrogram

Anshu Chittora; Hemant A. Patil; Hardik B. Sailor

In this paper, the auditory spectrogram is proposed for the analysis of HIE (hypoxic-ischemic encephalopathy) and asthma infant cries. The auditory spectrogram represents a two-dimensional (2-D) pattern of neural activity distributed along a logarithmic frequency axis. Features are derived from the auditory spectrograms of each class and are then used to train a support vector machine (SVM) classifier. The effectiveness of the proposed features is shown by applying them to the classification of these pathologies. The classification accuracy achieved with an SVM classifier with a radial basis function (RBF) kernel is 87.67%. The classification performance has been compared with the state-of-the-art method, i.e., Mel Frequency Cepstral Coefficients (MFCC); MFCC features give 86.13% classification accuracy. Fusion of the proposed features with the MFCC features further improves the classification accuracy to 88.54%. The high classification accuracy of the auditory spectrogram can be attributed to its ability to retain both formant frequencies and low-frequency harmonics.
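A minimal sketch of the classification stage using scikit-learn (the auditory-spectrogram feature extraction itself is not reproduced; X and y below are placeholder arrays standing in for per-recording feature vectors and pathology labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: rows = cry recordings, columns = auditory-spectrogram features; labels 0 = HIE, 1 = asthma
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF-kernel SVM, with feature standardization as is usual before an RBF kernel
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))
```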

Collaboration


Dive into Hardik B. Sailor's collaborations.

Top Co-Authors

Hemant A. Patil, Dhirubhai Ambani Institute of Information and Communication Technology
Nirmesh J. Shah, Dhirubhai Ambani Institute of Information and Communication Technology
Tanvina B. Patel, Dhirubhai Ambani Institute of Information and Communication Technology
Madhu R. Kamble, Dhirubhai Ambani Institute of Information and Communication Technology
Maulik C. Madhavi, Dhirubhai Ambani Institute of Information and Communication Technology
Dharmesh M. Agrawal, Dhirubhai Ambani Institute of Information and Communication Technology
Swati Talesara, Dhirubhai Ambani Institute of Information and Communication Technology
Anil Kumar Sao, Indian Institute of Technology Mandi
Anshu Chittora, Dhirubhai Ambani Institute of Information and Communication Technology