
Publication


Featured research published by Tanvina B. Patel.


International Conference Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation | 2013

A syllable-based framework for unit selection synthesis in 13 Indian languages

Hemant A. Patil; Tanvina B. Patel; Nirmesh J. Shah; Hardik B. Sailor; Raghava Krishnan; G. R. Kasthuri; T. Nagarajan; Lilly Christina; Naresh Kumar; Veera Raghavendra; S P Kishore; S. R. M. Prasanna; Nagaraj Adiga; Sanasam Ranbir Singh; Konjengbam Anand; Pranaw Kumar; Bira Chandra Singh; S L Binil Kumar; T G Bhadran; T. Sajini; Arup Saha; Tulika Basu; K. Sreenivasa Rao; N P Narendra; Anil Kumar Sao; Rakesh Kumar; Pranhari Talukdar; Purnendu Acharyaa; Somnath Chandra; Swaran Lata

In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages, so a unified framework is required for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As the quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be carefully collected and annotated, as the database has to be built from scratch. Various criteria have to be addressed while building the database, namely, speaker selection, pronunciation variation, optimal text selection, handling of out-of-vocabulary words, and so on. The various characteristics of the voice that affect speech synthesis quality are first analysed. Next, the design of the corpus of each of the Indian languages is tabulated. The collected data is labeled at the syllable level using a semiautomatic labeling tool. Text-to-speech synthesizers are built for all 13 languages, namely, Hindi, Tamil, Marathi, Bengali, Malayalam, Telugu, Kannada, Gujarati, Rajasthani, Assamese, Manipuri, Odia and Bodo, using the same common framework. The TTS systems are evaluated using the degradation Mean Opinion Score (DMOS) and Word Error Rate (WER). An average DMOS of ≈3.0 and an average WER of about 20% are observed across all the languages.
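Since the abstract centers on unit-selection synthesis, a minimal sketch of the underlying search may help: for each target syllable, one candidate unit is chosen from the database so that the summed target cost plus join (concatenation) cost is minimized via Viterbi dynamic programming. The candidate sets, cost values, and the `join_cost` callable below are hypothetical illustrations, not the consortium's actual cost functions.

```python
import numpy as np

def select_units(target_costs, join_cost):
    """Viterbi search over candidate units.
    target_costs: list (one entry per syllable position) of 1-D arrays,
                  target_costs[i][k] = cost of candidate k at position i.
    join_cost(j, k): cost of concatenating candidate j with candidate k."""
    best = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for i in range(1, len(target_costs)):
        prev = best[-1]
        cur = np.asarray(target_costs[i], dtype=float)
        costs = np.empty(len(cur))
        ptr = np.empty(len(cur), dtype=int)
        for k in range(len(cur)):
            trans = [prev[j] + join_cost(j, k) for j in range(len(prev))]
            ptr[k] = int(np.argmin(trans))
            costs[k] = cur[k] + min(trans)
        best.append(costs)
        back.append(ptr)
    # trace the cheapest path back from the last position
    path = [int(np.argmin(best[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path)), float(np.min(best[-1]))

# Toy usage: three syllables, two database candidates each; identical
# consecutive candidate indices join for free, mismatched ones cost 0.5.
path, total = select_units([[0, 1], [1, 0], [0, 1]],
                           lambda j, k: 0.0 if j == k else 0.5)
```

In a real voice, the target cost would compare prosodic and contextual features of the candidate against the target syllable, and the join cost would measure spectral and pitch mismatch at the concatenation point.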


International Conference on Acoustics, Speech, and Signal Processing | 2016

Effectiveness of fundamental frequency (F0) and strength of excitation (SOE) for spoofed speech detection

Tanvina B. Patel; Hemant A. Patil

Current countermeasures used in spoof detectors (for speech synthesis (SS) and voice conversion (VC)) are generally phase-based (as vocoders in SS and VC systems lack phase information). These approaches may fail for non-vocoder or unit-selection-based spoofs. In this work, we explore excitation source-based features, i.e., the fundamental frequency (F0) contour and the strength of excitation (SoE) at the glottis, as discriminative features using a GMM-based classification system. We use F0 and SoE1 estimated from the speech signal through the zero frequency (ZF) filtering method. Further, SoE2 is estimated from the negative peaks of the derivative of the glottal flow waveform (dGFW) at glottal closure instants (GCIs). On the evaluation set of the ASVspoof 2015 challenge database, the F0 and SoE features along with their dynamic variations achieve an Equal Error Rate (EER) of 12.41%. The source features are fused at score level with MFCC and the recently proposed cochlear filter cepstral coefficients and instantaneous frequency (CFCCIF) features. On fusion with MFCC (CFCCIF), the EER decreases from 4.08% to 3.26% (2.07% to 1.72%). The decrease in EER is evident on both known and unknown vocoder-based attacks. When MFCC, CFCCIF and source features are combined, the EER further decreases to 1.61%. Thus, the source features capture information complementary to MFCC and CFCCIF used alone.
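The F0 contour named above can be made concrete with a simple frame-level estimator. Note the hedge: the paper derives F0 (and SoE) via zero-frequency filtering; the autocorrelation method below is a common textbook stand-in, shown only to illustrate what an F0-contour feature is.

```python
import numpy as np

def f0_autocorr(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate F0 of one voiced frame from the autocorrelation peak
    searched within the plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def f0_contour(signal, fs, frame_len=0.04, hop=0.01):
    """Frame-level F0 contour (no voicing decision, for illustration only)."""
    n, h = int(frame_len * fs), int(hop * fs)
    return np.array([f0_autocorr(signal[i:i + n], fs)
                     for i in range(0, len(signal) - n, h)])
```

A spoof detector along the abstract's lines would then append the contour's dynamic (delta) variations and model natural vs. spoofed classes with GMMs.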


International Conference on Asian Language Processing | 2013

A Novel Gaussian Filter-Based Automatic Labeling of Speech Data for TTS System in Gujarati Language

Swati Talesara; Hemant A. Patil; Tanvina B. Patel; Hardik B. Sailor; Nirmesh J. Shah

A text-to-speech (TTS) synthesizer has proved to be an aiding tool for many visually challenged people, enabling reading through hearing feedback. TTS synthesizers are available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer has been built within the Festival speech synthesis framework. The syllable is taken as the basic unit in building the Gujarati TTS synthesizer, as Indian languages are syllabic in nature. Building a unit-selection-based Gujarati TTS system requires a large labeled Gujarati corpus, and the task of labeling is the most time-consuming and tedious, requiring large manual effort. Therefore, in this work, an attempt has been made to reduce this effort by automatically generating a labeled corpus at the syllable level. To that effect, a Gaussian-based segmentation method has been proposed for automatic segmentation of speech at the syllable level. It has been observed that the percentage correctness of the labeled data is around 80% for both male and female voices, as compared to 70% for group delay-based labeling. In addition, the system built on the proposed approach shows better intelligibility when evaluated by a visually challenged subject. The word error rate is reduced by 5% for the Gaussian filter-based TTS system compared to the group delay-based TTS system, and a 5% increase is observed in correctly synthesized words. The main focus of this work is to reduce the manual effort required in building a TTS system for Gujarati (primarily the manual effort required in labeling speech data).
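As a rough illustration of Gaussian filter-based syllable segmentation: smooth a short-time energy contour with a Gaussian kernel, then place tentative syllable boundaries at the valleys between energy peaks (syllable nuclei). The kernel width and the synthetic energy contour in the test are hypothetical choices, not the paper's tuned parameters.

```python
import numpy as np

def gaussian_smooth(x, sigma):
    """Smooth a 1-D contour with a truncated, normalized Gaussian kernel."""
    m = int(4 * sigma)
    t = np.arange(-m, m + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    k /= k.sum()
    return np.convolve(x, k, mode='same')

def syllable_boundaries(energy, sigma=5.0):
    """Candidate boundaries = local minima of the smoothed energy contour
    (valleys between syllable nuclei)."""
    s = gaussian_smooth(energy, sigma)
    return [i for i in range(1, len(s) - 1)
            if s[i] < s[i - 1] and s[i] < s[i + 1]]
```

In practice the energy contour would come from framewise short-time energy of the recorded utterance, and detected valleys would seed the syllable-level labels that are then manually verified.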


IEEE Journal of Selected Topics in Signal Processing | 2017

Cochlear Filter and Instantaneous Frequency Based Features for Spoofed Speech Detection

Tanvina B. Patel; Hemant A. Patil

The vulnerability of voice biometric systems to spoofing attacks by synthetic speech (SS) and voice-converted (VC) speech has raised the need for standalone spoofed speech detection (SSD) systems. This paper is an extension of our previously proposed features (used in the relatively best performing SSD system) at the first ASVspoof 2015 challenge held at INTERSPEECH 2015. For the challenge, the authors proposed novel features based on cochlear filter cepstral coefficients (CFCC) and instantaneous frequency (IF), i.e., CFCCIF. The basic motivation is that the human ear processes speech in subbands; the envelope of each subband and its IF are important for the perception of speech. In addition, transient information also adds to the perceptual information that is captured. We observed that subband energy variations across CFCCIF, when estimated by symmetric difference (CFCCIFS), gave better discriminative properties than CFCCIF. The features are extracted at the frame level, and a Gaussian mixture model-based classification system is used. Experiments were conducted on the ASVspoof 2015 challenge database with MFCC, CFCC, CFCCIF, and CFCCIFS features. On the evaluation dataset, after score-level fusion with MFCC, the CFCCIFS features gave an overall equal error rate (EER) of 1.45%, as compared to 1.87% and 1.61% with CFCCIF and CFCC, respectively. In addition to detecting known and unknown attacks, intensive experiments have been conducted to study the effectiveness of the features when either only SS or only VC speech is available for training. It was observed that when only VC speech is used in training, both VC and SS can be detected; however, when only SS is used in training, VC speech is not detected. In general, amongst vocoder-based spoofs, VC speech is relatively more difficult to detect than SS by the SSD system. However, vocoder-independent SS was the toughest, with the highest EER (i.e., > 10%).
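The equal error rate quoted throughout these abstracts can be computed from two score lists as below. This is a standard sketch (threshold sweep where false-acceptance and false-rejection rates cross); the challenge's official scoring tool remains the reference implementation.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """EER: operating point where the false-acceptance rate on spoofed
    trials equals the false-rejection rate on genuine trials
    (convention: higher score = more likely genuine)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoofs accepted
        frr = np.mean(genuine_scores < t)   # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2.0
    return eer
```

Score-level fusion, as used in the paper, simply means the per-utterance scores of two systems are combined (e.g. a weighted sum) before this EER computation is applied.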


Conference of the International Speech Communication Association | 2016

Novel Subband Autoencoder Features for Detection of Spoofed Speech.

Meet H. Soni; Tanvina B. Patel; Hemant A. Patil

Deep Neural Networks (DNNs) have been extensively used in Automatic Speech Recognition (ASR) applications. Very recently, DNNs have also found application in detecting natural vs. spoofed speech at the ASVspoof challenge held at INTERSPEECH 2015. Along similar lines, in this work we propose a new DNN feature extraction architecture called the subband autoencoder (SBAE) for the spoof detection task. The SBAE is inspired by the human auditory system and extracts features from the speech spectrum in an unsupervised manner. The features derived from the SBAE are compared with state-of-the-art Mel Frequency Cepstral Coefficient (MFCC) features. The experiments were performed on the ASVspoof challenge database and the performance was evaluated using the Equal Error Rate (EER). It was observed that on the evaluation set, 36-dimensional (static+∆+∆∆) MFCC features gave 4.32% EER, which reduced to 2.9% when 36-dimensional SBAE features were used. On fusing the SBAE features at score level with MFCC, a further reduction to 1.93% EER was observed. This improvement in EER is due to the fact that the dynamics of the SBAE features capture significant spoof-specific characteristics, enabling detection of even vocoder-independent speech, which is not the case for MFCC. Index Terms: subband autoencoder, spoof detection, vocoder speech.
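A minimal sketch of the subband autoencoder idea, under the assumption (ours, not the paper's exact topology) that each hidden unit is connected only to one contiguous block of spectral bins; a binary mask enforces this restriction during both the forward pass and the gradient update. The hidden activations Z would serve as the SBAE features. Dimensions, learning rate, and random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 8                       # spectrum bins, hidden (subband) units
mask = np.zeros((D, H))
for h in range(H):                 # each hidden unit sees one 8-bin subband
    mask[h * 8:(h + 1) * 8, h] = 1.0

W = rng.normal(0, 0.1, (D, H)) * mask   # subband-restricted encoder weights
V = rng.normal(0, 0.1, (H, D))          # decoder weights (unrestricted)

def forward(X):
    Z = np.tanh(X @ (W * mask))    # encoder: one activation per subband
    return Z, Z @ V                # decoder: reconstruct full spectrum

X = rng.normal(0, 1, (100, D))     # stand-in for log-spectral frames
lr, losses = 0.01, []
for _ in range(200):               # plain gradient descent on MSE
    Z, Xhat = forward(X)
    err = Xhat - X
    losses.append(np.mean(err ** 2))
    dV = Z.T @ err / len(X)
    dZ = (err @ V.T) * (1 - Z ** 2)          # tanh backprop
    dW = (X.T @ dZ / len(X)) * mask          # mask keeps subband structure
    V -= lr * dV
    W -= lr * dW
```

The mask is the whole point: unlike a fully connected autoencoder, each feature dimension summarizes one auditory-style subband of the spectrum.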


International Conference Oriental COCOSDA held jointly with Conference on Asian Spoken Language Research and Evaluation | 2013

Algorithms for speech segmentation at syllable-level for text-to-speech synthesis system in Gujarati

Hemant A. Patil; Tanvina B. Patel; Swati Talesara; Nirmesh J. Shah; Hardik B. Sailor; Bhavik Vachhani; Janki Akhani; Bhargav Kanakiya; Yashesh Gaur; Vibha Prajapati

A text-to-speech (TTS) synthesizer has been an effective tool for many visually challenged people, enabling reading through hearing feedback. TTS synthesizers built through the Festival framework require a large speech corpus, which needs to be labeled at either the phoneme level or the syllable level. TTS systems are mostly available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer has been built. As Indian languages are syllabic in nature, the syllable is taken as the basic speech sound unit. Building a unit-selection-based Gujarati TTS system requires a large labeled Gujarati corpus, and the task of labeling is manual, time-consuming, and tedious. Therefore, in this work, an attempt has been made to reduce this effort by automatically generating an almost accurate labeled speech corpus at the syllable level. To that effect, group delay-based, spectral transition measure (STM)-based, and Gaussian filter-based segmentation methods are presented and their performances compared. It has been observed that the percentage correctness of the labeled data is around 83% for both male and female voices, as compared to 70% for group delay-based labeling and 78% for STM-based labeling. In addition, the systems built from the label files generated by the above methods were evaluated by a visually challenged subject. The word correctness rate is increased by 5% (3%) and 10% (12%) for the Gaussian filter-based TTS system as compared to the group delay-based and STM-based systems built on the female (male) voice. Similarly, there is an overall reduction in the word error rate (WER) of the Gaussian-based approach of 8% (2%) and 6% (-5%) as compared to the group delay-based and STM-based systems built on the female (male) voice.
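The spectral transition measure compared above can be sketched as the mean squared slope of least-squares regression lines fitted to each feature dimension over a short context window; segment boundaries are hypothesized at STM peaks, where the spectrum changes fastest. The context size and the toy features in the test are illustrative assumptions.

```python
import numpy as np

def spectral_transition_measure(feats, ctx=3):
    """feats: (frames, dims) array of spectral features (e.g. MFCCs).
    STM at frame n = mean over dimensions of the squared slope of a
    least-squares line fitted over frames n-ctx .. n+ctx.
    (Slope = sum(t * x) / sum(t^2) because the time axis t is zero-mean.)"""
    t = np.arange(-ctx, ctx + 1, dtype=float)
    denom = (t ** 2).sum()
    stm = np.zeros(feats.shape[0])
    for n in range(ctx, feats.shape[0] - ctx):
        seg = feats[n - ctx:n + ctx + 1]              # (2*ctx+1, dims)
        slopes = (t[:, None] * seg).sum(axis=0) / denom
        stm[n] = np.mean(slopes ** 2)
    return stm
```

Peaks of this contour mark rapid spectral transitions, which is why STM is a natural competitor to energy-valley and group-delay methods for syllable segmentation.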


Conference of the International Speech Communication Association | 2016

Novel Nonlinear Prediction Based Features for Spoofed Speech Detection.

Himanshu N. Bhavsar; Tanvina B. Patel; Hemant A. Patil

Several speech synthesis and voice conversion techniques can easily generate or manipulate speech to deceive speaker verification (SV) systems. Hence, there is a need to develop spoofing countermeasures to distinguish human speech from spoofed speech. System-based features are known to contribute significantly to this task. In this paper, we extend a recent study of Linear Prediction (LP) and Long-Term Prediction (LTP)-based features to LP and Nonlinear Prediction (NLP)-based features. To evaluate the effectiveness of the proposed countermeasure, we use the corpora provided at the ASVspoof 2015 challenge. A Gaussian Mixture Model (GMM)-based classifier is used, and the Equal Error Rate (EER, in %) is used as the performance measure. On the development set, it is found that LP-LTP and LP-NLP features give an average EER of 4.78% and 9.18%, respectively. Score-level fusion of LP-LTP (and LP-NLP) with Mel Frequency Cepstral Coefficients (MFCC) gives an EER of 0.8% (and 1.37%), respectively. After score-level fusion of LP-LTP, LP-NLP and MFCC features, the EER is significantly reduced to 0.57%. The LP-LTP and LP-NLP features are found to work well even for the Blizzard Challenge 2012 speech database.
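To make the LP part concrete, here is a standard Levinson-Durbin sketch for computing LP coefficients, plus the inverse-filter residual (the excitation-like signal that LP-LTP/LP-NLP style features build on). This is the generic textbook algorithm, not the paper's exact feature pipeline.

```python
import numpy as np

def lp_coefficients(x, order):
    """LP polynomial A(z) = 1 + a1 z^-1 + ... via the Levinson-Durbin
    recursion on the (biased) autocorrelation sequence of the signal."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                      # reflection coefficient
        a[1:i + 1] += k * np.concatenate([a[1:i][::-1], [1.0]])
        err *= (1.0 - k * k)                # prediction-error update
    return a, err

def lp_residual(x, a):
    """Inverse-filter the signal with A(z); residual approximates the
    excitation for an autoregressive signal."""
    return np.convolve(a, x)[:len(x)]
```

For a genuinely autoregressive signal, the residual variance is much smaller than the signal variance, which is the property the prediction-based features exploit.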


International Conference on Acoustics, Speech, and Signal Processing | 2015

A novel filtering based approach for epoch extraction

Pramod B. Bachhav; Hemant A. Patil; Tanvina B. Patel

In this paper, we propose a novel algorithm which uses simple lowpass filtering as pre-processing for the detection of epochs. Lowpass filtering with an appropriate cut-off frequency removes the effect of vocal tract characteristics, as formants lie in relatively higher frequency regions. The method is evaluated on the entire CMU-ARCTIC database, which includes electroglottograph (EGG) signals. Noise robustness of the proposed algorithm is evaluated in the presence of additive white noise at various SNR levels. Experimental results show that lowpass filtering makes the proposed algorithm noise robust. The method gives comparable or better results than the two state-of-the-art methods, viz., ZFR and SEDREAMS (which require a priori knowledge of the pitch period). In addition, the proposed method shows an improvement in identification accuracy.
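A hedged sketch of the lowpass-filtering idea: suppress the formant-dominated high-frequency region, then mark prominent peaks of the filtered waveform as epoch candidates. The FIR design, cutoff, and peak-picking rule below are our illustrative choices; the paper's exact algorithm and thresholds may differ.

```python
import numpy as np

def lowpass_fir(x, fs, cutoff, numtaps=201):
    """Linear-phase windowed-sinc lowpass FIR; centered convolution
    ('same') keeps the output time-aligned with the input."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2.0 * cutoff / fs * n) * np.hamming(numtaps)
    h /= h.sum()
    return np.convolve(x, h, mode='same')

def detect_epochs(x, fs, cutoff=150.0):
    """Epoch candidates = prominent local maxima of the lowpassed signal,
    which oscillates at the pitch rate once formant energy is removed."""
    y = lowpass_fir(x, fs, cutoff)
    thr = 0.5 * y.max()
    return [i for i in range(1, len(y) - 1)
            if y[i] > thr and y[i] > y[i - 1] and y[i] >= y[i + 1]]
```

On a synthetic excitation with one impulse per pitch period, the detected peaks should be spaced by roughly one pitch period.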


International Symposium on Chinese Spoken Language Processing | 2014

Deterministic annealing EM algorithm for developing TTS system in Gujarati

Nirmesh J. Shah; Hemant A. Patil; Maulik C. Madhavi; Hardik B. Sailor; Tanvina B. Patel

The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from speech recognition to speech synthesis. In this work, we apply the HMM-based Speech Synthesis System (HTS) method to the Gujarati language. Adaptation and evaluation of HTS for Gujarati are done here, in terms of naturalness and speech intelligibility. In addition, a conventional EM algorithm-based HTS and the recently proposed Deterministic Annealing EM (DAEM) algorithm-based HTS are applied to Gujarati (a low-resourced language) and compared. It is found that HTS in Gujarati has very high intelligibility. An AB-test verified that the DAEM-based HTS was preferred over the EM-based HTS developed for Gujarati 70.5% of the time.
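A minimal 1-D illustration of the DAEM mechanics (not the HTS training recipe): the E-step posteriors are tempered by an inverse temperature beta that is annealed toward 1, at which point standard EM is recovered. The initialization, schedule, and data below are hypothetical.

```python
import numpy as np

def daem_gmm(x, betas=(0.8, 0.9, 1.0), iters=15):
    """Deterministic annealing EM for a 1-D two-component GMM.
    beta tempers the E-step posteriors (smoothing the likelihood surface);
    annealing beta -> 1 recovers the standard EM updates.
    The schedule and initialization here are illustrative choices."""
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])
    w = np.array([0.5, 0.5])
    for beta in betas:
        for _ in range(iters):
            # E-step with tempered (annealed) posteriors
            logp = (-0.5 * np.log(2 * np.pi * var)
                    - (x[:, None] - mu) ** 2 / (2 * var) + np.log(w))
            g = np.exp(beta * (logp - logp.max(axis=1, keepdims=True)))
            g /= g.sum(axis=1, keepdims=True)
            # M-step (unchanged from standard EM)
            nk = g.sum(axis=0)
            w = nk / len(x)
            mu = (g * x[:, None]).sum(axis=0) / nk
            var = (g * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

In HTS training the same tempering is applied to HMM state posteriors rather than to a flat mixture, which is what makes DAEM less sensitive to poor initialization on low-resource data.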


International Conference on Acoustics, Speech, and Signal Processing | 2016

Analysis of natural and synthetic speech using Fujisaki model

Tanvina B. Patel; Hemant A. Patil

Text-to-speech (TTS) synthesis systems are being advanced to achieve naturalness and intelligibility in synthetic speech. Unit selection-based synthesis (USS) and Hidden Markov Model-based text-to-speech synthesis (HTS) are recent techniques in this area. USS-based synthetic speech is known to be natural (due to the concatenation of natural speech sound units); on the other hand, HTS-based speech is not perceived as being as natural as USS-based synthetic speech. Because of these speech synthesis technologies, voice biometric systems may face threats from impostor attacks; thus, it is important to study the differences that exist between natural and synthetic speech. In this context, we investigate the effectiveness of the parameters of the Fujisaki model for capturing fundamental frequency (F0) contour variations in natural and synthetic speech. The F0 contour of speech contains linguistic and non-linguistic information. Experimental results on several utterances from Gujarati (a low-resourced language) demonstrate the effectiveness of the phrase and accent components in analyzing the difference between these two types of speech. The variability in the phrase and accent components suggests that synthetic speech differs from natural speech in terms of the prosodic information in the excitation source. These findings may help distinguish the two types of speech and provide an aid to alleviate impostor attacks.
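The Fujisaki model decomposes the log-F0 contour into a baseline value, phrase components (impulse responses of a second-order system), and accent components (stepwise commands smoothed by another second-order system). A sketch with hypothetical command timings and amplitudes:

```python
import numpy as np

def phrase_comp(t, alpha=2.0):
    """Phrase component: impulse response alpha^2 * t * exp(-alpha*t),
    zero for t < 0; it peaks at t = 1/alpha."""
    g = (alpha ** 2) * t * np.exp(-alpha * t)
    return np.where(t >= 0, g, 0.0)

def accent_comp(t, beta=20.0, gamma=0.9):
    """Accent component: step response 1 - (1 + beta*t) * exp(-beta*t),
    clipped at ceiling gamma, zero for t < 0."""
    g = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

def fujisaki_lnf0(t, fb=120.0,
                  phrases=((0.0, 0.5),),          # (onset, amplitude Ap)
                  accents=((0.3, 0.8, 0.4),)):    # (on, off, amplitude Aa)
    """ln F0(t) = ln Fb + sum of phrase and accent contributions.
    Command timings/amplitudes here are hypothetical examples."""
    ln_f0 = np.full_like(t, np.log(fb))
    for t0, ap in phrases:
        ln_f0 += ap * phrase_comp(t - t0)
    for t1, t2, aa in accents:
        ln_f0 += aa * (accent_comp(t - t1) - accent_comp(t - t2))
    return ln_f0
```

Fitting these phrase and accent parameters to measured F0 contours, and comparing their spread for natural versus synthetic utterances, is the kind of analysis the abstract describes.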

Collaboration


Dive into Tanvina B. Patel's collaborations.

Top Co-Authors

Hemant A. Patil
Dhirubhai Ambani Institute of Information and Communication Technology

Hardik B. Sailor
Dhirubhai Ambani Institute of Information and Communication Technology

Nirmesh J. Shah
Dhirubhai Ambani Institute of Information and Communication Technology

Maulik C. Madhavi
Dhirubhai Ambani Institute of Information and Communication Technology

Meet H. Soni
Dhirubhai Ambani Institute of Information and Communication Technology

Swati Talesara
Dhirubhai Ambani Institute of Information and Communication Technology

Anil Kumar Sao
Indian Institute of Technology Mandi

Arup Saha
Centre for Development of Advanced Computing

Avni Rajpal
Indian Institute of Chemical Technology

Bhargav Kanakiya
Dhirubhai Ambani Institute of Information and Communication Technology