Pramod H. Kachare
Veermata Jijabai Technological Institute
Publication
Featured research published by Pramod H. Kachare.
Applied Soft Computing | 2014
Jagannath H. Nirmal; Mukesh A. Zaveri; Suprava Patnaik; Pramod H. Kachare
Highlights: We model pitch residuals using wavelet packet decomposed coefficients. Thus, the problem of artifacts generated by direct transformation using ANN is alleviated. GRNN is proposed to modify the vocal tract and the wavelet packet decomposed pitch residuals. Mapping using the GRNN model performs slightly better than the GMM and RBF mapping models. Fast convergence of GRNN reduces the computation time and overtraining of conventional ANN.
The objective of a voice conversion system is to formulate the mapping function that transforms the characteristics of the source speaker into those of the target speaker. In this paper, we propose a General Regression Neural Network (GRNN) based model for voice conversion. It is a single-pass learning network, which makes the training procedure fast. The proposed system uses the shape of the vocal tract, the shape of the glottal pulse (excitation signal) and long-term prosodic features to carry out the voice conversion task. The shape of the vocal tract and the shape of the source excitation of a particular speaker are represented using Line Spectral Frequencies (LSFs) and the Linear Prediction (LP) residual, respectively. GRNN is used to obtain the mapping function between the source and target speakers. Direct transformation of the time-domain residual using an Artificial Neural Network (ANN) causes phase changes and generates artifacts in consecutive frames. To alleviate this, wavelet packet decomposed coefficients are used to characterize the excitation of the speech signal. The long-term prosodic parameters, namely the pitch contour (intonation) and the energy profile of the test signal, are also modified in relation to those of the target (desired) speaker using the baseline method. The performance of the proposed model is compared with that of voice conversion systems based on state-of-the-art RBF and GMM models, using objective and subjective evaluation measures. These measures show that the proposed GRNN-based voice conversion system performs slightly better than the state-of-the-art models.
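The single-pass nature of GRNN can be illustrated with a minimal sketch. It assumes the textbook formulation of GRNN as Gaussian kernel regression over stored training pairs; the function name `grnn_predict`, the smoothing parameter `sigma` and the toy data are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def grnn_predict(train_x, train_y, query_x, sigma=0.5):
    """Predict targets with a textbook GRNN (Gaussian kernel regression).

    "Training" only stores the (train_x, train_y) pairs, which is why no
    iterative weight update is needed. sigma is an assumed smoothing width.
    """
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    # Squared Euclidean distance between every query and every stored pattern.
    diff = query_x[:, None, :] - train_x[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Pattern layer: one Gaussian unit per stored training vector.
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Summation and output layers: kernel-weighted average of stored targets.
    denom = np.maximum(weights.sum(axis=1, keepdims=True), np.finfo(float).tiny)
    return (weights @ train_y) / denom

# Toy usage: three stored source/target feature pairs (hypothetical values).
source = np.array([[0.0], [1.0], [2.0]])
target = np.array([[0.0], [10.0], [20.0]])
converted = grnn_predict(source, target, np.array([[1.0], [0.5]]), sigma=0.1)
```

In a voice conversion setting the stored pairs would be time-aligned source and target feature vectors; the only quantity left to tune is the smoothing width.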
International Scholarly Research Notices | 2014
Jagannath H. Nirmal; Suprava Patnaik; Mukesh A. Zaveri; Pramod H. Kachare
The complex cepstrum vocoder is used to modify the speaker-specific characteristics of the source speaker's speech to those of the target speaker's speech. Low-time and high-time liftering are used to split the calculated cepstrum into the vocal tract and the source excitation parameters. The obtained mixed-phase vocal tract and source excitation parameters with finite impulse response preserve the phase properties of the resynthesized speech frame. The radial basis function is explored to capture the nonlinear mapping function for modifying the complex cepstrum based real and imaginary components of the vocal tract and source excitation of the speech signal. For comparison, the state-of-the-art Mel cepstrum envelope and the fundamental frequency (F0) are considered to represent the vocal tract and the source excitation of the speech frame, respectively. A radial basis function is used to capture the nonlinear relations between the Mel cepstrum envelopes of the source and target speakers. A mean and standard deviation approach is employed to modify the fundamental frequency (F0). The Mel log spectral approximation filter is used to reconstruct the speech signal from the modified Mel cepstrum envelope and fundamental frequency. The proposed complex cepstrum based model is compared with the state-of-the-art Mel cepstrum envelope based voice conversion model using objective and subjective evaluations. The evaluation measures reveal that the proposed complex cepstrum based voice conversion system approximates the converted speech signal with better accuracy than the Mel cepstrum envelope based model.
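The mean and standard deviation approach to fundamental frequency modification mentioned above can be sketched as follows. This assumes the common linear normalization applied directly in Hz to voiced frames; the abstract does not say whether the authors work in the linear or the log domain, and the name `convert_f0` and the statistics used are hypothetical.

```python
import numpy as np

def convert_f0(f0_source, src_mean, src_std, tgt_mean, tgt_std):
    """Shift and scale voiced F0 values to match the target statistics.

    Frames with F0 == 0 are treated as unvoiced and left at zero
    (an assumed convention, not stated in the source).
    """
    f0 = np.asarray(f0_source, dtype=float)
    voiced = f0 > 0
    converted = np.zeros_like(f0)
    # Normalize by the source statistics, then rescale to the target's.
    converted[voiced] = tgt_mean + (tgt_std / src_std) * (f0[voiced] - src_mean)
    return converted

# Toy contour in Hz with one unvoiced frame (hypothetical values).
contour = convert_f0([0.0, 100.0, 120.0, 140.0], 120.0, 20.0, 220.0, 30.0)
```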
Eurasip Journal on Audio, Speech, and Music Processing | 2013
Jagannath H. Nirmal; Mukesh A. Zaveri; Suprava Patnaik; Pramod H. Kachare
A voice conversion framework is expected to capture both the static and the dynamic characteristics of the speech signal. Conventional approaches such as Mel frequency cepstrum coefficients and linear predictive coefficients focus on spectral features limited to the lower frequency bands. This paper presents a novel wavelet packet filter bank approach to identify the non-uniformly distributed dynamic characteristics of the speaker. The contribution of this paper is threefold. First, in the feature extraction stage, the dyadic wavelet packet tree structure is optimized to involve less computation while preserving the speaker-specific features. Second, in the feature representation step, magnitude and phase attributes are treated separately, because raw time-frequency traits are highly correlated yet carry intelligible speech information. Finally, an RBF mapping function is established to transform the speaker-specific features from the source to the target speaker. The results obtained by the proposed filter bank based voice conversion system are compared with baseline multiscale voice morphing results using subjective and objective measures. Evaluation results reveal that the proposed method outperforms the baseline by incorporating the speaker-specific dynamic characteristics and phase information of the speech signal.
Neural Computing and Applications | 2016
Jagannath H. Nirmal; Mukesh A. Zaveri; Suprava Patnaik; Pramod H. Kachare
The objective of voice conversion is to replace the speaker-dependent characteristics of the source speaker so that the speech is perceptually similar to that of the target speaker. The speaker-dependent spectral parameters are characterized using single-scale interpolation techniques such as linear predictive coefficients, formant frequencies, Mel cepstrum envelope and line spectral frequencies. These features provide a good approximation of the vocal tract, but produce artifacts at the frame boundaries, which result in inaccurate parameter estimation and distortion in the re-synthesis of the speech signal. This paper presents a novel approach to voice conversion based on the multi-scale wavelet packet transform in the framework of a radial basis neural network. The basic idea is to split the signal acoustic space into different salient frequency sub-bands, which are finely tuned to capture the speaker identity conveyed by the speech signal. Characteristics of different wavelet filters are studied to determine the best filter for the proposed voice conversion system. The performance of the proposed algorithm is compared with that of state-of-the-art wavelet-based voice morphing using various subjective and objective measures. The results reveal that the proposed algorithm performs better than conventional wavelet-based voice morphing.
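A radial basis network used as a source-to-target mapping function can be sketched as below. This is a generic Gaussian RBF network with a least-squares output layer, offered under the assumption that the centers and width are chosen in advance; the helper names, the shared width and the toy data are illustrative and do not reproduce the authors' configuration.

```python
import numpy as np

def rbf_design(x, centers, width):
    """Hidden-layer activations of a Gaussian RBF network plus a bias column."""
    sq_dist = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    phi = np.exp(-sq_dist / (2.0 * width ** 2))
    return np.hstack([phi, np.ones((x.shape[0], 1))])

def rbf_fit(x, y, centers, width):
    """Solve for the linear output weights by least squares."""
    weights, *_ = np.linalg.lstsq(rbf_design(x, centers, width), y, rcond=None)
    return weights

def rbf_predict(x, centers, width, weights):
    """Map input feature vectors through the fitted network."""
    return rbf_design(x, centers, width) @ weights

# Toy usage: stand-ins for aligned source and target features.
src = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
tgt = np.sin(2.0 * np.pi * src)
w = rbf_fit(src, tgt, centers=src, width=0.1)
mapped = rbf_predict(src, src, 0.1, w)
```

Placing one center on every training vector, as in the toy usage, interpolates the training data exactly; a practical system would more likely use fewer centers (for example from clustering) to avoid overfitting.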
Neurocomputing | 2017
Jagannath H. Nirmal; Mukesh A. Zaveri; Suprava Patnaik; Pramod H. Kachare
The voice conversion system modifies the speaker-specific characteristics of the source speaker to those of the target speaker, so that the converted speech is perceived as that of the target speaker. The speaker-specific characteristics of the speech signal are reflected at different levels, such as the shape of the vocal tract, the shape of the glottal excitation and long-term prosody. The shape of the vocal tract is represented by Line Spectral Frequencies (LSF) and the shape of the glottal excitation by Linear Predictive (LP) residuals. In this paper, the fourth-level wavelet packet transform is applied to the LP residual to generate sixteen sub-bands. This approach not only reduces the computational complexity but also presents a genuine transformation model over state-of-the-art statistical prediction methods. In voice conversion, alignment is an essential process that aligns the features of the source and target speakers. In this paper, a Mel Frequency Cepstrum Coefficients (MFCC) based warping path is proposed to align the LSF and LP-residual sub-bands using the proposed constant source and constant target alignments. The conventional alignment technique is compared with the two proposed approaches. Analysis shows that constant source alignment using the MFCC warping path performs slightly better than constant target alignment and the state-of-the-art alignment approach. Generalized mapping models are developed for each sub-band using a Radial Basis Function (RBF) neural network and are compared with the Gaussian Mixture Model (GMM) mapping and the residual selection approach. Various subjective and objective evaluation measures indicate that the RBF-based residual mapping approach performs significantly better than the state-of-the-art approaches.
Highlights: The LSF represents formant peaks well but fails to represent formant valleys. Hence, the warping path calculated from it does not yield a satisfactory alignment. This limitation of LSF-based warping is overcome through a new alignment using an MFCC-based warping path, which improves the conversion performance of the proposed system. Further, the existing techniques for mapping the LP residual suffer from artifacts generated in consecutive frames, and the residual signal is quite complex to map. To reduce the high dimensionality of the residual signal and the complexity of the model, WPT and RBF pairs are employed. The experimental results show that the proposed MFCC-based warping path and the WPT-RBF based transformation of the residual signal outperform the state-of-the-art methods of residual selection and GMM mapping.
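The statement that a fourth-level wavelet packet transform yields sixteen sub-bands can be checked with a small sketch. It assumes the Haar filter pair purely for simplicity (the abstract does not name the wavelet used) and a frame length divisible by 16; the function names and the random stand-in for an LP-residual frame are hypothetical.

```python
import numpy as np

def haar_split(x):
    """One analysis step: orthonormal Haar low-pass and high-pass halves."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def wavelet_packet_subbands(signal, level=4):
    """Full wavelet packet tree: every node is split at every level.

    Returns 2**level leaf sub-bands in natural (tree) order, which is not
    the same as increasing-frequency order.
    """
    x = np.asarray(signal, dtype=float)
    if x.size % (2 ** level) != 0:
        raise ValueError("signal length must be divisible by 2**level")
    nodes = [x]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            next_nodes.extend(haar_split(node))
        nodes = next_nodes
    return nodes

# Toy usage: a random frame standing in for an LP residual.
frame = np.random.default_rng(0).standard_normal(256)
bands = wavelet_packet_subbands(frame, level=4)
```

Unlike the ordinary wavelet transform, which splits only the low-pass branch, the packet transform splits both branches, so level 4 produces 2^4 = 16 leaves, each one sixteenth of the frame length.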
International Conference on Industrial Instrumentation and Control | 2015
Pramod H. Kachare; Alice N. Cheeran; Jagannath H. Nirmal
Nonlinear frequency warping is of recent interest in speech processing studies due to its ability to extract speaker-specific features. In particular, the all-pass Bilinear Transformation (BLT) is a highly recommended tool for realizing such frequency warping. The proposed work suggests a generic approach to unwarp formerly unmanageable second-order BLT frequency mappings. Various spline approximation techniques are analyzed to fit the inverse warping functions. Experiments are performed to invert first- and second-order warping approximations of the state-of-the-art Mel and Bark scales. The method is employed to derive an inverse triangular filter bank for arbitrary frequency warping.
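The idea of inverting a warping function numerically can be sketched for the first-order case. The code uses the standard phase response of a first-order all-pass filter as the warping curve and piecewise-linear interpolation (`np.interp`) as a simple stand-in for the spline fits the abstract mentions; it does not implement the second-order mappings from the paper, and the function names and the value alpha = 0.42 are illustrative assumptions.

```python
import numpy as np

def allpass_warp(omega, alpha):
    """Warp normalized frequencies in [0, pi] with a first-order all-pass filter."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))

def numeric_unwarp(warped, alpha, grid_size=4096):
    """Invert the warping by interpolating the sampled curve with axes swapped.

    Valid because the curve is monotonically increasing for |alpha| < 1.
    """
    grid = np.linspace(0.0, np.pi, grid_size)
    return np.interp(warped, allpass_warp(grid, alpha), grid)

# Toy usage: warp a frequency axis and recover it numerically.
omega = np.linspace(0.0, np.pi, 50)
recovered = numeric_unwarp(allpass_warp(omega, 0.42), 0.42)
```

For the first-order case an exact inverse also exists (warping again with the sign of alpha flipped), which makes it a convenient check on the numerical inverse; the interpolation route is the one that carries over to warpings without a closed-form inverse.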
Advances in Computing and Communications | 2015
Pramod H. Kachare; Alice Cheeran; Jagganath Nirmal; Mukesh A. Zaveri
Voice conversion has been studied over the past few decades, yet no flawless system has been developed. The primary restriction in developing conversion systems is degraded output speech quality. The work presented here alleviates this problem by mapping higher-order excitation features along with state-of-the-art spectral parameters. Linear predictive analysis is used to extract the shape of the vocal tract and the corresponding residual signal. The high feature dimensionality of the excitation signal is handled using synchronous segmentation and windowing of the signal. Each of the resulting frames is wavelet analyzed to calculate normalized sub-band energy coefficients, which form a codebook. Conversion is obtained by selecting the target residual that minimizes an energy cost function. The primary advantage of this technique is reduced dimensionality with satisfactory conversion statistics. The proposed method is compared with the baseline residual selection approach using various subjective and objective tests. Wavelet features provide better selection criteria, with a slight improvement in output speech individuality.
Advances in Computing and Communications | 2015
Ashwini Visave; Pramod H. Kachare; Amutha Jeyakumar; Alice N. Cheeran; Jagannath H. Nirmal
The use of modern technological advances in real-time biomedical analysis is crucial. The current work focuses on glottal pathology discrimination based on non-invasive speech analysis techniques. The primary setback in developing such a method is the irregular performance degradation of several state-of-the-art acoustic features. To overcome this problem, we use the glottal-to-noise excitation ratio, which predicts the breathiness quotient of the speech signal, supported by the characteristic mean pitch value. To build a decision model, we use an Artificial Neural Network (ANN) and a Support Vector Machine (SVM). Classification performance is compared using well-known parameters such as true positive rate, true negative rate and accuracy. Results of the analysis show slightly better performance for the SVM-based decision system.
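The three comparison measures named above follow directly from the confusion counts of a binary classifier. The sketch below assumes labels encoded as 1 (pathological) and 0 (normal); the function name and the toy labels are hypothetical and carry no results from the paper.

```python
def classification_rates(y_true, y_pred):
    """Return true positive rate, true negative rate and accuracy.

    Assumes binary labels with 1 as the positive (pathological) class.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    total = tp + tn + fp + fn
    return {
        # Sensitivity: fraction of pathological samples detected.
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of normal samples correctly passed.
        "tnr": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / total if total else float("nan"),
    }

# Toy usage with made-up labels.
rates = classification_rates([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```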
International Conference on Communication and Signal Processing | 2013
Jagannath H. Nirmal; Pramod H. Kachare; Suprava Patnaik; Mukesh A. Zaveri
Procedia Technology | 2013
Jagannath H. Nirmal; Suprava Patnaik; Mukesh A. Zaveri; Pramod H. Kachare