Arun Kumar
Indian Institute of Technology Delhi
Publications
Featured research published by Arun Kumar.
Journal of the Acoustical Society of America | 1996
Arun Kumar; S. K. Mullick
This paper reports results of the estimation of dynamical invariants, namely Lyapunov exponents, dimension, and metric entropy for speech signals. Two optimality criteria from dynamical systems literature, namely singular value decomposition method and the redundancy method, are used to reconstruct state space trajectories of speech and make observations. The positive values of the largest Lyapunov exponent of speech signals in the form of phoneme articulations show the average exponential divergence of nearby trajectories in the reconstructed state space. The dimension of a time series is a measure of its complexity and gives bounds on the number of state space variables needed to model it. It is found that most speech signals in the form of phoneme articulations are low dimensional. For comparison, a statistical model of a speech time series is also used to estimate the correlation dimension. The second‐order dynamical entropy (which is a lower bound of metric entropy) of speech time series is found to ...
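The state space reconstruction and correlation dimension mentioned in this abstract can be illustrated with a generic sketch. It assumes plain time-delay embedding and a Grassberger-Procaccia style correlation sum, not the singular value decomposition or redundancy criteria the paper uses to choose the reconstruction. The function names `delay_embed` and `correlation_sum`, and the parameters `dim`, `tau`, and `radius`, are hypothetical.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Return the time-delay embedding of a 1-D series as an (n, dim) array.

    Row t is [x[t], x[t + tau], ..., x[t + (dim - 1) * tau]].
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this dim and tau")
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def correlation_sum(points, radius):
    """Fraction of distinct point pairs whose Euclidean distance is below radius."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)  # each pair counted once
    return float(np.mean(dists[upper] < radius))

# Toy usage on a sine wave (a stand-in for a speech frame, not real data).
signal = np.sin(0.3 * np.arange(200))
trajectory = delay_embed(signal, dim=2, tau=5)
c_small = correlation_sum(trajectory, 0.1)
c_large = correlation_sum(trajectory, 0.5)
```

The correlation dimension is then estimated as the slope of log C(r) against log r over a range of small radii; choosing that scaling range is left out of this sketch.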
North American Chapter of the Association for Computational Linguistics | 2003
Dharmendra Kanejiya; Arun Kumar; Surendra Prasad
Latent semantic analysis (LSA) has been used in several intelligent tutoring systems (ITSs) for assessing students' learning by evaluating their answers to questions in the tutoring domain. It is based on word-document co-occurrence statistics in the training corpus and a dimensionality reduction technique. However, it does not consider word order or syntactic information, which can improve the knowledge representation and therefore lead to better performance of an ITS. We present here an approach called Syntactically Enhanced LSA (SELSA), which generalizes LSA by considering a word along with its syntactic neighborhood, given by the part-of-speech tag of its preceding word, as a unit of knowledge representation. The experimental results on the Auto-Tutor task of evaluating students' answers to basic computer science questions by SELSA, and its comparison with LSA, are presented in terms of several cognitive measures. SELSA is able to correctly evaluate a few more answers than LSA but has lower correlation with human evaluators than LSA. It also provides better discrimination of syntactic-semantic knowledge representation than LSA.
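The unit of representation described in this abstract, a word paired with the part-of-speech tag of its preceding word, can be sketched as follows. This is a minimal illustration under stated assumptions: the input is already tagged, the start marker `<s>` for the first word of a document is an assumption, raw counts are used with no term weighting, and the function names `selsa_matrix` and `reduce_rank` are hypothetical. The paper's tagger, weighting, and evaluation are not reproduced.

```python
import numpy as np

def selsa_matrix(tagged_docs):
    """Build a (previous POS tag, word) by document count matrix.

    tagged_docs is a list of documents, each a list of (word, tag) pairs.
    Returns the unit-to-row index and the count matrix.
    """
    units = {}
    entries = []
    for doc_index, doc in enumerate(tagged_docs):
        prev_tag = "<s>"  # assumed marker for "no preceding word"
        for word, tag in doc:
            row = units.setdefault((prev_tag, word), len(units))
            entries.append((row, doc_index))
            prev_tag = tag
    matrix = np.zeros((len(units), len(tagged_docs)))
    for row, doc_index in entries:
        matrix[row, doc_index] += 1
    return units, matrix

def reduce_rank(matrix, k):
    """Keep the k largest singular triplets, as in standard LSA."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

# Toy tagged corpus (invented for illustration only).
docs = [
    [("the", "DT"), ("stack", "NN")],
    [("the", "DT"), ("queue", "NN")],
]
units, counts = selsa_matrix(docs)
u, s, vt = reduce_rank(counts, k=1)
```

Plain LSA corresponds to indexing rows by `word` alone; keying on `(prev_tag, word)` is what lets the same word occupy different rows in different syntactic contexts.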
International Conference on Acoustics, Speech, and Signal Processing | 2003
Arun Kumar; Ashish Verma
Voice conversion techniques attempt to modify the speech signal so that it is perceived as if spoken by another speaker, different from the original speaker. In this paper, we present a novel approach to perform voice conversion. Our approach uses acoustic models based on units of speech, like phones and diphones, for voice conversion. These models can be computed and used independently for a given speaker without being concerned about the source or target speaker. It avoids the use of a parallel speech corpus in the voices of source and target speakers. It is shown that by using the proposed approach, voice fonts can be created and stored which represent individual characteristics of a particular speaker, to be used for customization of synthetic speech. We also show, through objective and subjective tests, that voice conversion quality is comparable to other approaches that require a parallel speech corpus.
IEEE Signal Processing Letters | 2007
Abhijit Karmakar; Arun Kumar; R. K. Patney
A criterion is proposed to obtain an optimal wavelet packet (WP) tree based on the critical band structure of the human auditory system for time-frequency decomposition of speech and audio signals. The criterion minimizes a perceptual cost function based on Zwicker's model of the critical band structure and allocates an optimal number of terminating nodes at different decomposition depths of the WP tree. The criterion is used to obtain the optimal WP tree and the corresponding critical band ordered wavelet packet basis for some typical sampling frequencies.
IETE Journal of Research | 2010
Kartik Audhkhasi; Arun Kumar
This paper proposes a novel two-scale auditory feature based algorithm for non-intrusive evaluation of speech quality. The neuron firing probabilities along the length of the basilar membrane, from an explicit auditory model, are used to extract features from the distorted speech signal. This is in contrast to previous methods, which either use standard vocal tract based features, or incorporate only some aspects of the human auditory perception mechanism. The features are extracted at two scales, namely a global scale spanning all voiced frames in an utterance, and a local scale spanning voiced frames from contiguous voiced segments in the utterance. This is followed by a simple information fusion at the score level using Gaussian Mixture Models (GMMs). The use of an explicit auditory model to extract features is based on the premise that similar processing (in a qualitative sense) happens in human speech perception. In addition, auditory feature extraction at two scales incorporates the effects of both long term and short term distortions on speech quality. The proposed algorithm is shown to perform at least as well as the ITU-T Recommendation P.563.
IEEE Signal Processing Letters | 1997
Arun Kumar; Allen Gersho
A technique for nonlinear prediction of speech via local linear prediction (LLP) is presented and applied to LD-CELP at 16 kbps. With 18th-order backward adaptive LLP for voiced frames, the hybrid LD-CELP coder gives higher segmental signal-to-noise ratio (SNR) compared to a reference version of the ITU-T G.728 LD-CELP algorithm, which has a 50th-order backward adaptive linear predictor. The computational complexity for LLP analysis is significantly less than that of a conventional one-step recursive LLP, and the LLP method gives better prediction gain and a remarkably whiter residual compared to a backward adaptive linear predictor. With an appropriate state space neighborhood for local linear analysis, the short-delay predictor is also able to effectively model long-term correlations without requiring pitch estimation.
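The idea of local linear prediction named in this abstract can be sketched generically: treat the last few samples as a state vector, find past states close to it, and fit a linear predictor on that neighborhood only. The sketch below assumes a Euclidean k-nearest-neighbor rule and an affine least-squares fit. It is not the backward adaptive LLP of the paper and is not tied to LD-CELP; the function name `local_linear_predict` and the parameters `order` and `k` are hypothetical.

```python
import numpy as np

def local_linear_predict(x, order, k):
    """Predict the sample following x[-1] from the k nearest past states.

    A state is a window of `order` consecutive samples; its target is the
    sample that immediately follows the window.
    """
    x = np.asarray(x, dtype=float)
    states = np.array([x[i : i + order] for i in range(len(x) - order)])
    targets = x[order:]
    query = x[-order:]  # most recent state, whose successor is unknown
    # Assumed neighborhood rule: k closest past states in Euclidean distance.
    nearest = np.argsort(np.linalg.norm(states - query, axis=1))[:k]
    # Affine least-squares fit restricted to the neighborhood.
    design = np.column_stack([states[nearest], np.ones(len(nearest))])
    coeffs, *_ = np.linalg.lstsq(design, targets[nearest], rcond=None)
    return float(np.append(query, 1.0) @ coeffs)

# Toy usage: a pure sinusoid obeys an exact second-order linear recurrence,
# so the prediction should closely match the true next sample.
samples = np.sin(0.2 * np.arange(300))
prediction = local_linear_predict(samples, order=2, k=30)
true_next = np.sin(0.2 * 300)
```

Because neighbors are chosen by similarity of state rather than by recency, states from earlier pitch periods can contribute to the fit, which is one way to read the abstract's remark about modeling long-term correlations without pitch estimation.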
IEEE Transactions on Signal Processing | 1992
Arun Kumar; Daniel R. Fuhrmann; Michael Frazier; Bjorn D. Jawerth
The psi-decomposition of a signal, in which the signal is written as a weighted sum of certain elementary synthesizing functions, is described. The set S of synthesizing functions consists of dilated and translated copies of two parent functions, which are concentrated in both the time and the frequency domains. The weighting constants in the psi-decomposition define a transform called the phi-transform. The phi-transform of a signal captures both the frequency content and the temporal evolution of a nonstationary signal. The phi-transform is linear, continuous, and continuously invertible. The set S of synthesizing functions used in the psi-decomposition is nonorthogonal, hence considerable flexibility is permitted in its construction. It is shown with the help of two examples that the set S is easy to construct.
Philosophical Magazine | 2011
Prasenjit Khanikar; Arun Kumar; Anandh Subramaniam
Two conditions under which image forces become significant are when a dislocation is close to a surface (or interface) or when the dislocation is in a nanocrystal. This investigation pertains to the calculation of image forces under these circumstances. A simple edge dislocation is simulated using the finite element method (FEM) by feeding in the appropriate stress-free strains in idealised domains, corresponding to the introduction of an extra half-plane of atoms. Following basic validation of the new model, the energy of the system as a function of the position of the simulated dislocation is plotted and the gradient of the curve gives the image force. The reduction in energy of the system arises from two aspects: firstly, due to the position of the dislocation in the domain and, secondly, due to deformations to the domain (/surfaces). The second aspect becomes important when the dislocation is positioned near a free-surface or in nanocrystals and can be calculated using the current methodology without constructing fictitious images. It is to be noted that domain deformations have been ignored in the standard theories for the calculation of image forces and, hence, they give erroneous results (magnitude and/or direction) whenever image forces play an important role. An important point to be noted is that, under certain circumstances, where domain deformations occur in the presence of an edge dislocation, the ‘image force’ can be negative (attractive), zero or even positive (repulsive). The current model is extended to calculate image forces based on the usual concept of an ‘image dislocation’.
International Journal of Speech Technology | 2013
Rajesh Kumar Dubey; Arun Kumar
Quality estimation of speech is essential for monitoring and maintenance of the quality of service at different nodes of modern telecommunication networks. It is also required in the selection of codecs in speech communication systems. There is no requirement of the original clean speech signal as a reference in non-intrusive speech quality evaluation, and thus it is of importance in evaluating the quality of speech at any node of the communication network. In this paper, non-intrusive speech quality assessment of narrowband speech is done by Gaussian Mixture Model (GMM) training using several combinations of auditory perception and speech production features, which include principal components of Lyon’s auditory model features, MFCC, LSF and their first and second differences. Results are obtained and compared for several combinations of auditory features for three sets of databases. The results are also compared with ITU-T Recommendation P.563 for non-intrusive speech quality assessment. It is found that many combinations of these feature sets outperform the ITU-T P.563 Recommendation under the test conditions.
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Abhijit Karmakar; Arun Kumar; R. K. Patney
This paper proposes a multiresolution model of auditory excitation pattern and applies it to the problem of objective evaluation of subjective wideband speech quality. The model uses wavelet packet transform for time-frequency decomposition of the input signal. The selection of the wavelet packet tree is based on an optimality criterion formulated to minimize a cost function based on the critical band structure. The models of the different auditory phenomena are reformulated for the multiresolution framework. This includes the proposition of duration dependent outer and middle ear weighting, multiresolution spectral spreading, and multiresolution temporal smearing. As an application, the excitation pattern is used to define an objective measure of auditory distortion of a distorted speech signal compared to the undistorted one. The performance of this objective measure is evaluated with a database of various kinds of NOISEX-92 degraded wideband speech signals in predicting the subjective mean opinion score (MOS) and is compared with the fast Fourier transform (FFT)-based ITU-T PESQ P.862.2 algorithm. The proposed measure is found to achieve a correlation between subjective MOS and objective MOS comparable to that of PESQ P.862.2, with a trend suggesting better correlation for the nonstationary degradations compared to the stationary ones. Further refinement of the measure for distortion types other than additive noise is anticipated.
