Sorin Dusan
Rutgers University
Publications
Featured research published by Sorin Dusan.
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Sorin Dusan; James L. Flanagan; Amod Karve; Mridul Balaraman
Methods for speech compression aim at reducing the transmission bit rate while preserving the quality and intelligibility of speech. These objectives are antipodal in nature, since higher compression presupposes preserving less information about the original speech signal. This paper presents a method for compressing speech based on polynomial approximations of the trajectories in time of various speech features (i.e., spectrum, gain, and pitch). The compression method can be integrated into frame-based speech coders and can also be applied to features that can be represented as temporal series longer in duration than the frame interval. Theoretical issues and experimental results regarding this type of compression are addressed in this paper. Experimental implementation into a 2400 b/s standard speech coder is reported, along with objective and subjective evaluations of operation in various noise environments. The new speech coder operates at a transmission rate of 1533 b/s and, for all noisy conditions tested, performs better than the 2400 b/s standard speech coder.
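As a rough illustration of the idea (not the paper's actual coder), the sketch below fits a least-squares polynomial to one feature trajectory over a segment of frames and reconstructs it; the segment length, polynomial order, and toy values are assumptions made for the example.

```python
import numpy as np

def compress_trajectory(feature_track, order):
    """Fit a least-squares polynomial to one feature trajectory over a segment.

    feature_track : 1-D array of a single speech feature (e.g., one spectral
                    parameter, gain, or pitch value) sampled once per frame.
    order         : polynomial order P; the segment is then represented by
                    P + 1 coefficients instead of len(feature_track) values.
    """
    frames = np.arange(len(feature_track))
    return np.polyfit(frames, feature_track, order)

def decompress_trajectory(coeffs, num_frames):
    """Reconstruct per-frame feature values from the polynomial coefficients."""
    return np.polyval(coeffs, np.arange(num_frames))

# Toy example: a 10-frame segment of one feature trajectory, order-5 polynomial.
track = np.array([0.12, 0.14, 0.18, 0.21, 0.22, 0.22, 0.20, 0.17, 0.15, 0.14])
coeffs = compress_trajectory(track, order=5)        # 6 numbers to transmit
recon = decompress_trajectory(coeffs, len(track))   # 10 frames reconstructed
print(np.max(np.abs(track - recon)))                # approximation error
```

In a real coder the coefficients would still have to be quantized, and one such fit is needed per feature trajectory.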
International Conference on Multimodal Interfaces | 2002
Sorin Dusan; James L. Flanagan
Communicating by voice with speech-enabled computer applications based on preprogrammed rule grammars suffers from constrained vocabulary and sentence structures. Deviations from the allowed language result in an unrecognized utterance that will not be understood and processed by the system. One way to alleviate this restriction is to allow the user to expand the language the computer recognizes and understands by teaching the computer system new language knowledge. We present an adaptive dialog system capable of learning new words, phrases, and sentences, and their corresponding meanings, from users. User input incorporates multiple modalities, including speaking, typing, pointing, drawing, and image capturing. The allowed language can thus be expanded in real time by users according to their preferences. By acquiring new language knowledge, the system becomes more capable in specific tasks, although its language is still constrained.
International Conference on Acoustics, Speech, and Signal Processing | 2006
Jun Hou; Lawrence R. Rabiner; Sorin Dusan
In this paper we discuss the design and implementation of the ASAT front end processing system, whose goal is to convert the speech waveform into a range of measurements and parameters which are then combined to form probabilistic attributes. The ASAT front end processing module utilizes a range of spectral and temporal speech parameters as input to a set of neural network classifiers to create sets of attribute probability lattices, based on either single frames or blocks of frames (segments). We test this architecture by using the 14 Sound Patterns of English (SPE) features as speech attributes. Without balancing the training data, the detection accuracies of 4 of the SPE features are above 90%, 2 features obtain between 80% and 90% detection accuracy, and 8 features have detection accuracies below 80%. With a novel method of balancing the feature training data, the performance of the neural networks improves significantly, with 6 features having detection accuracies above 90% and the remaining 8 features having detection accuracies above 80%.
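The abstract does not spell out the balancing method, so the sketch below shows one generic way to balance a binary attribute training set, random oversampling of the minority class; the function name and the NumPy-based setup are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def balance_by_oversampling(features, labels, seed=0):
    """Balance a binary-labelled training set by oversampling the minority class.

    features : 2-D array, one row per training frame or segment.
    labels   : 1-D array of 0/1 attribute labels (attribute absent/present).
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw minority examples with replacement until both classes are equal in size.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    rng.shuffle(keep)
    return features[keep], labels[keep]
```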
Speech Communication | 2007
Sorin Dusan
Many previous studies suggested that the information necessary for the identification of vowels from continuous speech is distributed both within and outside vowel boundaries. This information appears to be embedded in the speech signal in the form of various acoustic cues or patterns: spectral, energy, static, dynamic, and temporal. In a recent paper we identified seven types of acoustic patterns that might be exploited by listeners in the identification of coarticulated vowels. The current paper extends the previous study and quantifies the relevance for vowel classification of eight types of acoustic patterns, including static spectral patterns, dynamical spectral patterns, and temporal-durational patterns. Four of these eight patterns are not directly exploited by current automatic speech recognition techniques in computing the likelihood of each phonetic model. These four new patterns proved to contain significant vowel information. Two of these four new patterns represent static spectral patterns lying outside of the currently accepted boundaries of vowels, whereas one is a double-slope dynamical pattern and another is a simple durational pattern. The findings of this paper may be important both for automatic speech recognition models and for models of vowel/phoneme perception by humans.
Journal of the Acoustical Society of America | 2004
Sorin Dusan
In fluent speech, one would expect the transition between two successive acoustic targets to exhibit either a monotonically increasing or a monotonically decreasing trajectory in the time domain of each spectral parameter. Closer examination reveals that this is not always the case. This study investigates the existence of non‐monotonic trajectories found in the acoustic spectral domain at the transition between some successive phonemes. This non‐monotonic behavior consists of transitional regions that exhibit either increasing and then decreasing or decreasing and then increasing trajectories for some acoustic spectral parameters. Using the superposition principle, the training part of the TIMIT acoustic‐phonetic database is used to build 2471 diphone trajectory models, based on the 61 symbols used in the database for phonetic transcription. Various non‐monotonic trajectories are found in some of these models for some spectral representations, including the linear predictive coding (LPC) parameters and the...
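As a small illustration of what "non-monotonic" means for a single parameter trajectory, the sketch below flags trajectories that both rise and fall within a transition; the tolerance and toy values are assumptions for the example, not the paper's procedure.

```python
import numpy as np

def is_non_monotonic(trajectory, tol=1e-6):
    """Return True if a 1-D parameter trajectory both rises and falls.

    trajectory : values of one spectral parameter (e.g., one LPC-derived
                 coefficient) sampled across a phoneme-to-phoneme transition.
    tol        : small tolerance so tiny numerical wiggles are not counted.
    """
    diffs = np.diff(np.asarray(trajectory, dtype=float))
    return bool(np.any(diffs > tol) and np.any(diffs < -tol))

# Toy transition that first increases and then decreases.
print(is_non_monotonic([0.10, 0.15, 0.22, 0.19, 0.14]))  # True
```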
Journal of the Acoustical Society of America | 2002
Sorin Dusan; James L. Flanagan
Speech has become increasingly important in human–computer interaction. Spoken dialog interfaces rely on automatic speech recognition, speech synthesis, language understanding, and dialog management. A main issue with dialog systems is that they are typically limited to pre‐programmed vocabularies and sets of sentences. The research reported here focuses on developing an adaptive spoken dialog interface capable of acquiring new linguistic units and their corresponding semantics during human–computer interaction. The adaptive interface identifies unknown words and phrases in the user's utterances and asks the user for the corresponding semantics. The user can provide the meaning or the semantic representation of the new linguistic units through multiple modalities, including speaking, typing, pointing, touching, or showing. The interface then stores the new linguistic units in a semantic grammar and creates new objects defining the corresponding semantic representation. This process takes place during nat...
Journal of the Acoustical Society of America | 2002
Sorin Dusan; James L. Flanagan
In the U.S. Federal Standard coder for 2400 bps, a data frame containing 54 bits of encoded signal is transmitted every 22.5 ms. In each frame, 25 bits encode the spectral features (10 Line Spectrum Frequencies—LSF). In this paper we describe a method for reducing the transmission rate while preserving most of the quality and intelligibility. This method is based on modeling the spectral trajectories with polynomial functions and on encoding these functions for segments of speech extending over multiple frames. Here 10 polynomials are computed by fitting them to the 10 LSF trajectories in the least‐squares sense. Then the polynomial coefficients are encoded for the whole segment instead of directly encoding the LSF vectors. The spectral parameters are thus reduced (compressed) to [(P+1)/N]×100%, where P represents the order of the polynomials and N the number of frames for each segment. Different compression rates can be achieved. For example, for P=5 and N=10 the spectral features are encoded using 40% l...
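A back-of-envelope check of the quoted ratio, under the simplifying assumption that a polynomial-coefficient vector is quantized with the same number of bits as one LSF vector; only the frame period, the 25 spectral bits per frame, and the P=5, N=10 example come from the abstract.

```python
# Spectral share of the 2400 b/s Federal Standard coder, as given in the abstract.
frame_period_s = 0.0225        # one 54-bit frame every 22.5 ms
spectral_bits_per_frame = 25   # bits used for the 10 LSFs

P, N = 5, 10                   # polynomial order, frames per segment
ratio = (P + 1) / N            # fraction of spectral data kept: 0.6

compressed_spectral_rate = ratio * spectral_bits_per_frame / frame_period_s
print(ratio, compressed_spectral_rate)   # 0.6, ~666.7 b/s instead of ~1111.1 b/s
```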
Journal of the Acoustical Society of America | 2004
Mridul Balaraman; Sorin Dusan; James L. Flanagan
Traditional speech recognition systems use mel‐frequency cepstral coefficients (MFCCs) as acoustic features. The present research aims to study the classification characteristics and the performance of some supplementary features (SFs), such as periodicity, zero crossing rate, log energy, and the ratio of low-frequency energy to total energy, in a phone recognition system built using the Hidden Markov Model Toolkit. To demonstrate the performance of the SFs, training is done on a subset of the TIMIT database (the DR1 data set) on context-independent phones using a single mixture. When only the SFs and their first derivatives (a feature set of dimension 8) are used, the recognition accuracy is found to be 42.96%, as compared to 54.65% when 12 MFCCs and their corresponding derivatives are used. The performance of the system improves to 56.49% when the SFs and their derivatives are used along with the MFCCs. A further improvement to 60.34% is observed when the last 4 MFCCs and their derivatives are replaced by SFs and...
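The sketch below shows one plausible per-frame computation of the four supplementary features; the autocorrelation-based periodicity measure and the 1 kHz low-frequency cutoff are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def supplementary_features(frame, sample_rate, low_cutoff_hz=1000.0):
    """Compute four supplementary features for one speech frame."""
    frame = np.asarray(frame, dtype=float)

    # Zero crossing rate: fraction of adjacent samples with opposite sign.
    zcr = float(np.mean(np.signbit(frame[1:]) != np.signbit(frame[:-1])))

    # Log energy, floored so silent frames do not produce log(0).
    energy = float(np.sum(frame ** 2))
    log_energy = float(np.log(max(energy, 1e-10)))

    # Periodicity: largest normalized autocorrelation value away from lag 0
    # (a crude voicing measure).
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    periodicity = float(np.max(ac[1:]) / ac[0]) if ac[0] > 0 else 0.0

    # Ratio of low-frequency energy (below the cutoff) to total energy.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    low_ratio = float(spectrum[freqs < low_cutoff_hz].sum() / max(spectrum.sum(), 1e-10))

    return np.array([periodicity, zcr, log_energy, low_ratio])
```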
Journal of the Acoustical Society of America | 2004
Sorin Dusan
In an experimental study of identification of truncated Japanese syllables it was stated that a short speech interval (approximately 10 ms) that includes the position of maximum spectral transition between a consonant and a vowel carries the most important information for the perception of the consonant and the syllable [S. Furui, J. Acoust. Soc. Am. 80, 1016–1025 (1986)]. A reduced spectral variability at the transition position could partially explain the increase of information at this position. The current study investigates whether there is a decrease in spectral variability at the transition position between successive phonemes compared with the spectral variability at the phoneme centers. The training part of the TIMIT acoustic‐phonetic database, containing sentences in English from 462 American speakers, is used to build 2471 diphone models, based on the 61 symbols used in the database for phonetic transcription. The variability of the mel‐frequency cepstral coefficients (MFCCs) is evaluated for v...
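The comparison itself reduces to measuring the spread of MFCC vectors pooled at a given time position; a minimal sketch, assuming the per-occurrence MFCC vectors have already been extracted and time-aligned.

```python
import numpy as np

def spectral_variability(mfcc_vectors):
    """Total variance of MFCC vectors taken at one time position.

    mfcc_vectors : 2-D array, one row per occurrence of a diphone (or phoneme),
                   each row being the MFCC vector observed at that position.
    Returns the summed per-coefficient variance (trace of the covariance).
    """
    return float(np.sum(np.var(mfcc_vectors, axis=0)))

# Compare variability at the transition position against the phoneme centers:
# var_transition = spectral_variability(mfccs_at_transition)
# var_center = spectral_variability(mfccs_at_center)
# A smaller var_transition would indicate reduced variability at the transition.
```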
Archive | 2005
Sorin Dusan; James L. Flanagan
It is difficult for a developer to account for all the surface linguistic forms that users might need in a spoken dialogue computer application. In any specific case users might need additional concepts not pre-programmed by the developer. This chapter presents a method for adapting the vocabulary of a spoken dialogue interface at run-time by end-users. The adaptation is based on expanding existing pre-programmed concept classes by adding new concepts in these classes. This adaptation is classified as a supervised learning method in which users are responsible for indicating the concept class and the semantic representation for the new concepts. This is achieved by providing users with a number of rules and ways in which the new language knowledge can be supplied to the computer. Acquisition of new linguistic knowledge at the surface and semantic levels is done using multiple modalities, including speaking, typing, pointing, touching or image capturing. Language knowledge is updated and stored in a semantic grammar and a semantic database.
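A minimal sketch of the bookkeeping such run-time expansion implies: a semantic grammar maps surface phrases to meanings within concept classes, and teaching adds a new pair. The class names, phrases, and data structures here are illustrative assumptions, not the chapter's actual grammar format.

```python
# Pre-programmed concept classes: surface phrase -> semantic representation.
semantic_grammar = {
    "color": {"red": {"rgb": (255, 0, 0)}, "blue": {"rgb": (0, 0, 255)}},
    "shape": {"circle": {"sides": 0}, "square": {"sides": 4}},
}

def teach_concept(grammar, concept_class, phrase, semantics):
    """Expand an existing concept class with a user-supplied phrase and meaning."""
    if concept_class not in grammar:
        raise KeyError(f"unknown concept class: {concept_class}")
    grammar[concept_class][phrase.lower()] = semantics
    return grammar

def interpret(grammar, phrase):
    """Look up a phrase in every concept class; return None if still unknown."""
    for concept_class, entries in grammar.items():
        if phrase.lower() in entries:
            return concept_class, entries[phrase.lower()]
    return None

# The user teaches a new color, supplied by voice, typing, or pointing.
teach_concept(semantic_grammar, "color", "crimson", {"rgb": (220, 20, 60)})
print(interpret(semantic_grammar, "crimson"))   # ('color', {'rgb': (220, 20, 60)})
```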