Shiva Sundaram
Deutsche Telekom
Publications
Featured research published by Shiva Sundaram.
IEEE Signal Processing Magazine | 2012
Tara N. Sainath; Bhuvana Ramabhadran; David Nahamoo; Dimitri Kanevsky; Dirk Van Compernolle; Kris Demuynck; Jort F. Gemmeke; Jerome R. Bellegarda; Shiva Sundaram
Solving real-world classification and recognition problems requires a principled way of modeling the physical phenomena generating the observed data and the uncertainty in it. The uncertainty originates from the fact that many aspects of data generation are influenced by variables that cannot be measured directly, or are too complex to model and hence are treated as random fluctuations. For example, in speech production, uncertainty could arise from vocal tract variations among different people or from corruption by noise. The goal of modeling is to establish a generalization from the set of observed data such that accurate inference (classification, decision, recognition) can be made about data yet to be observed, which we refer to as unseen data.
international conference on acoustics, speech, and signal processing | 2012
Shiva Sundaram; Jerome R. Bellegarda
In recent work, we introduced Latent Perceptual Mapping (LPM) [1], a new framework for acoustic modeling suitable for template-like speech recognition. The basic idea is to leverage a reduced dimensionality description of the observations to derive acoustic prototypes that are closely aligned with perceived acoustic events. Our initial work adopted a bag-of-frames strategy to represent relevant acoustic information within speech segments. In this paper, we extend this approach by better integrating temporal information into the LPM feature extraction. Specifically, we use variable-length units to represent acoustic events at the supra-frame level, in order to benefit from finer temporal alignments when deriving the acoustic prototypes. The outcome can be viewed as a generalization of both conventional template-based approaches and recently proposed sparse representation solutions. This extension is experimentally validated on a context-independent phoneme classification task using the TIMIT corpus.
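The sketch below illustrates the bag-of-frames step mentioned above: frame-level features from all segments are clustered into acoustic prototypes, and each segment is then represented by its histogram over those prototypes. It is a minimal illustration, not the paper's LPM formulation; the data, feature dimensionality, and codebook size are placeholders, and the variable-length supra-frame units introduced in the paper are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: 20 speech segments, each an (n_frames x 13) matrix of MFCC-like features.
segments = [rng.normal(size=(int(rng.integers(50, 120)), 13)) for _ in range(20)]

# Derive acoustic prototypes by clustering all frames (codebook size is an assumption).
all_frames = np.vstack(segments)
n_prototypes = 64
km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0).fit(all_frames)

def bag_of_frames(segment):
    """Normalized histogram of prototype assignments for one segment."""
    counts = np.bincount(km.predict(segment), minlength=n_prototypes)
    return counts / counts.sum()

# Segment-by-prototype matrix; LPM would further reduce its dimensionality.
X = np.array([bag_of_frames(s) for s in segments])
print(X.shape)  # (20, 64)
```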
workshop on applications of signal processing to audio and acoustics | 2009
Samuel Kim; Shrikanth Narayanan; Shiva Sundaram
A new algorithm for content-based audio information retrieval is introduced in this work. Assuming that there exist hidden acoustic topics and that each audio clip is a mixture of those acoustic topics, we propose a topic model that learns a probability distribution over a set of hidden topics for a given audio clip in an unsupervised manner. We use the Latent Dirichlet Allocation (LDA) method for the topic model and introduce the notion of acoustic words to support modeling within this framework. In audio description classification tasks using a Support Vector Machine (SVM) on the BBC database, the proposed acoustic topic model shows promising results, outperforming the Latent Perceptual Indexing (LPI) method in classifying onomatopoeia descriptions and semantic descriptions.
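A rough sketch of the acoustic-topic-model pipeline described above, assuming frame-level features are available per clip: frames are quantized into acoustic words, each clip becomes a document of word counts, LDA infers a topic mixture per clip, and an SVM classifies the mixtures. The codebook size, topic count, labels, and data are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

rng = np.random.default_rng(1)
clips = [rng.normal(size=(200, 13)) for _ in range(30)]   # placeholder frame-level features
labels = rng.integers(0, 3, size=30)                      # hypothetical class labels

# 1) Acoustic words: cluster all frames into a codebook (size is an assumption).
codebook = KMeans(n_clusters=128, n_init=10, random_state=0).fit(np.vstack(clips))

# 2) Each clip becomes a "document" of acoustic-word counts.
docs = np.array([np.bincount(codebook.predict(c), minlength=128) for c in clips])

# 3) Unsupervised LDA yields a topic distribution per clip.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_mix = lda.fit_transform(docs)

# 4) The topic mixtures feed a standard SVM classifier.
clf = SVC(kernel="rbf").fit(topic_mix, labels)
print(clf.score(topic_mix, labels))
```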
international conference on acoustics, speech, and signal processing | 2008
Shiva Sundaram; Shrikanth Narayanan
We present a query-by-example audio retrieval framework that indexes audio clips in a generic database as points in a latent perceptual space. First, feature vectors extracted from the clips in the database are grouped into reference clusters using an unsupervised clustering technique. An audio clip-to-cluster matrix is constructed by counting the number of features quantized into each of the reference clusters. By singular-value decomposition of this matrix, each audio clip of the database is mapped to a point in the latent perceptual space, which is used to index the retrieval system. Since each of the initial reference clusters represents a specific perceptual quality in a perceptual space (similar to words that represent specific concepts in semantic space), querying by example returns clips that have similar perceptual qualities. Subjective human evaluation indicates about 75% retrieval performance. Evaluation on semantic categories reveals that the system performance is comparable to other proposed methods.
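Below is a minimal sketch of the indexing and retrieval pipeline outlined above (reference clusters, clip-to-cluster counts, SVD projection, nearest-neighbour lookup). Feature extraction, corpus, codebook size, and latent dimensionality are all placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(2)
clips = [rng.normal(size=(150, 20)) for _ in range(40)]   # placeholder per-clip feature vectors

# Reference clusters from all features (unsupervised).
ref = KMeans(n_clusters=100, n_init=10, random_state=0).fit(np.vstack(clips))

# Clip-to-cluster count matrix.
C = np.array([np.bincount(ref.predict(c), minlength=100) for c in clips])

# Low-rank projection into the latent perceptual space.
svd = TruncatedSVD(n_components=20, random_state=0)
Z = svd.fit_transform(C)

def query_by_example(query_clip, top_k=5):
    """Return indices of the clips closest to the query in latent space."""
    q = svd.transform(np.bincount(ref.predict(query_clip), minlength=100)[None, :])
    sims = cosine_similarity(q, Z)[0]
    return np.argsort(-sims)[:top_k]

print(query_by_example(clips[0]))
```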
Journal of the Acoustical Society of America | 2007
Shiva Sundaram; Shrikanth Narayanan
A technique to synthesize laughter based on the time-domain behavior of real instances of human laughter is presented. In the speech synthesis community, interest in improving the expressive quality of synthetic speech has grown considerably. While the focus has been on linguistic aspects, such as precise control of speech intonation to achieve desired expressiveness, the inclusion of nonlinguistic cues could further enhance the expressive quality of synthetic speech. Laughter is one such cue, used to communicate, say, a happy or amusing context. It can be generated in many varieties and qualities: from a short exhalation to a long full-blown episode. Laughter is modeled at two levels: the overall episode level and the local call level. The episode level captures the overall temporal behavior in a parametric model based on the equations that govern the simple harmonic motion of a mass-spring system. By changing a set of easily available parameters, the authors are able to synthesize a variety of laughter. At the call level, the authors relied on a standard linear prediction based analysis-synthesis model. Results of subjective tests to assess the acceptability and naturalness of the synthetic laughter relative to real human laughter samples are presented.
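For illustration only, the snippet below generates an episode-level envelope from damped simple harmonic motion, in the spirit of the mass-spring model described above; the parameter names, values, and peak-picking step are hypothetical and not taken from the paper.

```python
import numpy as np

def laughter_envelope(duration=2.0, fs=100, omega=8.0, zeta=0.15, x0=1.0):
    """Damped harmonic oscillator x(t) = x0 * exp(-zeta*omega*t) * cos(omega_d*t)."""
    t = np.arange(0.0, duration, 1.0 / fs)
    omega_d = omega * np.sqrt(1.0 - zeta**2)   # damped natural frequency
    x = x0 * np.exp(-zeta * omega * t) * np.cos(omega_d * t)
    return t, x

t, env = laughter_envelope()

# Positive local maxima of the oscillation could mark call onsets/strengths;
# each call would then be rendered with an LP-based analysis-synthesis model.
peaks = np.where((env[1:-1] > env[:-2]) & (env[1:-1] > env[2:]) & (env[1:-1] > 0))[0] + 1
print(np.round(t[peaks], 2))
```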
international conference on multimedia and expo | 2008
Shiva Sundaram; Shrikanth Narayanan
Using the recently proposed framework for latent perceptual indexing of audio clips, we present classification of whole clips categorized by two schemes: high-level semantic labels and mid-level, perceptually motivated onomatopoeia labels. First, feature vectors extracted from the clips in the database are grouped into reference clusters using an unsupervised clustering technique. A unit-document co-occurrence matrix is then obtained by quantizing the feature vectors extracted from the audio clips into the reference clusters. The audio clips are then mapped to a latent perceptual space by a reduced-rank approximation of this matrix. Classification experiments are performed in this representation space using the corresponding semantic and onomatopoeic labels of the clips. Using the proposed method, classification accuracy of about 60% was obtained on the BBC sound effects library with over twenty categories. Having the two labeling schemes together in a single framework makes the classification system more flexible, as each scheme addresses the limitations of the other; this complementarity is the main motivation for the work presented here.
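A small sketch of the dual-labeling idea, assuming the latent clip representations have already been computed as in the indexing pipeline: the same representation is classified once against semantic labels and once against onomatopoeia labels. The data, label sets, and choice of logistic regression are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
Z = rng.normal(size=(60, 20))                      # placeholder latent clip representations
semantic = rng.integers(0, 5, size=60)             # hypothetical semantic category ids
onomatopoeia = rng.integers(0, 4, size=60)         # hypothetical onomatopoeia label ids

sem_clf = LogisticRegression(max_iter=1000).fit(Z, semantic)
ono_clf = LogisticRegression(max_iter=1000).fit(Z, onomatopoeia)

# One representation, two complementary views of the same clip.
print(sem_clf.predict(Z[:3]), ono_clf.predict(Z[:3]))
```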
multimedia signal processing | 2009
Ozlem Kalinli; Shiva Sundaram; Shrikanth Narayanan
Automatic classification of real-life, complex, and unstructured acoustic scenes is a challenging task because the acoustic sources present in the audio stream are unknown in number and overlap in time. In this work, we present a novel approach to classifying such unstructured acoustic scenes. Motivated by the bottom-up attention model of the human auditory system, salient events of an audio clip are extracted in an unsupervised manner and presented to the classification system. Similar to latent semantic indexing of text documents, the classification system uses a unit-document frequency measure to index the clip in a continuous, latent space. This allows for a completely class-independent approach to audio classification. Our results on the BBC sound effects library indicate that, using the saliency-driven attention selection approach presented in this paper, a 17.5% relative improvement can be obtained in frame-based classification and a 25% relative improvement can be obtained using the latent audio indexing approach.
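The snippet below sketches only the selection step: keeping the most salient frames of a clip before quantization and indexing. The auditory attention model used in the paper is replaced here by a crude spectral-novelty proxy, purely for illustration.

```python
import numpy as np

def select_salient_frames(features, keep_ratio=0.2):
    """Rank frames by novelty (distance to the previous frame) and keep the top fraction."""
    novelty = np.linalg.norm(np.diff(features, axis=0), axis=1)
    novelty = np.concatenate([[0.0], novelty])
    k = max(1, int(keep_ratio * len(features)))
    keep = np.argsort(-novelty)[:k]
    return features[np.sort(keep)]

rng = np.random.default_rng(4)
clip = rng.normal(size=(300, 13))          # hypothetical frame-level features
salient = select_salient_frames(clip)
print(salient.shape)                        # roughly 20% of the frames survive
```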
international conference on acoustics, speech, and signal processing | 2006
Shrikanth Narayanan; Panayiotis G. Georgiou; Abhinav Sethy; Dagen Wang; Murtaza Bulut; Shiva Sundaram; Emil Ettelaie; Sankaranarayanan Ananthakrishnan; Horacio Franco; Kristin Precoda; Dimitra Vergyri; Jing Zheng; Wen Wang; Ramana Rao Gadde; Martin Graciarena; Victor Abrash; Michael W. Frandsen; Colleen Richey
Engineering automatic speech recognition (ASR) for speech-to-speech (S2S) translation systems, especially targeting languages and domains that do not have readily available spoken language resources, is immensely challenging for a number of reasons. In addition to contending with the conventional data-hungry acoustic and language modeling needs, these designs have to accommodate varying requirements imposed by the domain needs and characteristics, the target device and usage modality (such as phrase-based or spontaneous free-form interactions, with or without visual feedback), and the huge spoken language variability arising from socio-linguistic and cultural differences among users. This paper, using case studies of creating speech translation systems between English and languages such as Pashto and Farsi, describes some of the practical issues and the solutions that were developed for multilingual ASR development. These include novel acoustic and language modeling strategies such as language-adaptive recognition, active-learning-based language modeling, and class-based language models that can better exploit resource-poor language data; efficient search strategies, including N-best and confidence generation to aid multiple-hypothesis translation; the use of dialog information and careful interface choices to facilitate ASR; and audio interface design for meeting both usability and robustness requirements.
international conference on acoustics, speech, and signal processing | 2007
Shiva Sundaram; Shrikanth Narayanan
We present an analysis of clustering audio clips using word descriptions that are imitative of sounds. These onomatopoeia words describe the acoustic properties of sources, and they can be useful in annotating a medium that cannot embed audio (e.g., text). First, an audio-to-word relationship is established by manually tagging a variety of audio clips (from a sound effects library) with onomatopoeia words. Using a newly proposed distance metric for word-level similarities, the feature vectors from the audio are clustered according to their tags, resulting in clusters with similar onomatopoeic descriptions. By discriminant analysis of the clusters at the feature level, we present results on the separability of these clusters. Our results indicate that by using onomatopoeic descriptions alone, meaningful clusters with similar acoustic properties can be formed. However, in terms of audio feature-level representation, clusters formed by word groups such as buzz and fizz are better captured by signal features than those formed by percussive sounds such as clang, clank, and tap.
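A minimal sketch of the separability check described above: feature vectors grouped by onomatopoeia tag are assessed with linear discriminant analysis. The paper's word-level similarity metric for forming the clusters is not reproduced; tags are used directly as group labels, and the data is synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
tags = ["buzz", "fizz", "clang", "tap"]
# Synthetic feature vectors, 50 per tag, with tag-dependent means.
X = np.vstack([rng.normal(loc=i, size=(50, 12)) for i in range(len(tags))])
y = np.repeat(tags, 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
# In-sample accuracy as a rough proxy for how separable the tag clusters are
# at the feature level.
print(lda.score(X, y))
```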
multimedia signal processing | 2010
Samuel Kim; Shiva Sundaram; Panayiotis G. Georgiou; Shrikanth Narayanan
An N-gram modeling approach for unstructured audio signals is introduced with applications to audio information retrieval. The proposed N-gram approach aims to capture local dynamic information in acoustic words within the acoustic topic model framework which assumes an audio signal consists of latent acoustic topics and each topic can be interpreted as a distribution over acoustic words. Experimental results on classifying audio clips from BBC Sound Effects Library according to both semantic and onomatopoeic labels indicate that the proposed N-gram approach performs better than using only a bag-of-words approach by providing complementary local dynamic information.
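The sketch below shows one way to add the local dynamic information described above: acoustic-word bigram counts computed alongside the bag-of-words counts that feed the topic model. The codebook size and data are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
clips = [rng.normal(size=(200, 13)) for _ in range(10)]   # placeholder frame-level features
V = 32                                                    # assumed acoustic-word codebook size
codebook = KMeans(n_clusters=V, n_init=10, random_state=0).fit(np.vstack(clips))

def unigram_bigram_counts(clip):
    """Concatenate acoustic-word unigram and bigram count vectors for one clip."""
    w = codebook.predict(clip)
    uni = np.bincount(w, minlength=V)
    bi = np.zeros(V * V)
    np.add.at(bi, w[:-1] * V + w[1:], 1)                  # index bigram (w_t, w_{t+1})
    return np.concatenate([uni, bi])

X = np.array([unigram_bigram_counts(c) for c in clips])
print(X.shape)                                            # (10, V + V*V)
```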