
Publication


Featured research published by Soundararajan Srinivasan.


Speech Communication | 2006

Binary and Ratio Time-frequency Masks for Robust Speech Recognition

Soundararajan Srinivasan; Nicoleta Roman; DeLiang Wang

A time-varying Wiener filter specifies the ratio of the energy of a target signal to that of a noisy mixture in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with a missing-data recognizer that operates in the spectral domain using the time-frequency units that are dominated by speech. To apply the missing-data recognizer, the same binaural processor is used to estimate an ideal binary time-frequency mask, which selects a local time-frequency unit if the speech signal within the unit is stronger than the interference. We find that the missing-data recognizer performs better on a small-vocabulary recognition task, but the conventional recognizer performs substantially better when the vocabulary size is increased.
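
To make the two mask definitions concrete, here is a minimal sketch assuming oracle access to the target and noise power spectrograms; the variable names and the 0 dB local criterion are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def ratio_mask(target_power, noise_power, eps=1e-12):
    """Wiener-style ratio mask: target power over mixture power per T-F unit."""
    return target_power / (target_power + noise_power + eps)

def ideal_binary_mask(target_power, noise_power, lc_db=0.0):
    """Select a T-F unit if the local SNR exceeds the criterion (assumed 0 dB)."""
    snr_db = 10.0 * np.log10((target_power + 1e-12) / (noise_power + 1e-12))
    return (snr_db > lc_db).astype(float)

# Toy power spectrograms (frequency bins x time frames)
rng = np.random.default_rng(0)
T = rng.random((64, 100))          # target
N = rng.random((64, 100))          # noise
mixture = T + N

enhanced = ratio_mask(T, N) * mixture     # input to the cepstral-domain recognizer
reliable = ideal_binary_mask(T, N)        # input to the missing-data recognizer
```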


Computer Speech & Language | 2010

A computational auditory scene analysis system for speech segregation and robust speech recognition

Yang Shao; Soundararajan Srinivasan; Zhaozhang Jin; DeLiang Wang

A conventional automatic speech recognizer does not perform well in the presence of multiple sound sources, while human listeners are able to segregate and recognize a signal of interest through auditory scene analysis. We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask, which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting masks are used in an uncertainty decoding framework for automatic speech recognition. We evaluate our system on a speech separation challenge and show that our system yields substantial improvement over the baseline performance.
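
The harmonicity cue used in the first stage can be illustrated with a small sketch: a T-F unit is grouped with a pitch track if its response is periodic at that pitch lag. This is a simplification of the paper's multipitch-tracking stage, with all parameters assumed for illustration.

```python
import numpy as np

def unit_matches_pitch(frame, pitch_lag, threshold=0.8):
    """Label a filter-channel frame as consistent with a pitch track if its
    normalized autocorrelation has a strong peak at the pitch lag."""
    n = len(frame)
    ac = np.correlate(frame, frame, mode="full")[n - 1:]
    ac = ac / (n - np.arange(n))              # unbiased: undo window shrinkage
    return ac[pitch_lag] / (ac[0] + 1e-12) > threshold

fs = 16000
t = np.arange(320) / fs                       # one 20 ms frame at 16 kHz
voiced = np.sin(2 * np.pi * 100 * t)          # toy 100 Hz periodic response
print(unit_matches_pitch(voiced, pitch_lag=fs // 100))   # True: lag = 160 samples
```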


International Conference on Acoustics, Speech, and Signal Processing | 2009

An auditory-based feature for robust speech recognition

Yang Shao; Zhaozhang Jin; DeLiang Wang; Soundararajan Srinivasan

A conventional automatic speech recognizer does not perform well in the presence of noise, while human listeners are able to segregate and recognize speech in noisy conditions. We study a novel feature based on an auditory periphery model for robust speech recognition. Specifically, gammatone frequency cepstral coefficients are derived by applying cepstral analysis to gammatone filterbank responses. Our evaluations show that the proposed feature performs considerably better than conventional acoustic features. We further demonstrate that integrating the proposed feature with a computational auditory scene analysis system yields promising recognition performance.
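
A minimal sketch of GFCC-style feature extraction follows, assuming SciPy's gammatone filter design (available since SciPy 1.6); the channel spacing, frame length, and cubic-root compression are common GFCC choices and are assumptions here, not necessarily the paper's exact configuration.

```python
import numpy as np
from scipy.signal import gammatone, lfilter
from scipy.fft import dct

def gfcc(signal, fs=16000, n_channels=32, frame_len=160, n_ceps=13):
    """Gammatone filterbank -> per-frame RMS -> cubic-root compression -> DCT."""
    cfs = np.geomspace(50, 0.45 * fs, n_channels)    # assumed channel spacing
    n_frames = len(signal) // frame_len
    energies = np.empty((n_channels, n_frames))
    for c, cf in enumerate(cfs):
        b, a = gammatone(cf, "iir", fs=fs)           # 4th-order IIR gammatone
        response = lfilter(b, a, signal)
        frames = response[: n_frames * frame_len].reshape(n_frames, frame_len)
        energies[c] = np.sqrt(np.mean(frames ** 2, axis=1))
    compressed = np.cbrt(energies)                   # loudness-like compression
    return dct(compressed, axis=0, norm="ortho")[:n_ceps]

features = gfcc(np.random.default_rng(0).standard_normal(16000))
print(features.shape)    # (13, 100): 13 coefficients for each 10 ms frame
```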


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Transforming Binary Uncertainties for Robust Speech Recognition

Soundararajan Srinivasan; DeLiang Wang

Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time-frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation smears the information from the noise-dominant time-frequency regions across all the cepstral features. We propose a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance across various noise conditions.
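
The learned mapping can be sketched as follows, with scikit-learn's regression trees as a stand-in for the paper's learner and synthetic training targets in place of the stereo training data the paper derives them from.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_frames, n_spec, n_cep = 5000, 64, 13

# Spectral-domain uncertainty (e.g. derived from an estimated binary mask)
# and synthetic cepstral-variance targets standing in for real training data.
spec_unc = rng.random((n_frames, n_spec))
cep_var = np.abs(spec_unc @ rng.standard_normal((n_spec, n_cep)))

# One regression tree per cepstral dimension learns the nonlinear transform.
trees = [DecisionTreeRegressor(max_depth=8).fit(spec_unc, cep_var[:, d])
         for d in range(n_cep)]

# At decode time the predicted variances accompany the enhanced cepstra,
# letting the recognizer down-weight unreliable feature dimensions.
frame = rng.random((1, n_spec))
predicted_var = np.array([t.predict(frame)[0] for t in trees])
```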


Wearable and Implantable Body Sensor Networks | 2010

Multisensor Fusion in Smartphones for Lifestyle Monitoring

Raghu K. Ganti; Soundararajan Srinivasan; Aca Gacic

Smartphones with diverse sensing capabilities are becoming widely available and pervasive in use. With the phone becoming a mobile personal computer, integrated applications can use multisensor data to derive information about the user's actions and the context in which these actions occur. This paper develops a novel method to assess daily living patterns using a smartphone equipped with microphones and inertial sensors. We develop a feature-space combination approach for fusion of information from sensors sampled at different rates and present a computationally lightweight algorithm to identify various high-level activities. Preliminary results from an initial deployment among eight users indicate the potential for accurate, context-aware, and personalized sensing.
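
A minimal sketch of the feature-space combination idea, assuming illustrative sample rates, a one-second window, and simple per-window statistics rather than the paper's actual features:

```python
import numpy as np

def window_features(samples):
    """Simple per-window statistics for a single-channel stream."""
    return np.array([samples.mean(), samples.std(),
                     np.abs(np.diff(samples)).mean()])

def fuse(audio, accel, fs_audio=8000, fs_accel=50, window_s=1.0):
    """Concatenate per-window features from streams at different sample rates."""
    n_win = int(min(len(audio) / fs_audio, len(accel) / fs_accel) / window_s)
    fused = []
    for w in range(n_win):
        a = audio[int(w * window_s * fs_audio): int((w + 1) * window_s * fs_audio)]
        m = accel[int(w * window_s * fs_accel): int((w + 1) * window_s * fs_accel)]
        fused.append(np.concatenate([window_features(a), window_features(m)]))
    return np.vstack(fused)

rng = np.random.default_rng(0)
features = fuse(rng.standard_normal(8000 * 10), rng.standard_normal(50 * 10))
print(features.shape)    # (10, 6): one fused vector per one-second window
```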


Journal of the Acoustical Society of America | 2006

Binaural segregation in multisource reverberant environments

Nicoleta Roman; Soundararajan Srinivasan; DeLiang Wang

In a natural environment, speech signals are degraded by both reverberation and concurrent noise sources. While human listening is robust under these conditions using only two ears, current two-microphone algorithms perform poorly. The psychological process of figure-ground segregation suggests that the target signal is perceived as a foreground while the remaining stimuli are perceived as a background. Accordingly, the goal is to estimate an ideal time-frequency (T-F) binary mask, which selects the target if it is stronger than the interference in a local T-F unit. In this paper, we propose a binaural segregation system that extracts the reverberant target signal from multisource reverberant mixtures using only the location of the target source. The proposed system combines target cancellation through adaptive filtering and a binary decision rule to estimate the ideal T-F binary mask. The main observation in this work is that the target attenuation in a T-F unit resulting from adaptive filtering is correlated with the relative strength of target to mixture. A comprehensive evaluation shows that the proposed system results in large SNR gains. In addition, comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.
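
The target-cancellation idea can be sketched with an NLMS adaptive filter; the toy scenario, filter parameters, and the frame-level (rather than per-T-F-unit) attenuation decision are all assumptions for illustration.

```python
import numpy as np

def nlms_cancel(ref, primary, order=16, mu=0.5, eps=1e-6):
    """Adaptively predict `primary` from `ref`; the residual has the
    spatially fixed target cancelled, leaving mostly interference."""
    w = np.zeros(order)
    residual = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        x = ref[n - order + 1: n + 1][::-1]   # current and past reference samples
        e = primary[n] - w @ x
        w += mu * e * x / (x @ x + eps)
        residual[n] = e
    return residual

# Toy scenario: the target reaches both microphones identically, while the
# interference reaches only one, so cancellation removes the target alone.
rng = np.random.default_rng(0)
target = rng.standard_normal(4000)
interference = rng.standard_normal(4000)
left, right = target, target + 0.5 * interference

residual = nlms_cancel(left, right)
# Frame-level attenuation as a crude stand-in for the per-T-F-unit decision:
frame = lambda s: s[:3900].reshape(-1, 100)
attenuation = frame(residual).var(axis=1) / (frame(right).var(axis=1) + 1e-12)
target_dominant = attenuation < 0.5       # strong cancellation => target-dominant
```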


International Conference on Acoustics, Speech, and Signal Processing | 2007

Incorporating Auditory Feature Uncertainties in Robust Speaker Identification

Yang Shao; Soundararajan Srinivasan; DeLiang Wang

Conventional speaker recognition systems perform poorly under noisy conditions. Recent research suggests that binary time-frequency (T-F) masks are a promising front-end for robust speaker recognition. In this paper, we propose novel auditory features based on an auditory periphery model, and show that these features capture significant speaker characteristics. Additionally, we estimate uncertainties of the auditory features based on binary T-F masks, and calculate speaker likelihood scores using uncertainty decoding. Our approach achieves substantial performance improvement in a speaker identification task compared with a state-of-the-art robust front-end in a wide range of signal-to-noise conditions.
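
Uncertainty decoding with a diagonal-covariance Gaussian mixture model can be sketched as below: the estimated feature uncertainty is added to the model variances at scoring time, so unreliable features yield flatter, less discriminative likelihoods. All model parameters here are toy values.

```python
import numpy as np

def gmm_log_likelihood(x, var_x, weights, means, variances):
    """Score one frame under a diagonal GMM, adding the per-dimension
    feature uncertainty var_x to the model variances (uncertainty decoding)."""
    total_var = variances + var_x
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * total_var), axis=1)
    log_quad = -0.5 * np.sum((x - means) ** 2 / total_var, axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_norm + log_quad)

rng = np.random.default_rng(0)
K, D = 8, 13                                   # toy model: 8 components, 13 dims
weights = np.full(K, 1.0 / K)
means, variances = rng.standard_normal((K, D)), np.ones((K, D))

x = rng.standard_normal(D)
print(gmm_log_likelihood(x, np.zeros(D), weights, means, variances))       # certain
print(gmm_log_likelihood(x, 5.0 * np.ones(D), weights, means, variances))  # uncertain
```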


International Conference of the IEEE Engineering in Medicine and Biology Society | 2007

Towards automatic detection of falls using wireless sensors

Soundararajan Srinivasan; Jun Han; Dhananjay Lal; Aca Gacic

Accurate detection of falls leading to injury is essential for providing timely medical assistance. In this paper, we describe a wireless sensor network system for automatic fall detection. To detect falls, we use a combination of a body-worn triaxial accelerometer and motion detectors placed in the monitored area. While the accelerometer provides information about body motion during a fall, the motion detectors monitor the general presence or absence of motion. Data from all sensors are transmitted wirelessly using the IEEE 802.15.4 protocol to a central node for processing. We use an implementation of a carrier sense multiple access with collision avoidance (CSMA/CA) scheme for channel reuse. A simple forwarding scheme is used to provide extended coverage for a home environment. Fall detection is accomplished by a two-stage algorithm that utilizes the triaxial acceleration and the motion data sequentially. In the first stage, the algorithm detects plausible falls using a measure of normalized energy expenditure computed from the dynamic acceleration values. In the second stage, falls are confirmed based on the absence of motion. Systematic evaluation on simulated falls using 15 adult subjects shows that the proposed system provides a highly promising solution for real-time fall detection.
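
A minimal sketch of the two-stage decision, with assumed thresholds and an assumed motion-query interface (the hypothetical motion_after callback stands in for the motion-detector data):

```python
import numpy as np

def normalized_energy(ax, ay, az, g=9.81):
    """Energy of the dynamic (gravity-removed) acceleration magnitude."""
    magnitude = np.sqrt(ax ** 2 + ay ** 2 + az ** 2)
    return np.mean((magnitude - g) ** 2) / g ** 2

def detect_fall(accel_frames, motion_after, energy_threshold=2.0):
    """Stage 1: flag frames with high normalized energy expenditure.
    Stage 2: confirm a fall only if no motion follows the candidate frame."""
    for i, (ax, ay, az) in enumerate(accel_frames):
        if normalized_energy(ax, ay, az) > energy_threshold:
            if not motion_after(i):       # motion detectors report no movement
                return i                  # index of the confirmed fall frame
    return None
```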


Speech Communication | 2005

A schema-based model for phonemic restoration

Soundararajan Srinivasan; DeLiang Wang

Phonemic restoration is the perceptual synthesis of masked phonemes, using linguistic context, when they are replaced by appropriate sounds. Current models that attempt acoustic restoration of phonemes, however, use only temporal continuity; they restore unvoiced phonemes poorly and are limited in their ability to restore voiced phonemes. We present a schema-based model for phonemic restoration. The model employs a missing-data speech recognition system to decode speech based on intact portions and activates word templates corresponding to the words containing the masked phonemes. An activated template is dynamically time warped to the noisy word and is then used to restore the speech frames corresponding to the masked phoneme, thereby synthesizing it. The model is able to restore both voiced and unvoiced phonemes with a high degree of naturalness. Systematic testing shows that this model outperforms a Kalman-filter-based model.
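
The dynamic time warping step can be sketched with the textbook algorithm; the local distance and step pattern are standard choices, not necessarily the paper's.

```python
import numpy as np

def dtw_path(template, noisy):
    """Classic O(n*m) DTW over frame feature vectors; returns the alignment."""
    n, m = len(template), len(noisy)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - noisy[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m                 # backtrack from the final cell
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

# Masked frames of the noisy word can then be resynthesized from the
# template frames aligned to them by the path.
rng = np.random.default_rng(0)
path = dtw_path(rng.standard_normal((20, 13)), rng.standard_normal((25, 13)))
```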


Journal of the Acoustical Society of America | 2008

A model for multitalker speech perception

Soundararajan Srinivasan; DeLiang Wang

A listener's ability to understand a target speaker in the presence of one or more simultaneous competing speakers is subject to two types of masking: energetic and informational. Energetic masking takes place when target and interfering signals overlap in time and frequency, rendering portions of the target inaudible. Informational masking occurs when the listener is unable to distinguish the target from the interference, even though both are audible. A computational model of multitalker speech perception is presented to account for both types of masking. Human perception in the presence of energetic masking is modeled using a speech recognizer that treats the masked time-frequency units of the target as missing data. The effects of informational masking are modeled as errors in target segregation by a speech separation system. In a systematic evaluation, the performance of the proposed model is in broad agreement with the results of a recent perceptual study.
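
The missing-data treatment of energetic masking can be sketched for a single diagonal-Gaussian speech model: the frame is scored on the reliable (target-dominant) T-F units only, marginalizing the masked ones. Parameters are toy values, and plain marginalization is shown for simplicity (missing-data recognizers may also bound the masked units).

```python
import numpy as np

def missing_data_ll(x, reliable, mean, var):
    """Log-likelihood over reliable dimensions only; masked units marginalized."""
    r = np.asarray(reliable, dtype=bool)
    return -0.5 * np.sum(np.log(2 * np.pi * var[r]) + (x[r] - mean[r]) ** 2 / var[r])

rng = np.random.default_rng(0)
D = 64
x, mean, var = rng.standard_normal(D), np.zeros(D), np.ones(D)
mask = rng.random(D) > 0.4       # 1 = target-dominant (reliable) T-F unit
print(missing_data_ll(x, mask, mean, var))
```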

Collaboration


Dive into Soundararajan Srinivasan's collaboration.

Top Co-Author: Yang Shao (Ohio State University)