Sudhamay Maity
Indian Institute of Technology Kharagpur
Publications
Featured research published by Sudhamay Maity.
International Conference on Contemporary Computing | 2009
Shashidhar G. Koolagudi; Sudhamay Maity; Vuppala Anil Kumar; Saswat Chakrabarti; K. Sreenivasa Rao
In this paper, we introduce a speech database for analyzing the emotions present in speech signals. The database is recorded in the Telugu language using professional artists from All India Radio (AIR), Vijayawada, India. The speech corpus is collected by simulating eight different emotions using emotionally neutral statements. The database is named the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC). The proposed database will be useful for characterizing the emotions present in speech. Further, the emotion-specific knowledge present in speech at different levels can be acquired by developing emotion-specific models using features from the vocal tract system, the excitation source and prosody. This paper describes the design, acquisition, post-processing and evaluation of the proposed speech database (IITKGP-SESC). The quality of the emotions present in the database is evaluated using subjective listening tests. Finally, statistical models are developed using prosodic features, and the discrimination of the emotions is carried out by classifying the emotions with the developed statistical models.
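As a rough illustration of the kind of prosodic feature the abstract mentions, the sketch below estimates the fundamental frequency (F0) of a voiced frame by autocorrelation. This is a minimal sketch, not the paper's actual feature extraction; the function name, search range, and test signal are all hypothetical.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=80.0, f0_max=400.0):
    """Estimate the fundamental frequency of a voiced frame by picking the
    autocorrelation peak within the plausible pitch-period lag range."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic 200 Hz tone sampled at 8 kHz: the peak lag is 40 samples,
# so the estimate recovers 200 Hz.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
print(round(estimate_f0(frame, sr)))  # prints 200
```

In practice, per-frame F0 values like these would be pooled into contours and fed to the statistical models the abstract describes.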
International Journal of Speech Technology | 2011
N. P. Narendra; K. Sreenivasa Rao; Krishnendu Ghosh; Ramu Reddy Vempada; Sudhamay Maity
This paper presents the design and development of an unrestricted text-to-speech (TTS) synthesis system for the Bengali language. An unrestricted TTS system is capable of synthesizing good-quality speech in different domains. In this work, syllables are used as the basic units for synthesis. The Festival framework has been used for building the TTS system. Speech collected from a female artist is used as the speech corpus. Initially, speech from five speakers is collected and a prototype TTS is built for each of the five speakers. The best speaker among the five is selected through subjective and objective evaluation of natural and synthesized waveforms. The unrestricted TTS system is then developed by addressing the issues involved at each stage to produce a good-quality synthesizer. Evaluation is carried out in four stages by conducting objective and subjective listening tests on the synthesized speech. At the first stage, the TTS system is built with the basic Festival framework. In the following stages, additional features are incorporated into the system and the quality of synthesis is evaluated. The subjective and objective measures indicate that the proposed features and methods have improved the quality of the synthesized speech from stage 2 to stage 4.
International Journal of Speech Technology | 2013
K. Sreenivasa Rao; Sudhamay Maity; V. Ramu Reddy
This paper explores pitch-synchronous and glottal closure (GC) based spectral features for analyzing the language-specific information present in speech. For determining pitch cycles (for pitch-synchronous analysis) and GC regions, instants of significant excitation (ISE) are used. The ISE correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to random excitations, such as the onset of a burst, in the case of nonvoiced speech. For analyzing the language-specific information in the proposed features, the Indian language speech database (IITKGP-MLILSC) is used. Gaussian mixture models are used to capture the language-specific information from the proposed features. The proposed pitch-synchronous and GC-based spectral features are evaluated through language recognition studies. The evaluation results indicate that language recognition performance is better with pitch-synchronous and GC-based spectral features than with conventional spectral features derived through block processing. GC-based spectral features are found to be more robust against degradations due to background noise. The performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.
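To make the GMM-based recognition step concrete, here is a minimal sketch of how per-language diagonal-covariance GMMs could score a sequence of spectral feature vectors, with the language chosen by highest average log-likelihood. The model parameters and feature values below are toy stand-ins, not trained models from the paper.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, vars_):
    """Log-likelihood of one feature vector under a diagonal GMM (log-sum-exp)."""
    comps = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, vars_)]
    mx = max(comps)
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

def identify(frames, language_models):
    """Return the language whose GMM gives the highest average log-likelihood."""
    def score(model):
        return sum(gmm_loglik(f, *model) for f in frames) / len(frames)
    return max(language_models, key=lambda lang: score(language_models[lang]))

# Toy 2-D two-component models standing in for trained per-language GMMs.
models = {
    "telugu": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "bengali": ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
frames = [[4.2, 3.9], [4.8, 5.1]]
print(identify(frames, models))  # the frames lie near the "bengali" components
```

In the paper's setting, the feature vectors would be the pitch-synchronous or GC-based spectral features rather than these toy values.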
Expert Systems With Applications | 2011
K. Sreenivasa Rao; V. K. Saroj; Sudhamay Maity; Shashidhar G. Koolagudi
In this paper, facial features from video sequences are explored for characterizing emotions. The emotions considered in this study are anger, fear, happiness, sadness and neutral. For carrying out the proposed emotion recognition study, the required video data is collected from the studio of the Center for Education Technology (CET) at the Indian Institute of Technology (IIT) Kharagpur. The dynamic nature of the grey values of the pixels within the eye and mouth regions is used as the feature to capture emotion-specific knowledge from the facial expressions. Multiscale morphological erosion and dilation operations are used to extract features from the eye and mouth regions, respectively. The features extracted from the left eye, right eye and mouth regions are used to develop separate models for each emotion category. Autoassociative neural network (AANN) models are used to capture the distribution of the extracted features. The developed models are validated using subject-dependent and subject-independent emotion recognition studies. The overall performance of the proposed emotion recognition system is observed to be about 87%.
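The multiscale morphological erosion mentioned above can be sketched for a grayscale region as a minimum over growing neighborhoods, with one summary value per scale. This is an illustrative sketch only; the structuring-element shapes, scales, and the way features are summarized here are assumptions, not the paper's exact formulation.

```python
def erode(img, k):
    """Flat grayscale erosion: minimum over a (2k+1) x (2k+1) neighborhood,
    clipped at the image borders."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = min(
                img[rr][cc]
                for rr in range(max(0, r - k), min(h, r + k + 1))
                for cc in range(max(0, c - k), min(w, c + k + 1))
            )
    return out

def multiscale_erosion_feature(region, scales=(1, 2, 3)):
    """Mean grey value of the region after erosion at each scale."""
    feats = []
    for k in scales:
        e = erode(region, k)
        feats.append(sum(map(sum, e)) / (len(e) * len(e[0])))
    return feats

# A bright 5x5 patch with one dark pixel: erosion spreads the dark value
# outward, so the mean grey value decreases as the scale grows.
patch = [[200] * 5 for _ in range(5)]
patch[2][2] = 50
print(multiscale_erosion_feature(patch))  # prints [146.0, 50.0, 50.0]
```

Dilation for the mouth region would be the dual operation (maximum over the neighborhood).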
National Conference on Communications | 2012
K. Sreenivasa Rao; Ketan Pachpande; Ramu Reddy Vempada; Sudhamay Maity
In this paper, we propose a two-stage segmentation approach for splitting TV broadcast news bulletins into sequences of news stories. In the first stage, speaker (news reader) specific characteristics present in the initial headlines of the news bulletin are used for gross-level segmentation. In the second stage, errors in the gross-level segmentation (first stage) are corrected by exploiting the speaker-specific information captured from the individual news stories other than the headlines. During the headlines, the captured speaker-specific information is mixed with background music, and hence the segmentation at the first stage may not be accurate. In this work, speaker-specific information is represented using mel-frequency cepstral coefficients (MFCCs), and it is captured using Gaussian mixture models (GMMs). The proposed two-stage segmentation method is evaluated on ten manually segmented broadcast TV news bulletins. From the evaluation results, it is observed that about 93% of the news stories are correctly segmented, 7% are missed and 11% are spurious.
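The gross-level stage can be pictured as thresholding the news reader's model score over time: where the per-frame log-likelihood under the reader's GMM drops, the reader has stopped speaking and a candidate story boundary is placed. The sketch below operates on precomputed per-frame scores; the values, threshold, and boundary rule are hypothetical stand-ins, not the paper's method.

```python
def gross_boundaries(scores, threshold):
    """Mark candidate story boundaries at downward threshold crossings of
    the news-reader model's per-frame log-likelihood (stand-in values)."""
    boundaries = []
    for i in range(1, len(scores)):
        if scores[i - 1] >= threshold > scores[i]:
            boundaries.append(i)
    return boundaries

# Hypothetical per-frame scores: the reader (high values) alternates with
# field reports (low values); each downward crossing is a candidate boundary.
scores = [-10, -11, -9, -40, -42, -41, -12, -10, -39, -38]
print(gross_boundaries(scores, -20))  # prints [3, 8]
```

The second stage described in the abstract would then refine these candidates with models trained on the story bodies, where the speech is free of background music.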
IEEE India Conference | 2011
N. P. Narendra; K. Sreenivasa Rao; Krishnendu Ghosh; Vempada Ramu Reddy; Sudhamay Maity
This paper discusses the development of a Bengali screen reader using the Festival speech synthesizer. The screen reader is developed with the objective that visually challenged people can use the computer without any difficulty. The usability of the system is checked throughout the development and appropriate modifications are made. An unrestricted Bengali text-to-speech (TTS) synthesis system, which can produce good-quality speech in different domains, is integrated into the screen reader. Pruning of the database is performed to reduce the response time of the screen reader. Finally, subjective evaluation of the screen reader is carried out by visually challenged people for different applications such as web browsing. The results indicate that the developed system can be used by visually challenged people independently, without any external assistance.
Archive | 2015
K. Sreenivasa Rao; V. Ramu Reddy; Sudhamay Maity
This book discusses the impact of spectral features extracted from frame-level processing, glottal closure regions, and pitch-synchronous analysis on the performance of language identification systems. In addition to spectral features, the authors explore prosodic features such as intonation, rhythm, and stress for discriminating languages. They show how the proposed spectral and prosodic features capture language-specific information from two complementary aspects, and how developing a language identification (LID) system using the combination of spectral and prosodic features enhances identification accuracy as well as the robustness of the system. The book provides methods to extract spectral and prosodic features at various levels, and suggests appropriate models for developing robust LID systems with the specific spectral and prosodic features. Finally, the book discusses various combinations of spectral and prosodic features, and the models best suited to enhancing the performance of LID systems.
Archive | 2015
K. Sreenivasa Rao; V. Ramu Reddy; Sudhamay Maity
This chapter introduces a multilingual Indian language speech corpus consisting of 27 regional Indian languages for analyzing language identification (LID) performance. Speaker-dependent and speaker-independent language models are also discussed in view of LID. Spectral features extracted from conventional block processing, pitch-synchronous analysis, and glottal closure regions are examined for discriminating the languages.
Archive | 2015
K. Sreenivasa Rao; V. Ramu Reddy; Sudhamay Maity
In the previous chapter, language-specific spectral features were discussed for language identification (LID). The present chapter mainly focuses on language-specific prosodic features at the syllable, word and global levels for the LID task. For further improving the recognition accuracy of the LID system, the combination of spectral and prosodic features has been explored.
Pattern Recognition and Machine Intelligence | 2009
K. Sreenivasa Rao; Sudhamay Maity; Amol Taru; Shashidhar G. Koolagudi
In this paper, we propose a new method for unit selection in developing a text-to-speech (TTS) system for Hindi. In the proposed method, syllables are used as the basic units for concatenation. Linguistic, positional and contextual features derived from the input text are used at the first level of the unit selection process. The unit selection process is further refined by incorporating prosodic and spectral characteristics at the utterance and syllable levels. The speech corpus considered for this task is broadcast Hindi news read by a male speaker. Synthesized speech from the developed TTS system using the multi-level unit selection criterion is evaluated through listening tests. From the evaluation results, it is observed that the synthesized speech quality improves when the unit selection process is refined using spectral and prosodic features.
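Unit selection of the kind described above is conventionally solved as a Viterbi search that minimizes a summed target cost (how well a candidate unit matches the desired syllable) plus a concatenation cost (how smoothly adjacent units join). The sketch below is a generic illustration of that search under assumed toy costs, not the paper's multi-level criterion; the unit representation and cost functions are hypothetical.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search over candidate units per target position, minimising
    summed target cost plus concatenation cost between chosen neighbours."""
    # best[i][j]: cheapest path cost ending in candidate j at position i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = []
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[i - 1][j] + concat_cost(p, c)
                     for j, p in enumerate(candidates[i - 1])]
            j = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j] + target_cost(targets[i], c))
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return list(reversed(path))

# Toy example: units are (pitch, duration) pairs; both costs are absolute
# differences, standing in for the paper's prosodic/spectral criteria.
t_cost = lambda t, c: abs(t - c[0])
c_cost = lambda a, b: abs(a[1] - b[1])
targets = [100, 120]
candidates = [[(100, 10), (101, 30)], [(119, 11), (125, 10)]]
print(select_units(targets, candidates, t_cost, c_cost))  # prints [0, 0]
```

Refining selection with spectral and prosodic features, as the paper does, amounts to enriching these two cost functions while the search itself stays the same.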