
Publication


Featured research published by Sivanand Achanta.


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping

Gautam Varma Mantena; Sivanand Achanta; Kishore Prahallad

The task of query-by-example spoken term detection (QbE-STD) is to find a spoken query within spoken audio data. Current state-of-the-art techniques assume zero prior knowledge about the language of the audio data, and thus explore dynamic time warping (DTW) based techniques for the QbE-STD task. In this paper, we use a variant of the DTW algorithm referred to as non-segmental DTW (NS-DTW), with a computational upper bound of O(mn), and analyze the performance of QbE-STD with Gaussian posteriorgrams obtained from spectral and temporal features of the speech signal. The results show that frequency domain linear prediction cepstral coefficients, which capture the temporal dynamics of the speech signal, can be used as an alternative to traditional spectral parameters such as linear prediction cepstral coefficients, perceptual linear prediction cepstral coefficients and Mel-frequency cepstral coefficients. We also introduce another variant of NS-DTW called fast NS-DTW (FNS-DTW), which uses reduced feature vectors for search. With a reduction factor of α ∈ ℕ, we show that the computational upper bound for FNS-DTW is O(mn/α²), which is faster than NS-DTW.
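NS-DTW itself differs in its path constraints, but the O(mn) dynamic programme it is bounded by, and the feature-reduction idea behind FNS-DTW, can be sketched as follows (a minimal illustration on 1-D features; function names are hypothetical):

```python
def dtw_cost(query, reference, dist=lambda a, b: abs(a - b)):
    """Classic DTW by dynamic programming, O(m*n) time: accumulate the
    cheapest warping-path cost over the local distance matrix."""
    m, n = len(query), len(reference)
    INF = float("inf")
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = dist(query[i - 1], reference[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # insertion
                              D[i][j - 1],      # deletion
                              D[i - 1][j - 1])  # match
    return D[m][n]

def reduce_features(seq, alpha):
    """FNS-DTW idea: average non-overlapping blocks of alpha frames.
    Both sequences shrink by a factor alpha, so the DP above drops
    from O(mn) to O(mn/alpha^2)."""
    return [sum(seq[i:i + alpha]) / len(seq[i:i + alpha])
            for i in range(0, len(seq), alpha)]
```

In the paper the local distance is computed between Gaussian posteriorgram vectors rather than scalars, but the complexity argument is the same.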


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

An Investigation of Deep Neural Network Architectures for Language Recognition in Indian Languages.

K. V. Mounika; Sivanand Achanta; H R Lakshmi; Suryakanth V. Gangashetty; Anil Kumar Vuppala

In this paper, deep neural networks (DNNs) are investigated for language identification in Indian languages. DNNs have recently been proposed for this task; however, many of the architectural choices and training aspects involved in building such systems have not been studied carefully. We perform several experiments on a dataset consisting of 12 Indian languages, with a total of about 120 hours of training data, to evaluate the effect of such choices. While the DNN based approach is inherently frame based, we propose an attention mechanism based DNN architecture for utterance level classification, thereby efficiently making use of the context. Evaluation of the models was performed on 30 hours of test data, with 2.5 hours for each language. In our results, we find that deeper architectures outperform shallower counterparts. Also, the DNN with attention mechanism outperforms the regular DNN models, indicating the effectiveness of the attention mechanism.
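The attention pooling described above can be sketched as a learned softmax-weighted average that collapses frame-level features into one utterance-level vector (a minimal sketch; the dot-product scoring vector `w` is an assumption, not the paper's exact parameterisation):

```python
import numpy as np

def attention_pool(frame_feats, w):
    """Utterance-level representation from frame-level features (T, d):
    score each frame, softmax the scores, and take the weighted average."""
    scores = frame_feats @ w          # (T,) one relevance score per frame
    scores -= scores.max()            # numerical stability before exp
    alphas = np.exp(scores)
    alphas /= alphas.sum()            # attention weights sum to 1
    return alphas @ frame_feats       # (d,) attention-weighted utterance vector
```

The pooled vector then feeds a normal classification layer, so the whole network is trained end to end for utterance-level language labels rather than per-frame ones.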


International Joint Conference on Neural Networks (IJCNN) | 2016

Analysis of sequence to sequence neural networks on grapheme to phoneme conversion task.

Sivanand Achanta; Ayushi Pandey; Suryakanth V. Gangashetty

In this paper, we analyze the performance of various sequence to sequence neural networks on the task of grapheme to phoneme (G2P) conversion. G2P is a very important component in applications like text-to-speech and automatic speech recognition. Because the number of graphemes in a word and the corresponding number of phonemes differ, the two sequences are traditionally first aligned and then mapped. With the recent advent of sequence to sequence neural networks, the alignment step can be skipped, allowing us to directly map the input and output sequences. Although sequence to sequence neural nets have been applied to this task very recently, there are some questions concerning the architecture that need to be addressed. We show in this paper that complex recurrent neural network units (like long short-term memory cells) may not be required to achieve good performance on this task; instead, simple recurrent neural networks (RNNs) suffice. We also show that the encoder can be a uni-directional RNN, as opposed to the usually preferred bi-directional RNN. Further, our experiments reveal that encoder-decoder models with soft alignment outperform their fixed context vector counterparts. The results demonstrate that with very few parameters we can indeed achieve performance comparable to much more complicated architectures.
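The fixed context vector vs. soft alignment contrast can be sketched as follows (illustrative numpy; the plain dot-product scoring is an assumption, standing in for whatever alignment score the model learns):

```python
import numpy as np

def fixed_context(enc_states):
    """Fixed-vector conditioning: the decoder is given only the final
    encoder hidden state, identical for every output step."""
    return enc_states[-1]

def soft_context(enc_states, dec_state):
    """Soft alignment: at each decoding step, re-weight all encoder
    states (T, d) by their match with the current decoder state."""
    scores = enc_states @ dec_state   # (T,) alignment scores
    scores -= scores.max()            # numerical stability
    alphas = np.exp(scores)
    alphas /= alphas.sum()
    return alphas @ enc_states        # (d,) per-step context vector
```

With soft alignment the decoder can attend to the graphemes relevant to the phoneme it is currently emitting, which is why it tends to beat the single fixed summary vector.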


9th ISCA Speech Synthesis Workshop | 2016

Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis.

Sivanand Achanta; Rambabu Banoth; Ayushi Pandey; Anandaswarup Vadapalli; Suryakanth V. Gangashetty

In this paper, we propose to use the hidden state vector obtained from a recurrent neural network (RNN) as a context vector representation for deep neural network (DNN) based statistical parametric speech synthesis. While a typical DNN based system uses a hierarchy of text features from the phone level to the utterance level, these are usually in a 1-hot-k encoded representation. Our hypothesis is that supplementing the conventional text features with a continuous, frame-level, acoustically guided representation would improve acoustic modeling. The hidden state from an RNN trained to predict acoustic features is used as the additional contextual information. A dataset consisting of 2 Indian languages (Telugu and Hindi) from the Blizzard Challenge 2015 was used in our experiments. Both the subjective listening tests and the objective scores indicate that the proposed approach performs significantly better than the baseline DNN system.
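The pipeline above amounts to running an auxiliary RNN over the frame-level text features and appending its hidden states to the conventional 1-hot features before the DNN. A minimal sketch (the tanh recurrence and all weight shapes are assumptions for illustration):

```python
import numpy as np

def rnn_hidden_states(x_seq, W_xh, W_hh, b_h):
    """Run a simple recurrent layer over frame-level inputs (T, d_in)
    and return every hidden state (T, d_h), not just the last one."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in x_seq:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

def augment_text_features(onehot_feats, hidden_states):
    """Supplement the 1-hot-k text features with the acoustically
    guided, continuous frame-level hidden-state representation."""
    return np.concatenate([onehot_feats, hidden_states], axis=1)
```

In the paper the auxiliary RNN is trained to predict acoustic features, so its hidden states carry acoustic context that the discrete text features lack.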


Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) | 2018

Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System

Hari Krishna; Sivanand Achanta; Anil Kumar Vuppala

Speaker normalization is one of the crucial aspects of an automatic speech recognition (ASR) system. It is employed to reduce the performance drop in ASR due to speaker variability. Traditional speaker normalization methods are mostly linear transforms over the input data, estimated per speaker; such transforms are effective only with sufficient data. In practical scenarios, however, only a single utterance from the test speaker is accessible. The present study explores speaker normalization methods for end-to-end speech recognition systems that can be performed efficiently even when only a single utterance from an unseen speaker is available. In this work, it is hypothesized that by suitably providing information about the speaker's identity while training an end-to-end neural network, the capability to normalize speaker variability can be incorporated into an ASR system. The efficiency of these normalization methods depends on the representation used for unseen speakers. The identity of each training speaker is represented in two different ways: i) using a one-hot speaker code, and ii) as a weighted combination of all the training speakers' identities. Unseen speakers from the test set are represented using a weighted combination of the training speakers' representations. The two approaches reduced the word error rate (WER) by 0.6% and 1.3%, respectively, on the WSJ corpus.
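The two speaker representations can be sketched as follows (a minimal illustration; the softmax weighting over similarity scores is an assumption, since the exact weighting scheme is not spelled out here):

```python
import numpy as np

def one_hot_speaker_code(idx, num_speakers):
    """(i) Training speaker: a 1-hot code appended to the network input."""
    code = np.zeros(num_speakers)
    code[idx] = 1.0
    return code

def combined_speaker_code(similarities):
    """(ii) Unseen test speaker: a normalised weighted combination of the
    training speakers' codes. With 1-hot training codes, the combined
    code reduces to the weight vector itself."""
    s = np.asarray(similarities, dtype=float)
    s -= s.max()                      # numerical stability
    w = np.exp(s)
    return w / w.sum()                # weights sum to 1
```

At test time a single utterance is enough to compute similarities to the training speakers, which is what makes the weighted-combination representation usable for unseen speakers.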


Speech Communication | 2017

Deep Elman recurrent neural networks for statistical parametric speech synthesis

Sivanand Achanta; Suryakanth V. Gangashetty

Owing to the success of deep learning techniques in automatic speech recognition, deep neural networks (DNNs) have been used as acoustic models for statistical parametric speech synthesis (SPSS). DNNs do not inherently model the temporal structure in speech and text, and hence are not well suited to be directly applied to the problem of SPSS. Recurrent neural networks (RNNs), on the other hand, have the capability to model time series. RNNs with long short-term memory (LSTM) cells have been shown to outperform DNN based SPSS. However, LSTM cells and their variants, like gated recurrent units (GRUs) and simplified LSTMs (SLSTMs), have a complicated structure and are computationally expensive compared to simple recurrent architectures like the Elman RNN. In this paper, we explore deep Elman RNNs for SPSS and compare their effectiveness against deep gated RNNs. Specifically, we perform experiments to show that (1) deep Elman RNNs are better suited for acoustic modeling in SPSS than DNNs and perform competitively with deep SLSTMs, GRUs and LSTMs, (2) context representation learning using Elman RNNs improves neural network acoustic models for SPSS, and (3) an Elman RNN based duration model is better than the DNN based counterpart. Experiments were performed on the Blizzard Challenge 2015 dataset consisting of 3 Indian languages (Telugu, Hindi and Tamil). Through subjective and objective evaluations, we show that our proposed systems outperform the baseline systems across different speakers and languages.
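The Elman recurrence at the heart of this comparison is just h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), with no gates at all; a deep Elman RNN stacks such layers, feeding each layer's hidden sequence to the next. A minimal sketch (weight shapes are assumptions):

```python
import numpy as np

def elman_layer(x_seq, W_xh, W_hh, b):
    """One Elman layer: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b).
    No input/forget/output gates, so far cheaper per step than LSTM/GRU."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x in x_seq:
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        out.append(h)
    return np.stack(out)

def deep_elman(x_seq, layers):
    """Deep Elman RNN: each layer consumes the previous layer's
    hidden-state sequence."""
    for W_xh, W_hh, b in layers:
        x_seq = elman_layer(x_seq, W_xh, W_hh, b)
    return x_seq
```

The paper's claim is that this gate-free recurrence, stacked deep, is competitive with LSTM/GRU acoustic models for SPSS at a fraction of the per-step cost.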


International Conference on Mining Intelligence and Knowledge Exploration (MIKE) | 2016

A Study on Text-Independent Speaker Recognition Systems in Emotional Conditions Using Different Pattern Recognition Models.

K. N. R. K. Raju Alluri; Sivanand Achanta; Rajendra Prasath; Suryakanth V. Gangashetty; Anil Kumar Vuppala

The present study focuses on text-independent speaker recognition in emotional conditions. Both system and source features are considered to represent speaker-specific information. At the model level, Gaussian mixture models (GMMs), the Gaussian mixture model-universal background model (GMM-UBM) and deep neural networks (DNNs) are explored. The experiments are performed using 3 emotional databases: the German emotional speech database (EMO-DB), IITKGP-SESC: Hindi and IITKGP-SESC: Telugu. The emotions considered in the present study are neutral, anger, happy and sad. The results show that the performance of a speaker recognition system trained on clean speech degrades when tested with emotional data, irrespective of the features or models used to build the system. The best results are obtained from the score level fusion of the system and source feature based systems when speakers are modeled with DNNs.
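Score-level fusion of the two subsystems can be sketched as a convex combination of their scores (the fusion weight is not given here, so equal weighting is assumed for illustration):

```python
def fuse_scores(system_score, source_score, alpha=0.5):
    """Score-level fusion: combine the system-feature and source-feature
    subsystem scores with weight alpha (equal weighting assumed)."""
    return alpha * system_score + (1.0 - alpha) * source_score
```

The fused score is then thresholded or ranked exactly like a single-system score, which is why fusion can help whenever the two feature streams make complementary errors.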


Conference of the International Speech Communication Association (INTERSPEECH) | 2015

An investigation of recurrent neural network architectures for statistical parametric speech synthesis.

Sivanand Achanta; Tejas Godambe; Suryakanth V. Gangashetty


Conference of the International Speech Communication Association (INTERSPEECH) | 2017

Detection of Replay Attacks Using Single Frequency Filtering Cepstral Coefficients.

K. N. R. K. Raju Alluri; Sivanand Achanta; Sudarsana Reddy Kadiri; Suryakanth V. Gangashetty; Anil Kumar Vuppala


Conference of the International Speech Communication Association (INTERSPEECH) | 2017

SFF Anti-Spoofer: IIIT-H Submission for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017.

K. N. R. K. Raju Alluri; Sivanand Achanta; Sudarsana Reddy Kadiri; Suryakanth V. Gangashetty; Anil Kumar Vuppala

Collaboration


Dive into Sivanand Achanta's collaborations.

Top Co-Authors (all at the International Institute of Information Technology):

Suryakanth V. Gangashetty
Anil Kumar Vuppala
K. N. R. K. Raju Alluri
Ayushi Pandey
Kishore Prahallad
Sudarsana Reddy Kadiri
Anandaswarup Vadapalli
Gautam Varma Mantena
K. V. Mounika
Tejas Godambe