Publication


Featured research published by Shiv Vitaladevuni.


International Conference on Machine Learning and Applications | 2015

Model Shrinking for Embedded Keyword Spotting

Ming Sun; Varun Nagaraja; Bjorn Hoffmeister; Shiv Vitaladevuni

In this paper we present two approaches to improving the computational efficiency of a keyword spotting system running on a resource-constrained device. This embedded keyword spotting system detects a pre-specified keyword in real time at a low CPU and memory cost. Our system is a two-stage cascade. The first stage extracts keyword hypotheses from input audio streams. After the first stage is triggered, hand-crafted features are extracted from the keyword hypothesis and fed to a support vector machine (SVM) classifier in the second stage. This paper focuses on improving the computational efficiency of the second-stage SVM classifier. More specifically, we select a subset of feature dimensions and merge support vectors to shrink the SVM classifier, while maintaining keyword spotting performance. Experimental results indicate that we can remove more than 36% of the non-discriminative SVM features and reduce the number of support vectors by more than 60% without significant performance degradation. This results in a more than 15% relative reduction in CPU utilization.
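A rough scikit-learn sketch of those two shrinking steps follows. The mutual-information feature ranking and the k-means merging of support vectors are illustrative stand-ins for the paper's exact selection and merging criteria, and shrink_svm, keep_dims, and n_merged are hypothetical names.

    # Sketch: (1) drop low-importance feature dimensions, (2) merge
    # support vectors by clustering. Both criteria are assumptions,
    # not the paper's exact method.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.cluster import KMeans
    from sklearn.feature_selection import mutual_info_classif

    def shrink_svm(X, y, keep_dims=0.64, n_merged=100, gamma=0.1):
        # Step 1: rank feature dimensions by mutual information with
        # the label and keep the most discriminative fraction.
        mi = mutual_info_classif(X, y)
        keep = np.argsort(mi)[::-1][: int(keep_dims * X.shape[1])]
        svm = SVC(kernel="rbf", gamma=gamma).fit(X[:, keep], y)

        # Step 2: merge support vectors by k-means within each sign of
        # the dual coefficients; each centroid inherits the summed
        # coefficients of its cluster, giving a smaller "reduced set"
        # that approximates the original decision function.
        sv, alpha = svm.support_vectors_, svm.dual_coef_.ravel()
        merged_sv, merged_alpha = [], []
        for mask in (alpha > 0, alpha < 0):
            km = KMeans(n_clusters=int(min(n_merged, mask.sum()))).fit(sv[mask])
            merged_sv.append(km.cluster_centers_)
            merged_alpha.append(np.array(
                [alpha[mask][km.labels_ == c].sum()
                 for c in range(km.n_clusters)]))
        return keep, np.vstack(merged_sv), np.concatenate(merged_alpha), svm.intercept_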


Spoken Language Technology Workshop | 2016

Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting

Ming Sun; Anirudh Raju; George Tucker; Sankaran Panchapagesan; Gengshen Fu; Arindam Mandal; Spyros Matsoukas; Nikko Strom; Shiv Vitaladevuni

We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, a max-pooling loss trained LSTM with a randomly initialized network performs better than a cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, yielding a 67.6% relative reduction in the Area Under the Curve (AUC) measure compared to the baseline feed-forward DNN.
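A minimal PyTorch sketch of a max-pooling loss in this spirit: for a keyword utterance, only the frame with the highest keyword posterior receives a cross-entropy gradient, while background utterances are trained on every frame. Restricting the max to the annotated keyword region, batching, and the posterior smoothing step are omitted, and max_pooling_loss is a hypothetical helper.

    import torch
    import torch.nn.functional as F

    def max_pooling_loss(logits, is_keyword, keyword_class=1):
        # logits: (T, n_classes) per-frame LSTM outputs for one
        # utterance; is_keyword: utterance-level label.
        log_post = F.log_softmax(logits, dim=-1)
        if is_keyword:
            # back-propagate only through the best-scoring frame
            t = torch.argmax(log_post[:, keyword_class])
            return -log_post[t, keyword_class]
        # background: every frame should predict the non-keyword class
        targets = torch.zeros(logits.size(0), dtype=torch.long)
        return F.cross_entropy(logits, targets)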


Conference of the International Speech Communication Association | 2016

Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting

Sankaran Panchapagesan; Ming Sun; Aparna Khare; Spyros Matsoukas; Arindam Mandal; Björn Hoffmeister; Shiv Vitaladevuni

We propose improved Deep Neural Network (DNN) training loss functions for more accurate single keyword spotting on resource-constrained embedded devices. The loss function modifications consist of a combination of multi-task training and weighted cross-entropy. In the multi-task architecture, the keyword DNN acoustic model is trained on two tasks in parallel: the main task of predicting the keyword-specific phone states, and an auxiliary task of predicting LVCSR senones. We show that multi-task learning achieves accuracy comparable to a previously proposed transfer learning approach in which the keyword DNN training is initialized from an LVCSR DNN with the same input and hidden layer sizes. The combination of LVCSR initialization and multi-task training gives improved keyword detection accuracy compared to either technique alone. We also propose modifying the loss function to give a higher weight to input frames corresponding to keyword phone targets, motivated by the need to balance the keyword and background training data. We show that weighted cross-entropy yields additional accuracy improvements. Finally, we show that the combination of all three techniques (LVCSR initialization, multi-task training, and weighted cross-entropy) gives the best results, with a significantly lower False Alarm Rate than LVCSR initialization alone across a wide range of Miss Rates.
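A hedged PyTorch sketch of both modifications: a shared trunk with two softmax heads (keyword phone states as the main task, LVCSR senones as the auxiliary task) and a per-frame weight that up-weights frames aligned to keyword phones. Layer sizes, the weight value, and the task mixing factor here are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskKwsDnn(nn.Module):
        def __init__(self, n_in=400, n_hidden=512,
                     n_kw_states=10, n_senones=3000):  # assumed sizes
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(n_in, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_hidden), nn.ReLU())
            self.kw_head = nn.Linear(n_hidden, n_kw_states)    # main task
            self.senone_head = nn.Linear(n_hidden, n_senones)  # aux task

        def forward(self, x):
            h = self.trunk(x)
            return self.kw_head(h), self.senone_head(h)

    def mtl_weighted_loss(model, feats, kw_targets, senone_targets,
                          kw_frame_mask, kw_weight=2.0, aux_scale=0.5):
        kw_logits, sen_logits = model(feats)
        # weighted cross-entropy: frames aligned to keyword phones get
        # a larger weight to counter the keyword/background imbalance
        per_frame = F.cross_entropy(kw_logits, kw_targets, reduction="none")
        w = 1.0 + (kw_weight - 1.0) * kw_frame_mask.float()
        main = (w * per_frame).mean()
        aux = F.cross_entropy(sen_logits, senone_targets)
        return main + aux_scale * aux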


Conference of the International Speech Communication Association | 2016

Model Compression Applied to Small-Footprint Keyword Spotting

George Tucker; Minhua Wu; Ming Sun; Sankaran Panchapagesan; Gengshen Fu; Shiv Vitaladevuni

Several consumer speech devices feature voice interfaces that perform on-device keyword spotting to initiate user interactions. Accurate on-device keyword spotting within a tight CPU budget is crucial for such devices. Motivated by this, we investigated two ways to improve deep neural network (DNN) acoustic models for keyword spotting without increasing CPU usage. First, we used low-rank weight matrices throughout the DNN. This allowed us to increase representational power by increasing the number of hidden nodes per layer without changing the total number of multiplications. Second, we used knowledge distilled from an ensemble of much larger DNNs used only during training. We systematically evaluated these two approaches on a massive corpus of far-field utterances. Each technique improves performance on its own, and together they give significant reductions in false alarms and misses without increasing CPU or memory usage.
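Both ideas admit a compact PyTorch sketch: a rank-r factored linear layer that cuts the multiply count from d_in*d_out to r*(d_in + d_out), and a temperature-softened distillation loss against the ensemble teacher. The rank, temperature, and mixing weight below are assumed placeholders, not the paper's values.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankLinear(nn.Module):
        # replaces a dense d_out x d_in weight matrix with the product
        # B @ A of two rank-r factors
        def __init__(self, d_in, d_out, rank):
            super().__init__()
            self.A = nn.Linear(d_in, rank, bias=False)
            self.B = nn.Linear(rank, d_out)

        def forward(self, x):
            return self.B(self.A(x))

    def distillation_loss(student_logits, teacher_logits, targets,
                          T=2.0, alpha=0.5):
        # soften both distributions with temperature T, match them with
        # KL divergence, and mix in the usual hard-label cross-entropy
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, targets)
        return alpha * soft + (1.0 - alpha) * hard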


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2016

Search-based Evaluation from Truth Transcripts for Voice Search Applications

François Mairesse; Paul Raccuglia; Shiv Vitaladevuni

Voice search applications are typically evaluated by comparing the predicted query to a reference human transcript, regardless of the search results returned by the query. While we find that an exact transcript match is highly indicative of user satisfaction, a transcript that does not match the reference still produces satisfactory search results a significant fraction of the time. This paper therefore proposes an evaluation method that compares the search results of the speech recognition hypotheses with the search results produced by a human transcript. Compared with a strict sentence match, a human evaluation shows that search result overlap is a better predictor of (a) user satisfaction and (b) search result click-through. Finally, we propose a model predicting the Expected Search Satisfaction Rate (ESSR), conditioned on search overlap outcomes. On a held-out set of 1036 voice search queries, our model predicted an ESSR within 0.9% (relative) of the ground truth satisfaction averaged over 3 human judges.
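One natural instantiation of this comparison is the Jaccard overlap of the top-N result sets retrieved for the hypothesis and for the reference transcript. The sketch below assumes a placeholder search function returning ranked result identifiers; the paper's exact overlap statistic may differ.

    def result_overlap(hypothesis, reference, search, n=10):
        # compare what the user would actually see for either query
        hyp_results = set(search(hypothesis)[:n])
        ref_results = set(search(reference)[:n])
        if not hyp_results and not ref_results:
            return 1.0
        return len(hyp_results & ref_results) / len(hyp_results | ref_results)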


Conference of the International Speech Communication Association | 2017

Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting

Ming Sun; David Snyder; Yixin Gao; Varun Nagaraja; Mike Rodehorst; Sankaran Panchapagesan; Nikko Strom; Spyros Matsoukas; Shiv Vitaladevuni


Archive | 2013

Audio output masking for improved automatic speech recognition

Shiv Vitaladevuni; Amit Singh Chhetri; Phillip Ryan Hilmes; Rohit Prasad


International Conference on Machine Learning and Applications | 2017

An Empirical Study of Cross-Lingual Transfer Learning Techniques for Small-Footprint Keyword Spotting

Ming Sun; Andreas Schwarz; Minhua Wu; Nikko Strom; Spyros Matsoukas; Shiv Vitaladevuni


Archive | 2016

Text detection using features associated with neighboring glyph pairs

Thibaud Senechal; Quan Wang; Daniel Makoto Willenson; Shuang Wu; Yue Liu; Shiv Vitaladevuni; David Paul Ramos; Qingfeng Yu


Archive | 2018

Keyword Detection Modeling Using Contextual Information

Rohit Prasad; Kenneth John Basye; Spyridon Matsoukas; Shiv Vitaladevuni; Bjorn Hoffmeister
