
Publication


Featured research published by Mohit Shah.


IEEE International Conference on Emerging Signal Processing Applications | 2012

Lifelogging: Archival and retrieval of continuously recorded audio using wearable devices

Mohit Shah; Brian R. Mears; Chaitali Chakrabarti; Andreas Spanias

We propose a complete system for lifelogging in which audio is continuously recorded using a smartphone or a wearable recorder. Recorded audio includes speech, music, and environmental sounds. First, we describe a feature-based segmentation algorithm for breaking a long piece of audio into smaller clips. To archive clips in a large database, we present methods for automatically indexing and annotating audio with relevant acoustic and semantic tags. Retrieval is performed using a query-by-example approach. The results are demonstrated via a smartphone application on the Android platform. Finally, we also propose a novel virtualization-based design framework for rapidly developing and testing such signal processing systems.
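
As a rough illustration of the feature-based segmentation step, the sketch below cuts a recording wherever short-time features jump sharply. The features (log-energy and spectral centroid) and the threshold are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of feature-based audio segmentation (illustration only;
# the paper's exact features and thresholds are not specified here).
import numpy as np

def frame_features(audio, sr, frame_len=0.025, hop=0.010):
    """Short-time log-energy and spectral centroid per frame."""
    n = int(frame_len * sr)
    h = int(hop * sr)
    feats = []
    for start in range(0, len(audio) - n, h):
        frame = audio[start:start + n] * np.hanning(n)
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, 1.0 / sr)
        energy = np.log(np.sum(frame ** 2) + 1e-10)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-10)
        feats.append([energy, centroid])
    return np.asarray(feats)

def segment_boundaries(feats, threshold=2.0):
    """Mark a boundary wherever the feature trajectory jumps sharply."""
    mu, sigma = feats.mean(axis=0), feats.std(axis=0) + 1e-10
    z = (feats - mu) / sigma                 # normalize each feature
    dist = np.linalg.norm(np.diff(z, axis=0), axis=1)
    return np.where(dist > threshold)[0] + 1  # frame indices of cuts

# Usage: boundaries = segment_boundaries(frame_features(audio, 16000))
```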


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

A speech emotion recognition framework based on latent Dirichlet allocation: Algorithm and FPGA implementation

Mohit Shah; Lifeng Miao; Chaitali Chakrabarti; Andreas Spanias

In this paper, we present a speech emotion recognition framework built on a latent Dirichlet allocation (LDA) model. The method assumes that incoming speech frames are conditionally independent and exchangeable. While this discards temporal structure, it captures significant statistical information across frames; a hidden Markov model-based approach, in contrast, captures the temporal structure in speech. Using the German emotional speech database EMO-DB for evaluation, we achieve an average classification accuracy of 80.7%, compared to 73% for hidden Markov models. This improvement comes at the cost of a slight increase in computational complexity. We map the proposed algorithm onto an FPGA platform and show that emotions in a speech utterance of 1.5 s duration can be identified in 1.8 ms while utilizing 70% of the resources, further demonstrating the suitability of our approach for real-time applications on hand-held devices.
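
A minimal sketch of the bag-of-words-plus-LDA idea the abstract describes: frames are vector-quantized into "acoustic words", each utterance becomes a word-count histogram (frames treated as exchangeable), and the LDA topic proportions serve as the utterance-level feature. The codebook size, topic count, and classifier are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of a bag-of-acoustic-words + LDA pipeline.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

def utterance_histograms(frame_feats_per_utt, n_words=64):
    """Quantize frames into 'acoustic words'; count them per utterance."""
    all_frames = np.vstack(frame_feats_per_utt)
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(all_frames)
    hists = np.zeros((len(frame_feats_per_utt), n_words))
    for i, utt in enumerate(frame_feats_per_utt):
        words, counts = np.unique(codebook.predict(utt), return_counts=True)
        hists[i, words] = counts
    return hists

def lda_features(hists, n_topics=10):
    """Treat each utterance as a 'document' of exchangeable frames;
    the LDA topic proportions become its feature vector."""
    lda = LatentDirichletAllocation(n_components=n_topics)
    return lda.fit_transform(hists)

# Usage: clf = LogisticRegression().fit(lda_features(hists), labels)
```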


International Symposium on Quality Electronic Design (ISQED) | 2012

A top-down design methodology using virtual platforms for concept development

Mohit Shah; Brian R. Mears; Chaitali Chakrabarti; Andreas Spanias

Virtual platforms are widely used for system-level modeling, design, and simulation. In this paper, we propose a virtual-platform-based, top-down, system-level design methodology for developing and testing hardware/software from the concept level, even before the architecture is finalized. The methodology builds on tools such as QEMU, SystemC, and TLM-2.0: it starts with a functional, high-level description of the system and gradually refines the intricate architectural details. We demonstrate the methodology on a novel audio-blogging concept. The system under consideration involves the design of a low-power wearable audio recorder, an Android application for the user interface, and a server for audio analysis. A virtual system consisting of three instances of QEMU and other tools was created to demonstrate the concept and test this approach. Finally, we describe a suite of tools useful for quickly validating concepts and creating virtual platforms for early hardware/software codesign.
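
The sketch below illustrates the refinement idea in miniature, though not the paper's actual QEMU/SystemC flow: the same (hypothetical) component is modeled first functionally and then with an assumed timing annotation, and one test bench exercises both abstraction levels.

```python
# Illustrative sketch of functional-to-timed model refinement; class
# names and the cycle budget are hypothetical, chosen for illustration.
class FunctionalRecorder:
    """Untimed, purely functional: correct behavior only."""
    def record(self, samples):
        return [s for s in samples]        # pass-through capture

class TimedRecorder(FunctionalRecorder):
    """Refinement: same behavior, plus an approximate timing model."""
    CYCLES_PER_SAMPLE = 4                  # assumed budget, for illustration

    def __init__(self):
        self.cycles = 0

    def record(self, samples):
        self.cycles += self.CYCLES_PER_SAMPLE * len(samples)
        return super().record(samples)

# The same test bench exercises both levels, which is the point of
# top-down refinement: behavior is locked in before timing is added.
for model in (FunctionalRecorder(), TimedRecorder()):
    assert model.record([1, 2, 3]) == [1, 2, 3]
```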


Signal Processing Systems | 2015

A fixed-point neural network for keyword detection on resource constrained hardware

Mohit Shah; Jingcheng Wang; David T. Blaauw; Dennis Sylvester; Hun-Seok Kim; Chaitali Chakrabarti

Keyword detection is typically used as a front-end to trigger automatic speech recognition and spoken dialog systems. The detection engine needs to be listening continuously, which has strong implications for power and memory consumption. In this paper, we devise a neural network architecture for keyword detection and present a set of techniques for reducing its memory requirements, making the architecture suitable for resource-constrained hardware. Specifically, a fixed-point implementation is considered; aggressively scaling down the precision of the weights lowers the memory footprint compared to a naive floating-point implementation. For further optimization, a node pruning technique is proposed to identify and remove the least active nodes in the network. Experiments are conducted on 10 keywords selected from the Resource Management (RM) database. The trade-off between detection performance and memory is assessed for different weight representations. We show that a neural network with as few as 5 bits per weight yields a marginal and acceptable loss in performance, while requiring only 200 kilobytes (KB) of on-board memory and a latency of 150 ms. A hardware architecture based on a single multiplier, with a power consumption of less than 10 mW, is also presented.
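
A minimal sketch of the two memory-reduction techniques named above: uniform fixed-point weight quantization and least-active-node pruning. The bit width, pruning criterion, and keep ratio below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of fixed-point quantization and node pruning (assumed details).
import numpy as np

def quantize_weights(w, n_bits=5):
    """Uniform symmetric quantization of float weights to n-bit levels."""
    scale = np.max(np.abs(w)) + 1e-12
    levels = 2 ** (n_bits - 1) - 1         # e.g. 15 levels for 5 bits
    q = np.round(w / scale * levels)
    return q / levels * scale              # dequantized, for simulation

def prune_least_active(activations, weights, keep_ratio=0.8):
    """Drop hidden nodes whose mean |activation| over the data is smallest.
    activations: (n_samples, n_hidden); weights: (n_in, n_hidden)."""
    mean_act = np.mean(np.abs(activations), axis=0)
    n_keep = int(keep_ratio * len(mean_act))
    keep = np.argsort(mean_act)[-n_keep:]
    return weights[:, keep], keep

# 5 bits per weight vs. float32 is roughly a 32/5 ≈ 6.4x memory
# reduction, in line with the trade-off the abstract describes.
```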


International Symposium on Circuits and Systems (ISCAS) | 2014

A multi-modal approach to emotion recognition using undirected topic models

Mohit Shah; Chaitali Chakrabarti; Andreas Spanias

A multi-modal framework for emotion recognition using bag-of-words features and undirected, replicated softmax topic models is proposed here. Topic models ignore the temporal ordering of features, allowing them to capture complex structure without a brute-force collection of statistics. Experiments are performed on face, speech, and language features extracted from the USC IEMOCAP database. Performance on facial features yields an unweighted average recall of 60.71%, a relative improvement of 8.89% over state-of-the-art approaches. Comparable performance is achieved when considering only speech (57.39%) or a fusion of speech and face information (66.05%). Individually, each source is strong at recognizing a particular emotion: sadness (speech), happiness (face), or neutral (language); a multi-modal fusion retains these properties and improves the accuracy to 68.92%. Implementation times for each source and their combination are provided. Results show that a turn of 1 second duration can be classified in approximately 666.65 ms, making this method highly amenable to real-time implementation.
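
For concreteness, the sketch below shows one simple way to fuse per-modality outputs at the decision level. The paper's actual fusion operates on replicated-softmax topic features, so this is an illustration of the fusion step only, with hypothetical classifier outputs as inputs.

```python
# Hedged sketch of decision-level multi-modal fusion (not the paper's
# replicated-softmax model): weighted averaging of class posteriors.
import numpy as np

def fuse_posteriors(posteriors, weights=None):
    """posteriors: dict of modality -> (n_turns, n_classes) array."""
    mats = list(posteriors.values())
    if weights is None:
        weights = np.ones(len(mats)) / len(mats)   # equal weights
    fused = sum(w * m for w, m in zip(weights, mats))
    return fused.argmax(axis=1)            # predicted emotion per turn

# Usage with hypothetical per-modality outputs p_face, p_speech, p_lang:
# labels = fuse_posteriors({"face": p_face, "speech": p_speech,
#                           "language": p_lang})
```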


Frontiers in Education Conference (FIE) | 2010

Audio content-based feature extraction algorithms using J-DSP for arts, media and engineering courses

Mohit Shah; Gordon Wichern; Andreas Spanias; Harvey D. Thornburg

J-DSP is a Java-based, object-oriented online programming environment developed at Arizona State University for education and research. This paper presents a collection of interactive Java modules for introducing undergraduate and graduate students to feature extraction from music and audio signals. These tools enable online simulations of algorithms used in content-based audio classification and Music Information Retrieval (MIR). The simulation software is accompanied by a series of computer experiments and exercises that provide hands-on training. Specific functions that have been developed include widely used modules such as pitch detection, tonality, harmonicity, spectral centroid, and the Mel-frequency cepstral coefficients (MFCC). This effort is part of a combined research and curriculum program funded by NSF CCLI that aims to expose students to advanced multidisciplinary concepts and research in signal processing.
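
To make two of the named modules concrete, here is a Python sketch of the spectral centroid and MFCC computations (J-DSP itself is Java-based; the filterbank size and coefficient count below are common defaults, not necessarily J-DSP's).

```python
# Sketch of spectral centroid and MFCC extraction for a single frame.
import numpy as np
from scipy.fftpack import dct

def spectral_centroid(frame, sr):
    """Amplitude-weighted mean frequency of the frame's spectrum."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return np.sum(freqs * spec) / (np.sum(spec) + 1e-10)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fb[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1])
        fb[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i])
    return fb

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """Log mel-filterbank energies, decorrelated with a DCT."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, len(frame), sr) @ power
    return dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_coeffs]
```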


Synthesis Lectures on Algorithms and Software in Engineering | 2016

Virtual Design of an Audio Lifelogging System: Tools for IoT Systems

Brian R. Mears; Mohit Shah

The availability of inexpensive, custom, highly integrated circuits is enabling some very powerful systems that bring together sensors, smartphones, wearables, cloud computing, and other technologies. To design these types of complex systems, we advocate a top-down simulation methodology that identifies problems early and enables software development to start before expensive chip and hardware development. We call the overall approach virtual design. This book explains why simulation has become important for chip design and introduces some of the simulation methods used. The audio lifelogging research project demonstrates the virtual design process in practice. The goals of this book are to: explain how silicon design has become more closely involved with system design; show how virtual design enables top-down design; explain the utility of simulation at different abstraction levels; and show how open-source simulation software was used in audio lifelogging. The target audience for this book is faculty, engineers, and students who are interested in developing digital devices for Internet of Things (IoT) products.


Asilomar Conference on Signals, Systems and Computers | 2014

A scalable feature learning and tag prediction framework for natural environment sounds

Prasanna Sattigeri; Jayaraman J. Thiagarajan; Mohit Shah; Karthikeyan Natesan Ramamurthy; Andreas Spanias

Building feature extraction approaches that can effectively characterize natural environment sounds is challenging due to their dynamic nature. In this paper, we develop a framework for feature extraction and for obtaining semantic inferences from such data. In particular, we propose a new pooling strategy for deep architectures that preserves the temporal dynamics in the resulting representation. By constructing an ensemble of semantic embeddings, we employ an l1-reconstruction-based prediction algorithm for estimating the relevant tags. We evaluate our approach on challenging environmental sound recognition datasets and show that the proposed features outperform traditional spectral features.
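
A hedged sketch of the two ideas in the abstract: order-preserving pooling over consecutive segments, and tag prediction by sparse l1 reconstruction of a query embedding from training embeddings. The segment count and Lasso penalty are illustrative assumptions.

```python
# Sketch of temporal pooling and l1-reconstruction tag prediction.
import numpy as np
from sklearn.linear_model import Lasso

def temporal_pool(frame_feats, n_segments=4):
    """Mean-pool within consecutive segments and concatenate in order,
    so the pooled vector keeps coarse temporal dynamics."""
    chunks = np.array_split(frame_feats, n_segments, axis=0)
    return np.concatenate([c.mean(axis=0) for c in chunks])

def predict_tags(query, train_embeds, train_tags, alpha=0.01):
    """Reconstruct the query from training embeddings with an l1 penalty;
    propagate tags using the nonnegative reconstruction weights.
    train_embeds: (n_train, d); train_tags: (n_train, n_tags) binary."""
    lasso = Lasso(alpha=alpha, positive=True)
    lasso.fit(train_embeds.T, query)       # query as sparse combination
    w = lasso.coef_                        # one weight per training clip
    scores = w @ train_tags
    return scores / (w.sum() + 1e-10)      # normalized tag relevances
```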


Signal Processing Systems | 2018

A Fixed-Point Neural Network Architecture for Speech Applications on Resource Constrained Hardware

Mohit Shah; Sairam Arunachalam; Jingcheng Wang; David T. Blaauw; Dennis Sylvester; Hun-Seok Kim; Jae-sun Seo; Chaitali Chakrabarti

Speech recognition and keyword detection are becoming increasingly popular applications for mobile systems. These applications have large memory and compute requirements, making their implementation on a mobile device quite challenging. In this paper, we design low-cost neural network architectures for keyword detection and speech recognition. We present techniques to reduce the memory requirement by scaling down the precision of the weights and biases without compromising detection/recognition performance. Experiments conducted on the Resource Management (RM) database show that, for the keyword detection network, representing the weights with 5 bits results in a six-fold reduction in memory compared to a floating-point implementation, with very little loss in performance. Similarly, for the speech recognition network, representing the weights with 6 bits results in a five-fold reduction in memory while maintaining an error rate similar to a floating-point implementation. Preliminary results in 40 nm TSMC technology show that the networks have fairly small power consumption: 11.12 mW for the keyword detection network and 51.96 mW for the speech recognition network, making these designs suitable for mobile devices.
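
A back-of-the-envelope check of the reported memory reductions, using an assumed weight count purely for illustration: 32-bit floats versus 6- and 5-bit weights give roughly 5.3x and 6.4x reductions, consistent with the five- and six-fold figures above.

```python
# Memory footprint at different weight precisions (hypothetical sizes).
def weight_memory_kb(n_weights, bits_per_weight):
    return n_weights * bits_per_weight / 8 / 1024

n = 1_000_000                              # hypothetical weight count
for bits in (32, 6, 5):
    print(f"{bits:2d} bits -> {weight_memory_kb(n, bits):8.1f} KB")
# 32 -> 3906.2 KB; 6 -> 732.4 KB (~5.3x less); 5 -> 610.4 KB (~6.4x less)
```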


Annual Conference & Exposition | 2010

Advanced Functions of Java-DSP for Use in Electrical and Computer Engineering Courses

Robert Santucci; Tushar Gupta; Mohit Shah; Andreas Spanias

Collaboration


Dive into Mohit Shah's collaborations.

Top Co-Authors

Gordon Wichern

Arizona State University

Jae-sun Seo

Arizona State University
