Publications


Featured research published by Carolina Parada.


International Conference on Acoustics, Speech, and Signal Processing | 2014

Small-footprint keyword spotting using deep neural networks

Guoguo Chen; Carolina Parada; Georg Heigold

Our application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision. To meet these requirements, we propose a simple approach based on deep neural networks. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s), followed by a posterior handling method that produces a final confidence score. Keyword recognition results achieve a 45% relative improvement with respect to a competitive Hidden Markov Model-based system, while performance in the presence of babble noise shows a 39% relative improvement.
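
As a rough illustration of the posterior-handling step the abstract describes, the sketch below (with hypothetical window sizes and label counts) smooths frame-level DNN posteriors over a trailing window and combines the per-label maxima into a single confidence score; the paper's exact smoothing and scoring constants may differ.

```python
import numpy as np

def smooth_posteriors(posteriors, win=30):
    """Average each label's frame posteriors over a trailing window."""
    # posteriors: [num_frames, num_labels] per-frame DNN softmax outputs
    smoothed = np.zeros_like(posteriors)
    for t in range(len(posteriors)):
        lo = max(0, t - win + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)
    return smoothed

def confidence(smoothed, win=100):
    """Combine per-label maxima in a sliding window into one score
    (geometric mean over labels), giving a frame-level keyword score."""
    num_frames, num_labels = smoothed.shape
    scores = np.zeros(num_frames)
    for t in range(num_frames):
        lo = max(0, t - win + 1)
        maxima = smoothed[lo:t + 1].max(axis=0)  # best posterior per label
        scores[t] = maxima.prod() ** (1.0 / num_labels)
    return scores

# Usage: fire a detection when the confidence crosses a tuned threshold.
posteriors = np.random.dirichlet(np.ones(3), size=200)  # stand-in for DNN output
print(confidence(smooth_posteriors(posteriors)).max())
```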


IEEE Automatic Speech Recognition and Understanding Workshop | 2009

Query-by-example Spoken Term Detection For OOV terms

Carolina Parada; Abhinav Sethy; Bhuvana Ramabhadran

The goal of Spoken Term Detection (STD) technology is to allow open vocabulary search over large collections of speech content. In this paper, we address cases where the search terms of interest (queries) are acoustic examples, provided either by identifying a region of interest in a speech stream or by speaking the query term. Queries often relate to named entities and foreign words, which typically have poor coverage in the vocabulary of Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Throughout this paper, we focus on query-by-example search for such out-of-vocabulary (OOV) query terms. We build upon a finite state transducer (FST) based search and indexing system [1] to address query-by-example search for OOV terms by representing both the query and the index as phonetic lattices from the output of an LVCSR system. We provide results comparing different representations and generation mechanisms for both queries and indexes built with word and combined word and subword units [2]. We also present a two-pass method in which query-by-example search using the best hit identified in an initial pass augments the STD search results. The results demonstrate that query-by-example search can yield significantly better performance, measured using Actual Term-Weighted Value (ATWV), of 0.479 when compared to a baseline ATWV of 0.325 that uses reference pronunciations for OOVs. Further improvements can be obtained with the proposed two-pass approach and by filtering using the expected unigram counts from the LVCSR system's lexicon.
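
The ATWV figures quoted above come from the standard NIST term-weighted value metric. A minimal sketch of that computation, assuming hypothetical per-term hit and false-alarm counts (beta = 999.9 is the customary NIST weight; one non-target trial per second of speech is a common simplification):

```python
def term_weighted_value(terms, speech_seconds, beta=999.9):
    """TWV = 1 - mean over terms of (P_miss + beta * P_fa).
    terms: list of dicts with true occurrences, hits, and false alarms."""
    penalties = []
    for t in terms:
        p_miss = 1.0 - t["hits"] / t["n_true"] if t["n_true"] else 0.0
        n_nontarget = speech_seconds - t["n_true"]  # trials minus true occurrences
        p_fa = t["false_alarms"] / n_nontarget
        penalties.append(p_miss + beta * p_fa)
    return 1.0 - sum(penalties) / len(penalties)

# Hypothetical example on a 100-hour corpus:
terms = [{"n_true": 10, "hits": 7, "false_alarms": 2},
         {"n_true": 4, "hits": 3, "false_alarms": 0}]
print(term_weighted_value(terms, speech_seconds=100 * 3600))
```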


Spoken Language Technology Workshop | 2010

Query language modeling for voice search

Ciprian Chelba; Johan Schalkwyk; Thorsten Brants; Vida Ha; Boulos Harb; Will Neveitt; Carolina Parada; Peng Xu

The paper presents an empirical exploration of google.com query stream language modeling. We describe the normalization of the typed query stream, resulting in out-of-vocabulary (OOV) rates below 1% for a one-million-word vocabulary. We present a comprehensive set of experiments that guided the design decisions for a voice search service. In the process we re-discovered a lesser-known interaction between Kneser-Ney smoothing and entropy pruning, and found empirical evidence that hints at non-stationarity of the query stream, as well as strong dependence on various English locales (USA, Britain, and Australia).
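
A toy sketch of the OOV-rate measurement described above, with an illustrative normalization (lowercasing and keeping alphanumeric tokens); the paper's actual normalization pipeline and vocabulary are far more elaborate.

```python
import re
from collections import Counter

def normalize(query):
    """Toy text normalization: lowercase and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", query.lower())

def oov_rate(queries, vocab):
    """Fraction of running tokens in the query stream outside the vocabulary."""
    counts = Counter(tok for q in queries for tok in normalize(q))
    total = sum(counts.values())
    oov = sum(c for tok, c in counts.items() if tok not in vocab)
    return oov / total if total else 0.0

vocab = {"weather", "in", "new", "york", "pizza", "near", "me"}
queries = ["Weather in New York", "pizza near me", "xyzzy recipe"]
print(f"OOV rate: {oov_rate(queries, vocab):.1%}")
```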


International Conference on Acoustics, Speech, and Signal Processing | 2016

Personalized speech recognition on mobile devices

Ian McGraw; Rohit Prabhavalkar; Raziel Alvarez; Montse Gonzalez Arenas; Kanishka Rao; David Rybach; Ouais Alsharif; Hasim Sak; Alexander H. Gruenstein; Francoise Beaufays; Carolina Parada

We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
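
A minimal sketch of SVD-based weight compression of the kind mentioned above, assuming a hypothetical dense weight matrix: the layer stores two low-rank factors, (m + n) x r parameters instead of m x n.

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate W (m x n) by U_r @ V_r with (m + n) * rank parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into one factor
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.randn(640, 2048)     # hypothetical LSTM projection weights
U_r, V_r = svd_compress(W, rank=128)
orig, compressed = W.size, U_r.size + V_r.size
print(f"params: {orig} -> {compressed} ({compressed / orig:.1%})")
print("approx error:", np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))
```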


International Conference on Acoustics, Speech, and Signal Processing | 2015

Query-by-example keyword spotting using long short-term memory networks

Guoguo Chen; Carolina Parada; Tara N. Sainath

We present a novel approach to query-by-example keyword spotting (KWS) using a long short-term memory (LSTM) recurrent neural network-based feature extractor. In our approach, we represent each keyword using a fixed-length feature vector obtained by running the keyword audio through a word-based LSTM acoustic model. We use the activations prior to the softmax layer of the LSTM as our keyword-vector. At runtime, we detect the keyword by extracting the same feature vector from a sliding window and computing a simple similarity score between this test vector and the keyword vector. With clean speech, we achieve 86% relative false rejection rate reduction at 0.5% false alarm rate when compared to a competitive phoneme posteriorgram with dynamic time warping KWS system, while the reduction in the presence of babble noise is 67%. Our system has a small memory footprint, low computational cost, and high precision, making it suitable for on-device applications.
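
A sketch of the runtime matching step, under the assumption that fixed-length keyword vectors are already available (standing in for the pre-softmax LSTM activations): score each sliding-window vector against the enrolled keyword vector with cosine similarity and fire above a threshold.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def detect(keyword_vec, window_vecs, threshold=0.8):
    """Score each sliding-window vector against the enrolled keyword vector."""
    scores = [cosine(keyword_vec, v) for v in window_vecs]
    return scores, any(s >= threshold for s in scores)

rng = np.random.default_rng(0)
keyword_vec = rng.normal(size=64)                 # stand-in for the keyword-vector
window_vecs = [rng.normal(size=64) for _ in range(50)]
window_vecs[20] = keyword_vec + 0.1 * rng.normal(size=64)  # plant a near-match
scores, fired = detect(keyword_vec, window_vecs)
print(fired, max(scores))
```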


International Conference on Acoustics, Speech, and Signal Processing | 2010

Balancing false alarms and hits in Spoken Term Detection

Carolina Parada; Abhinav Sethy; Bhuvana Ramabhadran

This paper presents methods to improve retrieval of Out-Of-Vocabulary (OOV) terms in a Spoken Term Detection (STD) system. We demonstrate that automated tagging of OOV regions helps to reduce false alarms, while incorporating phonetic confusability increases the hits. Additional features that boost the probability of a hit in accordance with the number of neighboring hits for the same query, along with query-length normalization, also improve the overall performance of the spoken-term detection system. We show that these methods can be combined effectively to provide a relative improvement of 21% in Average Term Weighted Value (ATWV) on a 100-hour corpus with 1290 OOV-only queries, and a 2% relative improvement on the NIST 2006 STD task, where only 16 of the 1107 queries were OOV terms. Lastly, we present results to show that the proposed methods are general enough to work well in query-by-example based spoken-term detection, and in mismatched situations where the representations of the index being searched and of the queries are not generated by the same system.
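
As a rough illustration of two of the score adjustments mentioned above, the sketch below applies query-length normalization to a hit's log-score and boosts it by the number of neighboring hits for the same query; the weights alpha and beta are purely illustrative, not the paper's features or values.

```python
import math

def adjusted_score(log_score, query_len, neighbor_hits, alpha=1.0, beta=0.1):
    """Length-normalize the hit's log-score, then boost by nearby hits.
    alpha and beta are illustrative tuning weights."""
    normalized = log_score / max(query_len, 1)   # per-phone log-score
    boost = beta * math.log1p(neighbor_hits)     # more hits nearby -> more trust
    return alpha * normalized + boost

# A 12-phone query hit with 3 other hits for the same query in the document:
print(adjusted_score(log_score=-18.0, query_len=12, neighbor_hits=3))
```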


International Conference on Acoustics, Speech, and Signal Processing | 2015

Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks

Rohit Prabhavalkar; Raziel Alvarez; Carolina Parada; Preetum Nakkiran; Tara N. Sainath

We explore techniques to improve the robustness of small-footprint keyword spotting models based on deep neural networks (DNNs) in the presence of background noise and in far-field conditions. We find that system performance can be improved significantly, with relative improvements of up to 75% in far-field conditions, by employing a combination of multi-style training and a proposed novel formulation of automatic gain control (AGC) that estimates the levels of both speech and background noise. Further, we find that these techniques allow us to achieve competitive performance even when applied to DNNs with an order of magnitude fewer parameters than our baseline.
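
A crude sketch of an AGC loop in the spirit described above, assuming a simple energy threshold as the speech/noise decision (the paper's estimator is more sophisticated): track running speech and background levels and scale each frame toward a target speech level.

```python
import numpy as np

def agc(frames, target_rms=0.1, attack=0.9):
    """Per-frame AGC: classify each frame as speech or background with a
    crude energy split, then apply a smoothed gain driving speech toward
    target_rms. All constants here are illustrative."""
    speech_level, noise_level = 1e-3, 1e-4
    out = []
    for frame in frames:
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-9
        if rms > 2 * noise_level:        # crude speech/background decision
            speech_level = attack * speech_level + (1 - attack) * rms
        else:
            noise_level = attack * noise_level + (1 - attack) * rms
        out.append(frame * (target_rms / speech_level))
    return np.concatenate(out)

audio = np.random.randn(16000) * 0.02    # 1 s of quiet noise-like input
frames = np.split(audio, 100)            # 10 ms frames at 16 kHz
print(np.sqrt(np.mean(agc(frames) ** 2)))
```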


Conference of the International Speech Communication Association | 2016

Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection

Ruben Zazo; Tara N. Sainath; Gabor Simko; Carolina Parada

Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD that tackles feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Network) architecture fed directly with the raw waveform. We show that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially in noisy environments. In addition, the CLDNN, which takes advantage of both frequency modeling with the CNN and temporal modeling with the LSTM, is a much better model for VAD than a DNN. The proposed system achieves over 78% relative improvement in False Alarms (FA) at the operating point of 2% False Rejects (FR) in both clean and noisy conditions, compared to a DNN of comparable size trained with log-mel features. In addition, we study the impact of the model size and the learned features to provide a better understanding of the proposed architecture.
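
A minimal Keras sketch of a raw-waveform CLDNN for per-frame VAD, with assumed layer sizes and frame geometry (the paper's exact topology is not reproduced): a 1-D convolution over raw samples acts as a learned filterbank, LSTMs model temporal context, and a sigmoid output gives a speech/non-speech posterior per frame.

```python
import tensorflow as tf

frame_samples = 400   # assumed 25 ms frames at 16 kHz
num_frames = 100      # assumed utterance chunk length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_frames, frame_samples)),
    # Convolution over raw samples within each frame: a learned filterbank.
    tf.keras.layers.TimeDistributed(tf.keras.layers.Reshape((frame_samples, 1))),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Conv1D(40, 128, strides=64, activation="relu")),
    tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalMaxPooling1D()),
    # LSTM layers model temporal context across frames.
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    # Per-frame speech / non-speech posterior.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```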


Text, Speech and Dialogue | 2008

Toward the Ultimate ASR Language Model

Frederick Jelinek; Carolina Parada

The n-gram model is the standard for large vocabulary speech recognizers, and many attempts have been made to improve on it: language models have been proposed based on grammatical analysis, artificial neural networks, random forests, etc. While the latter give somewhat better recognition results than the n-gram model, they are not practical, particularly when large training databases (e.g., from the World Wide Web) are available. So should language model research be abandoned as a hopeless endeavor? This talk will discuss a plan to determine how large a decrease in recognition error rate is conceivable, and propose a game-based method to determine what parameters the ultimate language model should depend on.


North American Chapter of the Association for Computational Linguistics | 2010

Contextual Information Improves OOV Detection in Speech

Carolina Parada; Mark Dredze; Denis Filimonov; Frederick Jelinek
