Vincent Vanhoucke | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Vincent Vanhoucke is active.

Explore More

Publication

Featured researches published by Vincent Vanhoucke.

computer vision and pattern recognition | 2015

Going deeper with convolutions

Christian Szegedy; Wei Liu; Yangqing Jia; Pierre Sermanet; Scott E. Reed; Dragomir Anguelov; Dumitru Erhan; Vincent Vanhoucke; Andrew Rabinovich

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

IEEE Signal Processing Magazine | 2012

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

Geoffrey E. Hinton; Li Deng; Dong Yu; George E. Dahl; Abdel-rahman Mohamed; Navdeep Jaitly; Andrew W. Senior; Vincent Vanhoucke; Patrick Nguyen; Tara N. Sainath; Brian Kingsbury

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

computer vision and pattern recognition | 2016

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy; Vincent Vanhoucke; Sergey Ioffe; Jonathon Shlens; Zbigniew Wojna

Convolutional networks are at the core of most state of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set demonstrate substantial gains over the state of the art: 21:2% top-1 and 5:6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3:5% top-5 error and 17:3% top-1 error on the validation set and 3:6% top-5 error on the official test set.

international conference on acoustics, speech, and signal processing | 2013

On rectified linear units for speech processing

Matthew D. Zeiler; Marc'Aurelio Ranzato; Rajat Monga; Mark Mao; Ke Yang; Quoc V. Le; Patrick Nguyen; Andrew W. Senior; Vincent Vanhoucke; Jeffrey Dean; Geoffrey E. Hinton

Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.

international conference on acoustics, speech, and signal processing | 2013

Multilingual acoustic models using distributed deep neural networks

Georg Heigold; Vincent Vanhoucke; Andrew W. Senior; Patrick Nguyen; Marc'Aurelio Ranzato; Matthieu Devin; Jeffrey Dean

Todays speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual).

british machine vision conference | 2015

Real-Time Pedestrian Detection With Deep Network Cascades

Anelia Angelova; Alex Krizhevsky; Vincent Vanhoucke; Abhijit Ogale; Dave Ferguson

We present a new real-time approach to object detection that exploits the efficiency of cascade classifiers with the accuracy of deep neural networks. Deep networks have been shown to excel at classification tasks, and their ability to operate on raw pixel input without the need to design special features is very appealing. However, deep nets are notoriously slow at inference time. In this paper, we propose an approach that cascades deep nets and fast features, that is both very fast and very accurate. We apply it to the challenging task of pedestrian detection. Our algorithm runs in real-time at 15 frames per second. The resulting approach achieves a 26.2% average miss rate on the Caltech Pedestrian detection benchmark, which is competitive with the very best reported results. It is the first work we are aware of that achieves very high accuracy while running in real-time.

international conference on acoustics, speech, and signal processing | 2014

Asynchronous stochastic optimization for sequence training of deep neural networks

Georg Heigold; Erik McDermott; Vincent Vanhoucke; Andrew W. Senior; Michiel Bacchiani

This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization.

international conference on acoustics, speech, and signal processing | 2013

Multiframe deep neural networks for acoustic modeling

Vincent Vanhoucke; Matthieu Devin; Georg Heigold

Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows going up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.

international conference on robotics and automation | 2015

Pedestrian detection with a Large-Field-Of-View deep network

Anelia Angelova; Alex Krizhevsky; Vincent Vanhoucke

Pedestrian detection is of crucial importance to autonomous driving applications. Methods based on deep learning have shown significant improvements in accuracy, which makes them particularly suitable for applications, such as pedestrian detection, where reducing the miss rate is very important. Although they are accurate, their runtime has been at best in seconds per image, which makes them not practical for onboard applications. We present a Large-Field-Of-View (LFOV) deep network for pedestrian detection, that can achieve high accuracy and is designed to make deep networks work faster for detection problems. The idea of the proposed Large-Field-of-View deep network is to learn to make classification decisions simultaneously and accurately at multiple locations. The LFOV network processes larger image areas at much faster speeds than typical deep networks have been able to, and can intrinsically reuse computations. Our pedestrian detection solution, which is a combination of a LFOV network and a standard deep network, works at 280 ms per image on GPU and achieves 35.85 average miss rate on the Caltech Pedestrian Detection Benchmark.

IEEE Transactions on Speech and Audio Processing | 2004

Mixtures of inverse covariances

Vincent Vanhoucke; Ananth Sankar

We describe a model which approximates full covariances in a Gaussian mixture while reducing significantly both the number of parameters to estimate and the computations required to evaluate the Gaussian likelihoods. In this model, the inverse covariance of each Gaussian in the mixture is expressed as a linear combination of a small set of prototype matrices that are shared across components. In addition, we demonstrate the benefits of a subspace-factored extension of this model when representing independent or near-independent product densities. We present a maximum likelihood estimation algorithm for these models, as well as a practical method for implementing it. We show through experiments performed on a variety of speech recognition tasks that this model significantly outperforms a diagonal covariance model, while using far fewer Gaussian-specific parameters. Experiments also demonstrate that a better speed/accuracy tradeoff can be achieved on a real-time speech recognition system.

Explore More