Publications


Featured research published by Quoc V. Le.


computer vision and pattern recognition | 2011

Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis

Quoc V. Le; Will Y. Zou; Serena Yeung; Andrew Y. Ng

Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations. By replacing hand-designed features with our learned features, we achieve classification results superior to all previously published results on the Hollywood2, UCF, KTH and YouTube action recognition datasets. On the challenging Hollywood2 and YouTube action datasets we obtain 53.3% and 75.8%, respectively, which are approximately 5% better than the current best published results. Further benefits of this method, such as the ease of training and the efficiency of training and prediction, are also discussed. You can download our code and learned spatio-temporal features here: http://ai.stanford.edu/~wzou/
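
The core building block described in this abstract is the ISA response: a bank of linear filters followed by square pooling within subspaces, which yields locally invariant features. The sketch below illustrates that response computation on flattened video blocks; the random filters, patch sizes and subspace grouping are illustrative assumptions, not the released Stanford code.

    import numpy as np

    def isa_responses(patches, W, subspace_size=2):
        """Compute ISA-style invariant responses for flattened video patches.

        patches: (n_patches, patch_dim) flattened spatio-temporal blocks.
        W: (n_filters, patch_dim) first-layer filters (random here, for illustration).
        Each group of `subspace_size` consecutive filters forms one subspace; the
        response is the L2 norm of the filter outputs within that subspace.
        """
        linear = patches @ W.T                       # (n_patches, n_filters)
        squared = linear ** 2
        n_subspaces = W.shape[0] // subspace_size
        grouped = squared[:, :n_subspaces * subspace_size]
        grouped = grouped.reshape(patches.shape[0], n_subspaces, subspace_size)
        return np.sqrt(grouped.sum(axis=2))          # (n_patches, n_subspaces)

    # Illustrative usage with random data and random (untrained) filters.
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(100, 16 * 16 * 10))   # e.g. 16x16x10 video blocks
    W = rng.normal(size=(200, 16 * 16 * 10))
    features = isa_responses(patches, W)
    print(features.shape)                            # (100, 100)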


international joint conference on natural language processing | 2015

Addressing the Rare Word Problem in Neural Machine Translation

Thang Luong; Ilya Sutskever; Quoc V. Le; Oriol Vinyals; Wojciech Zaremba

Neural Machine Translation (NMT) is a new approach to machine translation that has shown promising results comparable to traditional approaches. A significant weakness of conventional NMT systems is their inability to correctly translate very rare words: end-to-end NMT systems tend to have relatively small vocabularies with a single unk symbol that represents every possible out-of-vocabulary (OOV) word. In this paper, we propose and implement an effective technique to address this problem. We train an NMT system on data that is augmented by the output of a word alignment algorithm, allowing the NMT system to emit, for each OOV word in the target sentence, the position of its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates every OOV word using a dictionary. Our experiments on the WMT’14 English-to-French translation task show that this method provides a substantial improvement of up to 2.8 BLEU points over an equivalent NMT system that does not use this technique. With 37.5 BLEU points, our NMT system is the first to surpass the best result achieved on a WMT’14 contest task.
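
A rough sketch of the dictionary-based post-processing step described above: the decoder emits unk tokens annotated with the position of the aligned source word, and each is replaced by a dictionary translation or an identity copy of that source word. The positional token format and the toy data are assumptions for illustration, not the paper's exact annotation scheme.

    import re

    def replace_unks(source_tokens, output_tokens, dictionary):
        """Post-process NMT output containing positional unk tokens.

        Assumed token format (illustrative): '<unk_+1>' means the OOV word aligns
        to the source word one position to the right of the current target
        position. Each such token is replaced by a dictionary translation of the
        aligned source word, or by the source word itself when no entry exists.
        """
        result = []
        for i, tok in enumerate(output_tokens):
            m = re.fullmatch(r"<unk_([+-]?\d+)>", tok)
            if m is None:
                result.append(tok)
                continue
            src_pos = i + int(m.group(1))
            if 0 <= src_pos < len(source_tokens):
                src_word = source_tokens[src_pos]
                result.append(dictionary.get(src_word, src_word))
            else:
                result.append("<unk>")
        return result

    # Toy usage: the OOV proper noun is copied from the source sentence.
    src = "Le château de Villandry est magnifique".split()
    out = ["The", "castle", "of", "<unk_0>", "is", "magnificent"]
    print(replace_unks(src, out, {}))   # ['The', 'castle', 'of', 'Villandry', 'is', 'magnificent']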


international conference on acoustics, speech, and signal processing | 2013

On rectified linear units for speech processing

Matthew D. Zeiler; Marc'Aurelio Ranzato; Rajat Monga; Mark Mao; Ke Yang; Quoc V. Le; Patrick Nguyen; Andrew W. Senior; Vincent Vanhoucke; Jeffrey Dean; Geoffrey E. Hinton

Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large-vocabulary speech recognition task, achieving lower word error rates than a logistic network with the same topology. Similarly, in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.
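
A minimal sketch contrasting the two point-wise non-linearities discussed above, applied to one hidden layer. The layer sizes and random features are illustrative only; the point is that the rectifier is linear for positive inputs and exactly zero otherwise, which makes its activations sparse.

    import numpy as np

    def logistic(x):
        """Point-wise logistic non-linearity: squashes inputs to (0, 1)."""
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        """Rectified linear unit: identity for positive inputs, zero otherwise."""
        return np.maximum(0.0, x)

    # One hidden layer applied to a batch of random, illustrative acoustic features.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 40))           # batch of 8 frames, 40-dim features
    W = rng.normal(size=(40, 128)) * 0.1   # linear projection
    b = np.zeros(128)

    h_logistic = logistic(x @ W + b)
    h_relu = relu(x @ W + b)
    print(h_logistic.mean(), (h_relu > 0).mean())   # ReLU activations are sparse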


knowledge discovery and data mining | 2007

A scalable modular convex solver for regularized risk minimization

Choon Hui Teo; Alexander J. Smola; S. V. N. Vishwanathan; Quoc V. Le

A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Logistic Regression, Conditional Random Fields (CRFs), and the Lasso, amongst others. This paper describes the theory and implementation of a highly scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data locality, and can deal with regularizers such as l1 and l2 penalties. At present, our solver implements 20 different estimation problems, can be easily extended, scales to millions of observations, and is up to 10 times faster than specialized solvers for many applications. The open source code is freely available as part of the ELEFANT toolbox.
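
The objective such a solver targets has the generic form J(w) = lambda * Omega(w) + (1/m) * sum_i loss(x_i, y_i; w), and the modularity comes from swapping the loss and the regularizer. The sketch below only illustrates that template with interchangeable pieces; it is not the paper's bundle-method solver or the ELEFANT implementation.

    import numpy as np

    # Regularized risk: J(w) = lam * Omega(w) + (1/m) * sum_i loss(x_i, y_i; w).
    # Different loss/regularizer pairs recover linear SVMs, logistic regression, the Lasso, etc.

    def hinge_loss(X, y, w):        # linear SVM loss, labels y in {-1, +1}
        return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

    def logistic_loss(X, y, w):     # logistic regression loss
        return np.log1p(np.exp(-y * (X @ w))).mean()

    def l2_penalty(w):
        return 0.5 * np.dot(w, w)

    def l1_penalty(w):
        return np.abs(w).sum()

    def regularized_risk(X, y, w, loss, penalty, lam):
        return lam * penalty(w) + loss(X, y, w)

    # Illustrative evaluation on random data with two loss/regularizer combinations.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = np.sign(rng.normal(size=50))
    w = rng.normal(size=5)
    print(regularized_risk(X, y, w, hinge_loss, l2_penalty, lam=0.1))     # SVM-style objective
    print(regularized_risk(X, y, w, logistic_loss, l1_penalty, lam=0.1))  # sparse logistic regression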


international conference on acoustics, speech, and signal processing | 2016

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

William Chan; Navdeep Jaitly; Quoc V. Le; Oriol Vinyals

We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models, making it not only an end-to-end trained system but also an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.
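
The listener's pyramidal structure halves the time resolution at each layer by concatenating adjacent frames, which shortens the sequence the attention-based speller must attend over. The sketch below shows only that reduction step; the actual listener stacks bidirectional LSTMs on top of it, and the layer count and feature sizes here are assumptions.

    import numpy as np

    def pyramid_step(h):
        """One pyramidal reduction: concatenate adjacent time steps, halving length.

        h: (time, dim) sequence of encoder states. In LAS this feeds the next
        recurrent layer; here we only illustrate the time-resolution reduction.
        """
        t, d = h.shape
        if t % 2:                        # drop a trailing frame to make the length even
            h = h[:-1]
        return h.reshape(-1, 2 * d)      # (time // 2, 2 * dim)

    # Illustrative: three pyramid layers reduce a 240-frame filter-bank sequence 8x.
    rng = np.random.default_rng(0)
    h = rng.normal(size=(240, 40))       # 240 frames of 40-dim filter bank features
    for _ in range(3):
        h = pyramid_step(h)
    print(h.shape)                       # (30, 320)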


international conference on robotics and automation | 2009

High-accuracy 3D sensing for mobile manipulation: Improving object detection and door opening

Morgan Quigley; Siddharth Batra; Stephen Gould; Ellen Klingbeil; Quoc V. Le; Ashley Wellman; Andrew Y. Ng

High-resolution 3D scanning can improve the performance of object detection and door opening, two tasks critical to the operation of mobile manipulators in cluttered homes and workplaces. We discuss how high-resolution depth information can be combined with visual imagery to improve the performance of object detection beyond what is (currently) achievable with 2D images alone, and we present door-opening and inventory-taking experiments.


intelligent robots and systems | 2010

Grasping novel objects with depth segmentation

Deepak Rao; Quoc V. Le; Thanathorn Phoka; Morgan Quigley; Attawith Sudsang; Andrew Y. Ng

We consider the task of grasping novel objects and cleaning fairly cluttered tables with many novel objects. Recent successful approaches employ machine learning algorithms to identify points in the scene that the robot should grasp. In this paper, we show that the task can be significantly simplified by using segmentation, especially with depth information. A supervised localization method is employed to select graspable segments. We also propose a shape completion and grasp planning method that takes partial 3D information and plans the most stable grasping strategy. Extensive experiments on our robot demonstrate the effectiveness of our approach.
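
A rough sketch of the pipeline shape this abstract implies: segment the depth map, score each segment with a supervised classifier, and pick the most graspable one. The features, weights and toy segments below are assumptions for illustration, not the authors' localization method.

    import numpy as np

    def segment_features(depth_segment):
        """Illustrative per-segment features: size, mean height, and 2D extent.

        The actual features and classifier used for supervised localization in
        the paper are not reproduced here; these are placeholder choices.
        """
        ys, xs = np.nonzero(depth_segment > 0)
        heights = depth_segment[ys, xs]
        return np.array([len(xs), heights.mean(),
                         xs.max() - xs.min() + 1, ys.max() - ys.min() + 1])

    def pick_graspable_segment(segments, weights):
        """Score each candidate segment with a linear classifier and return the best index."""
        scores = [weights @ segment_features(s) for s in segments]
        return int(np.argmax(scores))

    # Toy example: two depth "segments" on a 10x10 table-top depth map.
    seg_a = np.zeros((10, 10)); seg_a[2:5, 2:5] = 0.8   # small, tall object
    seg_b = np.zeros((10, 10)); seg_b[6:9, 1:9] = 0.1   # large, flat region
    weights = np.array([-0.01, 1.0, -0.05, -0.05])       # illustrative "learned" weights
    print(pick_graspable_segment([seg_a, seg_b], weights))  # 0: prefers the tall, compact object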


international conference on robotics and automation | 2010

Learning to grasp objects with multiple contact points

Quoc V. Le; David Kamm; Arda F. Kara; Andrew Y. Ng

We consider the problem of grasping novel objects and its application to cleaning a desk. A recent successful approach applies machine learning to learn one grasp point in an image and a point cloud. Although those methods are able to generalize to novel objects, they yield suboptimal results because they rely on a motion planner for finger placements. In this paper, we extend their method to accommodate grasps with multiple contacts. This approach works well for many human-made objects because it models the way we grasp objects. To further improve grasping, we also use a method that learns the ranking between candidates. The experiments show that our method is highly effective compared to a state-of-the-art competitor.
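
The candidate-ranking idea mentioned above can be illustrated with a hinge-style pairwise ranking loss over grasp-candidate features: the model should score the better candidate of each pair above the worse one by a margin. The linear scorer, features and training loop below are assumptions, not the paper's model.

    import numpy as np

    def pairwise_ranking_loss(w, better, worse):
        """Hinge-style pairwise ranking loss.

        better / worse: (n_pairs, dim) feature vectors of grasp candidates where
        the first of each pair should be ranked above the second. A candidate is
        scored as w @ x; pairs whose margin falls below 1 are penalized.
        """
        margins = better @ w - worse @ w
        return np.maximum(0.0, 1.0 - margins).mean()

    # Toy training loop with (sub)gradient descent on random candidate features.
    rng = np.random.default_rng(0)
    better = rng.normal(loc=0.5, size=(200, 8))
    worse = rng.normal(loc=-0.5, size=(200, 8))
    w = np.zeros(8)
    for _ in range(100):
        active = (better @ w - worse @ w) < 1.0                  # pairs still violating the margin
        grad = -(better[active] - worse[active]).sum(axis=0) / len(better)
        w -= 0.1 * grad
    print(pairwise_ranking_loss(w, better, worse))               # loss shrinks as ranking improves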


international conference on machine learning | 2005

Heteroscedastic Gaussian process regression

Quoc V. Le; Alexander J. Smola; Stéphane Canu

This paper presents an algorithm to estimate simultaneously both the mean and the variance of a nonparametric regression problem. The key point is that we are able to estimate the variance locally, unlike standard Gaussian Process regression or SVMs. This means that our estimator adapts to the local noise. The problem is cast in the setting of maximum a posteriori estimation in exponential families. Unlike previous work, we obtain a convex optimization problem which can be solved via Newton's method.
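
The central idea is to fit an input-dependent variance alongside the mean, so the estimator adapts to local noise. Below is a minimal sketch of the heteroscedastic Gaussian negative log-likelihood such an estimator minimizes; it is a simplified illustration, not the paper's MAP-in-exponential-families formulation or its Newton solver.

    import numpy as np

    def heteroscedastic_nll(y, mean, log_var):
        """Negative log-likelihood of y under N(mean(x), exp(log_var(x))).

        Unlike standard GP regression with a single global noise level, the
        variance term differs per input, so the fit can adapt to local noise.
        """
        var = np.exp(log_var)
        return 0.5 * np.mean(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

    # Toy data whose noise level grows with x.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 200)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05 + 0.5 * x)

    # Compare a constant-variance fit with an input-dependent one.
    const_nll = heteroscedastic_nll(y, np.sin(2 * np.pi * x), np.full_like(x, np.log(0.1)))
    local_nll = heteroscedastic_nll(y, np.sin(2 * np.pi * x), 2 * np.log(0.05 + 0.5 * x))
    print(const_nll, local_nll)   # the locally adapted variance fits the data better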


meeting of the association for computational linguistics | 2017

Neural symbolic machines: Learning semantic parsers on freebase with weak supervision

Chen Liang; Jonathan Berant; Quoc V. Le; Kenneth D. Forbus; Ni Lao

Harnessing the statistical power of neural networks to perform language understanding and symbolic reasoning is difficult when it requires executing efficient discrete operations against a large knowledge base. In this work, we introduce the Neural Symbolic Machine (NSM), which contains (a) a neural “programmer”, i.e., a sequence-to-sequence model that maps language utterances to programs and utilizes a key-variable memory to handle compositionality, and (b) a symbolic “computer”, i.e., a Lisp interpreter that performs program execution and helps find good programs by pruning the search space. We apply REINFORCE to directly optimize the task reward of this structured prediction problem. To train with weak supervision and improve the stability of REINFORCE, we augment it with an iterative maximum-likelihood training process. NSM outperforms the state of the art on the WebQuestionsSP dataset when trained from question-answer pairs only, without requiring any feature engineering or domain-specific knowledge.
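
The training signal here is a task reward optimized with REINFORCE, stabilized by iterative maximum likelihood on the best programs found so far. The sketch below shows only the REINFORCE gradient estimate on a toy categorical "program" policy; the seq2seq programmer and the Lisp interpreter are out of scope, and all names and numbers are illustrative.

    import numpy as np

    def reinforce_gradient(logits, sampled, rewards, baseline):
        """REINFORCE gradient estimate for a categorical policy over programs.

        logits: (n_programs,) unnormalized scores; sampled: indices of sampled
        programs; rewards: task reward for each sample (e.g. answer correctness).
        A toy policy-gradient step, not the model described in the paper.
        """
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad = np.zeros_like(logits)
        for idx, r in zip(sampled, rewards):
            advantage = r - baseline
            one_hot = np.eye(len(logits))[idx]
            grad += advantage * (one_hot - probs)    # d log pi(idx) / d logits
        return grad / len(sampled)

    # Toy usage: reinforce the single "program" that earns reward 1.
    rng = np.random.default_rng(0)
    logits = np.zeros(4)
    for _ in range(200):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        sampled = rng.choice(4, size=8, p=probs)
        rewards = (sampled == 2).astype(float)       # pretend program 2 answers correctly
        logits += 0.5 * reinforce_gradient(logits, sampled, rewards, baseline=rewards.mean())
    print(np.argmax(logits))                         # 2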
