Nicolas Ballas
Université de Montréal
Publications
Featured research published by Nicolas Ballas.
international conference on computer vision | 2015
Li Yao; Atousa Torabi; Kyunghyun Cho; Nicolas Ballas; Chris Pal; Hugo Larochelle; Aaron C. Courville
Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application to video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description model. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks, so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state of the art on both BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on a new, larger and more challenging dataset of paired videos and natural language descriptions.
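A minimal sketch of the kind of soft temporal attention described in this abstract, with illustrative array names and sizes rather than the paper's exact parametrization: the caption RNN's hidden state scores each temporal segment, and the attention-weighted sum of segment features forms the context vector fed back to the decoder.

```python
# Soft temporal attention over per-segment video features, conditioned on
# the hidden state of a text-generating RNN. Shapes and parameter names
# here are assumptions for illustration, not the paper's implementation.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(segment_feats, decoder_state, W_f, W_h, w_score):
    """segment_feats: (T, D) features for T temporal segments (e.g. 3-D CNN outputs).
    decoder_state: (H,) current hidden state of the caption RNN.
    Returns the (D,) context vector and the (T,) attention weights."""
    # Unnormalized relevance score of each segment given the decoder state.
    scores = np.tanh(segment_feats @ W_f + decoder_state @ W_h) @ w_score  # (T,)
    alphas = softmax(scores)                                               # (T,)
    return alphas @ segment_feats, alphas

# Toy usage with random parameters.
T, D, H, A = 8, 16, 12, 10
rng = np.random.default_rng(0)
ctx, alphas = temporal_attention(
    rng.normal(size=(T, D)), rng.normal(size=H),
    rng.normal(size=(D, A)), rng.normal(size=(H, A)), rng.normal(size=A))
print(ctx.shape, round(float(alphas.sum()), 6))  # (16,) 1.0
```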
conference on multimedia modeling | 2015
Alexandru Lucian Ginsca; Adrian Popescu; Hervé Le Borgne; Nicolas Ballas; Phong D. Vo; Ioannis Kanellos
The availability of large annotated visual resources, such as ImageNet, recently led to important advances in image mining tasks. However, the manual annotation of such resources is cumbersome. Exploiting Web datasets as a substitute or complement is an interesting but challenging alternative. The main problems to solve are the choice of the initial dataset and the noisy character of Web text-image associations. This article presents an approach which first leverages Flickr groups to automatically build a comprehensive visual resource and then exploits it for image retrieval. Flickr groups are an interesting candidate dataset because they cover a wide range of user interests. To reduce initial noise, we introduce innovative and scalable image reranking methods. Then, we learn individual visual models for 38,500 groups using a low-level image representation. We exploit off-the-shelf linear models to ensure scalability of the learning and prediction steps. Finally, Semfeat image descriptions are obtained by concatenating prediction scores of individual models and by retaining only the most salient responses. To provide a comparison with a manually created resource, a similar pipeline is applied to ImageNet. Experimental validation is conducted on the ImageCLEF Wikipedia Retrieval 2010 benchmark, showing competitive results that demonstrate the validity of our approach.
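As an illustration of the final step sketched in this abstract (concatenating the prediction scores of per-group linear models and retaining only the most salient responses), here is a minimal sketch; the group count, feature dimension, and function names are hypothetical and not taken from the paper.

```python
# Build a Semfeat-like sparse image description from many per-group linear
# classifiers: one score per group, then keep only the top-K responses.
# All shapes and names below are illustrative assumptions.
import numpy as np

def semfeat_description(image_feat, group_weights, group_biases, top_k=50):
    """image_feat: (D,) low-level image representation.
    group_weights: (G, D) and group_biases: (G,) for G linear group models.
    Returns a (G,) vector with only the top_k salient scores kept."""
    scores = group_weights @ image_feat + group_biases  # (G,) one score per group
    sparse = np.zeros_like(scores)
    top = np.argsort(scores)[-top_k:]                   # indices of strongest responses
    sparse[top] = scores[top]
    return sparse

# Toy usage: 1,000 hypothetical groups, 128-d features, keep 10 responses.
rng = np.random.default_rng(1)
desc = semfeat_description(rng.normal(size=128),
                           rng.normal(size=(1000, 128)),
                           rng.normal(size=1000), top_k=10)
print(np.count_nonzero(desc))  # 10
```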
computer vision and pattern recognition | 2017
Tegan Maharaj; Nicolas Ballas; Anna Rohrbach; Aaron C. Courville; Chris Pal
While deep convolutional neural networks frequently approach or exceed human-level performance on benchmark tasks involving static images, extending this success to moving images is not straightforward. Video understanding is of interest for many applications, including content recommendation, prediction, summarization, event/object detection, and understanding human visual perception. However, many domains lack sufficient data to explore and perfect video models. To address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of the predictions of five different models and compare them with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features, as well as the effects of dataset size, the number of frames sampled, and vocabulary size. We show that this task is not solvable by a language model alone, that our model combining 2D and 3D visual information provides the best result, and that all models perform significantly below human level. We provide human evaluation of responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgment. We suggest avenues for improving video models, and hope that the MovieFIB challenge can be useful for measuring and encouraging progress in this field.
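A minimal sketch of the kind of fusion analysed in this abstract, under assumed shapes and parameter names: a question/blank encoding is combined with pooled 2-D frame features and a 3-D clip feature, and the joint representation scores a candidate vocabulary for the blank.

```python
# Late fusion of language, 2-D, and 3-D visual features for a
# fill-in-the-blank task. Shapes and names are illustrative assumptions,
# not the models evaluated in the paper.
import numpy as np

def fill_in_the_blank_scores(question_vec, frame_feats_2d, clip_feat_3d,
                             W_q, W_2d, W_3d, W_out):
    """question_vec: (Q,) encoding of the sentence with the blank.
    frame_feats_2d: (N, D2) per-frame 2-D CNN features (mean-pooled here).
    clip_feat_3d: (D3,) 3-D CNN feature of the clip.
    Returns unnormalized scores over a candidate vocabulary of size V."""
    fused = (question_vec @ W_q
             + frame_feats_2d.mean(axis=0) @ W_2d
             + clip_feat_3d @ W_3d)          # (H,) joint representation
    return np.tanh(fused) @ W_out            # (V,) scores for each candidate word

# Toy usage with random parameters.
Q, N, D2, D3, H, V = 32, 10, 64, 48, 40, 500
rng = np.random.default_rng(2)
scores = fill_in_the_blank_scores(
    rng.normal(size=Q), rng.normal(size=(N, D2)), rng.normal(size=D3),
    rng.normal(size=(Q, H)), rng.normal(size=(D2, H)),
    rng.normal(size=(D3, H)), rng.normal(size=(H, V)))
print(int(scores.argmax()))  # index of the highest-scoring candidate word
```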
british machine vision conference | 2016
Li Yao; Nicolas Ballas; Kyunghyun Cho; John R. Smith; Yoshua Bengio
The task of associating images and videos with natural language descriptions has attracted a great amount of attention recently. Rapid progress has been made both in developing novel algorithms and in releasing new datasets. Indeed, the state-of-the-art results on some of the standard datasets have been pushed into a regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or human evaluation. In particular, visual captioning is assumed to decompose into two steps: from visual inputs to visual concepts, and from visual concepts to natural language descriptions. An upper bound can then be obtained by assuming the first step is perfect and training only a conditional language model for the second step. We demonstrate the construction of such bounds on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite the imperfect process used for visual concept extraction in the first step and the simplicity of the language model in the second step, we show that current state-of-the-art models fall short when compared with the learned upper bounds. Furthermore, with such a bound, we quantify several important factors concerning image and video captioning: the number of visual concepts captured by different models, the trade-off between the amount of visual elements captured and their accuracy, and the intrinsic difficulty and blessing of different datasets.
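A minimal sketch of the oracle setup described in this abstract, under assumptions: the visual-to-concept step is made "perfect" by reading concepts directly from the reference caption, and those concepts become the conditioning input of the language model trained for the second step. The concept vocabulary and helper names here are hypothetical.

```python
# Oracle construction for a captioning upper bound: concepts are taken
# from the ground-truth caption itself, so the conditional language model
# is trained as if the visual front-end were perfect. The vocabulary and
# function names below are illustrative assumptions.
concept_vocab = {"man", "dog", "run", "park", "ball", "throw"}

def oracle_concepts(reference_caption, vocab=concept_vocab):
    """Pretend the visual front-end is perfect: keep every word of the
    ground-truth caption that belongs to the concept vocabulary."""
    words = {w.strip(".,").lower() for w in reference_caption.split()}
    return sorted(words & vocab)

def make_upper_bound_example(reference_caption):
    """Training pair for the conditional language model: oracle concepts
    as conditioning input, the reference caption as the target."""
    return {"concepts": oracle_concepts(reference_caption),
            "target": reference_caption}

print(make_upper_bound_example("A man throws a ball to his dog in the park."))
# {'concepts': ['ball', 'dog', 'man', 'park'], 'target': 'A man throws ...'}
```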
international conference on learning representations | 2015
Adriana Romero; Nicolas Ballas; Samira Ebrahimi Kahou; Antoine Chassang; Carlo Gatta; Yoshua Bengio
international conference on learning representations | 2016
Nicolas Ballas; Li Yao; Chris Pal; Aaron C. Courville
international conference on learning representations | 2017
Tim Cooijmans; Nicolas Ballas; César Laurent; Caglar Gulcehre; Aaron C. Courville
international conference on learning representations | 2017
David Krueger; Tegan Maharaj; Janos Kramar; Mohammad Pezeshki; Nicolas Ballas; Nan Rosemary Ke; Anirudh Goyal; Yoshua Bengio; Aaron C. Courville; Chris Pal
international conference on machine learning | 2017
Devansh Arpit; Stanisław Jastrzębski; Nicolas Ballas; David Krueger; Emmanuel Bengio; Maxinder S. Kanwal; Tegan Maharaj; Asja Fischer; Aaron C. Courville; Yoshua Bengio; Simon Lacoste-Julien
arXiv: Machine Learning | 2015
Li Yao; Atousa Torabi; Kyunghyun Cho; Nicolas Ballas; Chris Pal; Hugo Larochelle; Aaron C. Courville