Publication


Featured research published by Iulian Vlad Serban.


Empirical Methods in Natural Language Processing | 2016

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Chia-Wei Liu; Ryan Lowe; Iulian Vlad Serban; Michael Noseworthy; Laurent Charlin; Joelle Pineau

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent work in response generation has adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
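
As a rough illustration of the analysis described above (not the paper's code), the sketch below scores generated responses against a single reference with sentence-level BLEU and checks how those scores correlate with human ratings; the toy data, and the use of NLTK and SciPy, are assumptions made for the example.

```python
# Minimal sketch: word-overlap scoring of generated responses against a single
# reference, followed by a correlation check against human ratings. The data
# below is invented; the paper's finding is that such correlations turn out weak.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr, spearmanr

references = [  # one reference (ground-truth) response per dialogue context
    "try restarting the network manager service".split(),
    "i loved that movie too".split(),
    "the patch should be in the next release".split(),
    "no idea sorry".split(),
]
candidates = [  # model-generated responses for the same contexts
    "have you tried rebooting the machine".split(),
    "me too it was great".split(),
    "it will be fixed in the next release".split(),
    "i am not sure".split(),
]
human_scores = [4.0, 5.0, 4.5, 3.0]  # hypothetical human quality ratings (1-5)

smooth = SmoothingFunction().method1
bleu = [sentence_bleu([ref], cand, smoothing_function=smooth)
        for ref, cand in zip(references, candidates)]

# With real data you would report these correlations and their p-values;
# the paper finds they are weak (Twitter) or absent (Ubuntu).
print("Pearson:", pearsonr(bleu, human_scores))
print("Spearman:", spearmanr(bleu, human_scores))
```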


Meeting of the Association for Computational Linguistics | 2016

Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus

Iulian Vlad Serban; Alberto García-Durán; Caglar Gulcehre; Sungjin Ahn; Sarath Chandar; Aaron C. Courville; Yoshua Bengio

Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to date, there are no large-scale question-answer corpora available. In this paper we present the 30M Factoid Question-Answer Corpus, an enormous question-answer pair corpus produced by applying a novel neural network architecture to the knowledge base Freebase to transduce facts into natural language questions. The produced question-answer pairs are evaluated both by human evaluators and using automatic evaluation metrics, including well-established machine translation and sentence similarity metrics. Across all evaluation criteria the question-generation model outperforms the competing template-based baseline. Furthermore, when presented to human evaluators, the generated questions appear comparable in quality to real human-generated questions.
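
To make the transduction task concrete, here is a toy template baseline in the spirit of the comparison system mentioned above: it maps a (subject, relationship, object) fact to a question whose answer is the object. The relationship names and templates are illustrative placeholders, not the paper's neural model or its data.

```python
# Toy illustration of fact-to-question transduction: a template baseline in the
# spirit of the paper's comparison system (the paper's own model is a neural
# encoder-decoder). Relationship names and templates here are made up.
from typing import Tuple

TEMPLATES = {
    "people.person.place_of_birth": "where was {subject} born?",
    "film.film.directed_by": "who directed {subject}?",
    "book.written_work.author": "who wrote {subject}?",
}

def fact_to_question(fact: Tuple[str, str, str]) -> Tuple[str, str]:
    """Map a (subject, relationship, object) fact to a (question, answer) pair."""
    subject, relationship, obj = fact
    template = TEMPLATES.get(
        relationship, "what is the {relation} of {subject}?")
    question = template.format(subject=subject,
                               relation=relationship.split(".")[-1])
    return question, obj

print(fact_to_question(("Nikola Tesla", "people.person.place_of_birth", "Smiljan")))
# -> ('where was Nikola Tesla born?', 'Smiljan')
```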


Meeting of the Association for Computational Linguistics | 2017

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Ryan Lowe; Michael Noseworthy; Iulian Vlad Serban; Nicolas Angelard-Gontier; Yoshua Bengio; Joelle Pineau

Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to assign human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.
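
The sketch below gives a loose, illustrative picture of the kind of learned scoring function described above: a model response is compared with both the dialogue context and the reference response through learned projection matrices, and the resulting score is meant to be regressed toward human ratings. The encoder and all parameters here are random stand-ins, not trained ADEM weights.

```python
# Sketch of an ADEM-style learned evaluation score. The real model encodes
# utterances with a hierarchical RNN and learns the projection matrices M and N
# from human-annotated response scores; everything below is a random placeholder.
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding size

def encode(text: str) -> np.ndarray:
    """Stand-in utterance encoder: a deterministic pseudo-random vector per string."""
    seed = sum(ord(ch) for ch in text)
    return np.random.default_rng(seed).standard_normal(dim)

M = rng.standard_normal((dim, dim))  # context-vs-model-response projection (learned)
N = rng.standard_normal((dim, dim))  # reference-vs-model-response projection (learned)
alpha, beta = 0.0, 1.0               # constants rescaling scores to the rating range

def adem_like_score(context: str, reference: str, model_response: str) -> float:
    c, r, r_hat = encode(context), encode(reference), encode(model_response)
    return float((c @ M @ r_hat + r @ N @ r_hat - alpha) / beta)

# In practice M, N (and the encoder) are trained to regress onto human scores, and
# quality is reported as utterance- and system-level correlation with those scores.
print(adem_like_score("how do i reset my password?",
                      "click the forgot-password link on the login page",
                      "use the forgot password link"))
```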


Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2016

On the Evaluation of Dialogue Systems with Next Utterance Classification

Ryan Lowe; Iulian Vlad Serban; Michael Noseworthy; Laurent Charlin; Joelle Pineau

An open challenge in constructing dialogue systems is developing methods for automatically learning dialogue strategies from large amounts of unlabelled data. Recent work has proposed Next-Utterance-Classification (NUC) as a surrogate task for building dialogue systems from text data. In this paper we investigate the performance of humans on this task to validate the relevance of NUC as a method of evaluation. Our results show three main findings: (1) humans are able to correctly classify responses at a rate much better than chance, thus confirming that the task is feasible, (2) human performance levels vary across task domains (we consider 3 datasets) and expertise levels (novices vs. experts), thus showing that a range of performance is possible on this type of task, (3) automated dialogue systems built using state-of-the-art machine learning methods perform similarly to the human novices, but worse than the experts, thus confirming the utility of this class of tasks for driving further research in automated dialogue systems.
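
For concreteness, here is a small sketch of NUC-style evaluation under illustrative assumptions: given a context and a candidate list containing the true next utterance plus sampled distractors, a system ranks the candidates and is scored by Recall@k. The word-overlap scorer below is only a toy stand-in for a real dialogue model.

```python
# Toy sketch of Next-Utterance-Classification evaluation via Recall@k.
# The scorer is a trivial word-overlap heuristic standing in for a dialogue model.
from typing import List

def relevance(context: str, candidate: str) -> int:
    """Toy relevance score: number of words shared by context and candidate."""
    return len(set(context.lower().split()) & set(candidate.lower().split()))

def recall_at_k(context: str, candidates: List[str], true_index: int, k: int) -> bool:
    """True if the correct next utterance is ranked among the top-k candidates."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: relevance(context, candidates[i]),
                    reverse=True)
    return true_index in ranked[:k]

context = "my wifi keeps dropping after the latest ubuntu update"
candidates = [
    "try rolling back the wifi driver that shipped with the update",  # true response
    "i had pasta for dinner",
    "what a lovely sunset",
]
print(recall_at_k(context, candidates, true_index=0, k=1))  # -> True
```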


Archive | 2018

The First Conversational Intelligence Challenge

Mikhail Burtsev; Varvara Logacheva; Valentin Malykh; Iulian Vlad Serban; Ryan Lowe; Shrimai Prabhumoye; Alan W. Black; Alexander I. Rudnicky; Yoshua Bengio

The first Conversational Intelligence Challenge was conducted over 2017, with finals at the NIPS conference. The challenge was aimed at evaluating the state of the art in non-goal-driven dialogue systems (chatbots) and at collecting a large dataset of human-to-machine and human-to-human conversations manually labelled for quality. We established a task for formal human evaluation of chatbots that allows testing a chatbot's capabilities in topic-oriented dialogue. Instead of traditional chit-chat, participating systems and humans were given a short text to discuss. Ten dialogue systems participated in the competition. The majority of them combined multiple conversational models, such as question answering and chit-chat systems, to make conversations more natural. The evaluation of chatbots was performed by human assessors. Almost 1,000 volunteers were recruited and over 4,000 dialogues were collected during the competition. The final dialogue-quality score for the best bot was 2.7, compared to 3.8 for humans. This demonstrates that current technology can support dialogue on a given topic, but with quality significantly lower than that of humans. To close this gap, we plan to continue the experiments by organising the next conversational intelligence competition. This future work will benefit from the data we collected and the dialogue systems that we made available after the competition presented in this paper.


Archive | 2018

Introduction to NIPS 2017 Competition Track

Sergio Escalera; Markus Weimer; Mikhail Burtsev; Valentin Malykh; Varvara Logacheva; Ryan Lowe; Iulian Vlad Serban; Yoshua Bengio; Alexander I. Rudnicky; Alan W. Black; Shrimai Prabhumoye; Łukasz Kidziński; Sharada Prasanna Mohanty; Carmichael F. Ong; Jennifer L. Hicks; Sergey Levine; Marcel Salathé; Scott L. Delp; Iker Huerga; Alexander Grigorenko; Leifur Thorbergsson; Anasuya Das; Kyla Nemitz; Jenna Sandker; Stephen King; Alexander S. Ecker; Leon A. Gatys; Matthias Bethge; Jordan L. Boyd-Graber; Shi Feng

Competitions have become a popular tool in the data science community to solve hard problems, assess the state of the art and spur new research directions. Companies like Kaggle and open source platforms like Codalab connect people with data and a data science problem to those with the skills and means to solve it. Hence, the question arises: What, if anything, could NIPS add to this rich ecosystem?


National Conference on Artificial Intelligence | 2016

Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models

Iulian Vlad Serban; Alessandro Sordoni; Yoshua Bengio; Aaron C. Courville; Joelle Pineau


Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2015

The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems

Ryan Lowe; Nissan Pow; Iulian Vlad Serban; Joelle Pineau


arXiv: Computation and Language | 2016

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues

Iulian Vlad Serban; Alessandro Sordoni; Ryan Lowe; Laurent Charlin; Joelle Pineau; Aaron C. Courville; Yoshua Bengio


arXiv: Computation and Language | 2015

A Survey of Available Corpora for Building Data-Driven Dialogue Systems

Iulian Vlad Serban; Ryan Lowe; Peter Henderson; Laurent Charlin; Joelle Pineau

Collaboration


Dive into Iulian Vlad Serban's collaborations.

Top Co-Authors

Yoshua Bengio
Université de Montréal

Sarath Chandar
Université de Montréal

Jose Sotelo
Université de Montréal