Marc Lanctot
Maastricht University
Publications
Featured research published by Marc Lanctot.
Nature | 2016
David Silver; Aja Huang; Chris J. Maddison; Arthur Guez; Laurent Sifre; George van den Driessche; Julian Schrittwieser; Ioannis Antonoglou; Veda Panneershelvam; Marc Lanctot; Sander Dieleman; Dominik Grewe; John Nham; Nal Kalchbrenner; Ilya Sutskever; Timothy P. Lillicrap; Madeleine Leach; Koray Kavukcuoglu; Thore Graepel; Demis Hassabis
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
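The abstract's central idea is pairing a policy network's move priors with value estimates inside the tree search. Below is a minimal sketch of a PUCT-style selection rule in that spirit; the Node fields, the c_puct constant, and the assumption that priors come from some policy network are illustrative, not the paper's implementation.

```python
# Sketch of a PUCT-style selection rule mixing a policy prior with value
# estimates. The prior is assumed to come from some policy network and the
# value_sum from value evaluations and/or rollouts; all names are illustrative.
import math

class Node:
    def __init__(self, prior):
        self.prior = prior       # P(s, a): prior move probability
        self.visits = 0          # N(s, a)
        self.value_sum = 0.0     # accumulated value estimates
        self.children = {}       # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    """Maximize Q(s,a) + c_puct * P(s,a) * sqrt(parent visits) / (1 + N(s,a))."""
    total_visits = sum(child.visits for child in node.children.values())
    def puct(child):
        bonus = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return child.q() + bonus
    move, child = max(node.children.items(), key=lambda kv: puct(kv[1]))
    return move, child
```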
Neural Information Processing Systems | 2009
Marc Lanctot; Kevin Waugh; Martin Zinkevich; Michael H. Bowling
Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent, sample-based CFR algorithms called Monte Carlo counterfactual regret minimization (MCCFR), of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR in expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new, tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR's bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games.
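At the heart of both CFR and its sampled MCCFR variants is the regret-matching update applied at each information set. The following is a minimal sketch of that update; the InfoSet class and the way counterfactual action values and reach weights are supplied are assumptions for illustration, not the paper's code.

```python
# Sketch of the regret-matching bookkeeping kept at each information set by
# CFR-style algorithms, including the sampled MCCFR variants. How the
# per-action counterfactual values are obtained is left abstract.
class InfoSet:
    def __init__(self, num_actions):
        self.regret_sum = [0.0] * num_actions
        self.strategy_sum = [0.0] * num_actions

    def current_strategy(self):
        """Regret matching: mix in proportion to positive cumulative regret."""
        positives = [max(r, 0.0) for r in self.regret_sum]
        total = sum(positives)
        n = len(positives)
        return [p / total for p in positives] if total > 0 else [1.0 / n] * n

    def update(self, action_values, reach_weight=1.0):
        """Accumulate regret of each action against the current strategy's value."""
        strategy = self.current_strategy()
        node_value = sum(p * v for p, v in zip(strategy, action_values))
        for a, v in enumerate(action_values):
            self.regret_sum[a] += reach_weight * (v - node_value)
        for a, p in enumerate(strategy):
            self.strategy_sum[a] += reach_weight * p

    def average_strategy(self):
        """The average strategy is what converges toward an equilibrium."""
        total = sum(self.strategy_sum)
        n = len(self.strategy_sum)
        return [s / total for s in self.strategy_sum] if total > 0 else [1.0 / n] * n
```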
Computational Intelligence and Games | 2007
Frantisek Sailer; Michael Buro; Marc Lanctot
Adversarial planning in highly complex decision domains, such as modern video games, has not yet received much attention from AI researchers. In this paper, we present a planning framework that uses strategy simulation in conjunction with Nash-equilibrium strategy approximation. We apply this framework to an army deployment problem in a real-time strategy game setting and present experimental results that indicate a performance gain over the scripted strategies that the system is built on. This technique provides an automated way of increasing the decision quality of scripted AI systems and is therefore ideally suited for video games and combat simulators.
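One way to read the framework: simulate every scripted strategy against every other to fill a payoff matrix, then approximate a Nash equilibrium of that matrix to decide how to mix the scripts. The sketch below approximates such an equilibrium with fictitious play, assuming the matrix has already been filled by simulation; it is an illustration of the idea, not the paper's planner.

```python
# Fictitious play over a zero-sum payoff matrix (row player's payoffs),
# assumed to have been filled by simulating every script against every other.
def fictitious_play(payoff, iterations=10000):
    rows, cols = len(payoff), len(payoff[0])
    row_counts, col_counts = [0] * rows, [0] * cols
    row_counts[0] = col_counts[0] = 1  # arbitrary starting pure strategies
    for _ in range(iterations):
        # Each player best-responds to the other's empirical mixture so far.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(cols)) for i in range(rows)]
        col_values = [sum(payoff[i][j] * row_counts[i] for i in range(rows)) for j in range(cols)]
        row_counts[max(range(rows), key=lambda i: row_values[i])] += 1
        col_counts[min(range(cols), key=lambda j: col_values[j])] += 1
    return ([c / sum(row_counts) for c in row_counts],
            [c / sum(col_counts) for c in col_counts])

# Matching pennies: the empirical mixtures approach the (0.5, 0.5) equilibrium.
print(fictitious_play([[1, -1], [-1, 1]]))
```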
Journal of Artificial Intelligence Research | 2011
Marc J. V. Ponsen; Steven de Jong; Marc Lanctot
This article discusses two contributions to decision-making in complex partially observable stochastic games. First, we apply two state-of-the-art search techniques that use Monte-Carlo sampling to the task of approximating a Nash-Equilibrium (NE) in such games, namely Monte-Carlo Tree Search (MCTS) and Monte-Carlo Counterfactual Regret Minimization (MCCFR). MCTS has been proven to approximate a NE in perfect-information games. We show that the algorithm quickly finds a reasonably strong strategy (but not a NE) in a complex imperfect information game, i.e. Poker. MCCFR on the other hand has theoretical NE convergence guarantees in such a game. We apply MCCFR for the first time in Poker. Based on our experiments, we may conclude that MCTS is a valid approach if one wants to learn reasonably strong strategies fast, whereas MCCFR is the better choice if the quality of the strategy is most important. Our second contribution relates to the observation that a NE is not a best response against players that are not playing a NE. We present Monte-Carlo Restricted Nash Response (MCRNR), a sample-based algorithm for the computation of restricted Nash strategies. These are robust bestresponse strategies that (1) exploit non-NE opponents more than playing a NE and (2) are not (overly) exploitable by other strategies. We combine the advantages of two state-of-the-art algorithms, i.e. MCCFR and Restricted Nash Response (RNR). MCRNR samples only relevant parts of the game tree. We show that MCRNR learns quicker than standard RNR in smaller games. Also we show in Poker that MCRNR learns robust best-response strategies fast, and that these strategies exploit opponents more than playing a NE does.
Genetic and Evolutionary Computation Conference | 2016
Chrisantha Fernando; Dylan Banarse; Malcolm Reynolds; Frederic Besse; David Pfau; Max Jaderberg; Marc Lanctot; Daan Wierstra
In this work we introduce a differentiable version of the Compositional Pattern Producing Network, called the DPPN. Unlike a standard CPPN, the topology of a DPPN is evolved but the weights are learned. A Lamarckian algorithm, that combines evolution and learning, produces DPPNs to reconstruct an image. Our main result is that DPPNs can be evolved/trained to compress the weights of a denoising autoencoder from 157684 to roughly 200 parameters, while achieving a reconstruction accuracy comparable to a fully connected network with more than two orders of magnitude more parameters. The regularization ability of the DPPN allows it to rediscover (approximate) convolutional network architectures embedded within a fully connected architecture. Such convolutional architectures are the current state of the art for many computer vision applications, so it is satisfying that DPPNs are capable of discovering this structure rather than having to build it in by design. DPPNs exhibit better generalization when tested on the Omniglot dataset after being trained on MNIST, than directly encoded fully connected autoencoders. DPPNs are therefore a new framework for integrating learning and evolution.
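A generic Lamarckian loop of the kind described here alternates a learning step, which improves each candidate and writes the improvement back into its genome, with ordinary selection and mutation. The sketch below shows only that pattern; the fitness, learn, and mutate functions are placeholders, and nothing here is specific to DPPNs.

```python
# Generic Lamarckian loop: each genome is improved by a learning step and the
# learned parameters are inherited before selection and mutation. fitness and
# learn are placeholder callables supplied by the caller.
import random

def mutate(genome, sigma=0.1):
    return [g + random.gauss(0.0, sigma) for g in genome]

def lamarckian_evolution(population, fitness, learn, generations=50):
    for _ in range(generations):
        population = [learn(g) for g in population]  # Lamarckian: improvements are kept
        population.sort(key=fitness, reverse=True)
        parents = population[: max(1, len(population) // 2)]
        children = [mutate(random.choice(parents)) for _ in range(len(population) - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```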
Computational Intelligence and Games | 2014
Marc Lanctot; Mark H. M. Winands; Tom Pepels; Nathan R. Sturtevant
Monte Carlo Tree Search (MCTS) has improved the performance of game engines in domains such as Go, Hex, and general game playing. MCTS has been shown to outperform classic αβ search in games where good heuristic evaluations are difficult to obtain. In recent years, incorporating ideas from traditional minimax search into MCTS has been shown to be advantageous in some domains, such as Lines of Action, Amazons, and Breakthrough. In this paper, we propose a new way to use heuristic evaluations to guide the MCTS search by storing the two sources of information, estimated win rates and heuristic evaluations, separately. Rather than using the heuristic evaluations to replace the playouts, our technique backs them up implicitly during the MCTS simulations. These minimax values are then used to guide future simulations. We show that using implicit minimax backups leads to stronger play performance in Kalah, Breakthrough, and Lines of Action.
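A rough sketch of the idea: each node keeps both a simulation-based win rate and a minimax value backed up from heuristic evaluations, and selection mixes the two. The alpha weight, field names, and negamax sign handling below are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of implicit minimax backups: each node stores a win rate from
# play-outs and a heuristic value tau that is minimaxed over its children;
# selection scores children by a weighted mix of the two.
import math

class Node:
    def __init__(self, heuristic_value):
        self.visits = 0
        self.win_sum = 0.0
        self.tau = heuristic_value   # implicit minimax value of heuristic evaluations
        self.children = []

    def q(self):
        return self.win_sum / self.visits if self.visits else 0.0

def select(node, alpha=0.4, c=1.0):
    """UCT-style selection on (1 - alpha) * win rate + alpha * minimax value."""
    def score(child):
        exploit = (1 - alpha) * child.q() + alpha * child.tau
        explore = c * math.sqrt(math.log(node.visits + 1) / (child.visits + 1))
        return exploit + explore
    return max(node.children, key=score)

def backup(path, reward):
    """Back up the play-out result and re-minimax tau along the visited path."""
    for node in reversed(path):          # leaf to root
        node.visits += 1
        node.win_sum += reward
        if node.children:
            node.tau = max(-child.tau for child in node.children)  # negamax backup
        reward = -reward                 # switch player perspective each ply
```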
PhD Thesis | 2013
Michael H. Bowling; Marc Lanctot
In this thesis, we investigate the problem of decision-making in large two-player zero-sum games using Monte Carlo sampling and regret minimization methods. We demonstrate four major contributions. The first is Monte Carlo Counterfactual Regret Minimization (MC-CFR): a generic family of sample-based algorithms that compute near-optimal equilibrium strategies. Secondly, we develop a theory for applying counterfactual regret minimization to a generic subset of imperfect recall games as well as a lossy abstraction mechanism for reducing the size of very large games. Thirdly, we describe Monte Carlo Minimax Search (MCMS): an adversarial search algorithm based on *-Minimax that uses sparse sampling. We then present variance reduction techniques that can be used in these settings, with a focused application to Monte Carlo Tree Search (MCTS). We thoroughly evaluate our algorithms in practice using several different domains and sampling strategies.
Computer Games | 2013
Marc Lanctot; Viliam Lisý; Mark H. M. Winands
Monte Carlo Tree Search (MCTS) has become a widely popular sample-based search algorithm for two-player games with perfect information. When actions are chosen simultaneously, players may need to mix between their strategies. In this paper, we discuss the adaptation of MCTS to simultaneous move games. We introduce a new algorithm, Online Outcome Sampling (OOS), that approaches a Nash equilibrium strategy over time. We compare both head-to-head performance and exploitability of several MCTS variants in Goofspiel. We show that regret matching and OOS perform best and that all variants produce less exploitable strategies than UCT.
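At a simultaneous-move node, each player can keep its own regret table and mix over actions with regret matching; a sampled joint action and play-out return then drive an importance-weighted regret update. The sketch below shows one plausible version of this selection and update; the update form and class layout are assumptions for illustration, not the OOS implementation.

```python
# Sketch of regret-matching selection at a simultaneous-move node: each player
# mixes independently over its own actions; after a play-out, regrets are
# updated with an importance-weighted estimate of each action's value.
import random

def regret_matching(regrets):
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    n = len(regrets)
    return [p / total for p in positives] if total > 0 else [1.0 / n] * n

class SimMoveNode:
    def __init__(self, n_actions_p1, n_actions_p2):
        self.regrets = [[0.0] * n_actions_p1, [0.0] * n_actions_p2]

    def select_joint_action(self):
        """Sample one action per player from its regret-matching distribution."""
        joint = []
        for player in (0, 1):
            probs = regret_matching(self.regrets[player])
            joint.append(random.choices(range(len(probs)), weights=probs)[0])
        return tuple(joint)

    def update(self, joint, returns):
        """returns[p] is the sampled play-out return to player p (zero-sum)."""
        for player in (0, 1):
            probs = regret_matching(self.regrets[player])
            chosen, u = joint[player], returns[player]
            for a in range(len(probs)):
                sampled_value = u / probs[chosen] if a == chosen else 0.0
                self.regrets[player][a] += sampled_value - u
```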
Computer Games | 2014
Tom Pepels; Tristan Cazenave; Mark H. M. Winands; Marc Lanctot
Regret minimization is important in both the Multi-Armed Bandit problem and Monte-Carlo Tree Search (MCTS). Recently, simple regret, i.e., the regret of not recommending the best action, has been proposed as an alternative to cumulative regret in MCTS, i.e., regret accumulated over time. Each type of regret is appropriate in different contexts. Although the majority of MCTS research applies the UCT selection policy for minimizing cumulative regret in the tree, this paper introduces a new MCTS variant, Hybrid MCTS (H-MCTS), which minimizes both types of regret in different parts of the tree. H-MCTS uses SHOT, a recursive version of Sequential Halving, to minimize simple regret near the root, and UCT to minimize cumulative regret when descending further down the tree. We discuss the motivation for this new search technique, and show the performance of H-MCTS in six distinct two-player games: Amazons, AtariGo, Ataxx, Breakthrough, NoGo, and Pentalath.
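Near the root, H-MCTS relies on Sequential Halving, which spreads a simulation budget over successively halved sets of candidate moves and keeps the empirically best one. Below is a minimal sketch of that procedure, assuming a simulate(arm) callable that runs one randomized play-out of the given move and returns its reward.

```python
# Sketch of Sequential Halving, the simple-regret minimizer H-MCTS applies
# near the root; simulate(arm) is an assumed callable supplied by the caller.
import math

def sequential_halving(arms, budget, simulate):
    """Return the arm with the best empirical mean after log2(|arms|) halving rounds."""
    survivors = list(arms)
    totals = {a: 0.0 for a in survivors}
    counts = {a: 0 for a in survivors}
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    for _ in range(rounds):
        per_arm = max(1, budget // (len(survivors) * rounds))
        for arm in survivors:
            for _ in range(per_arm):
                totals[arm] += simulate(arm)
                counts[arm] += 1
        survivors.sort(key=lambda a: totals[a] / counts[a], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
        if len(survivors) == 1:
            break
    return survivors[0]
```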
European Conference on Artificial Intelligence | 2014
Tom Pepels; Mandy J. W. Tak; Marc Lanctot; Mark H. M. Winands
Monte-Carlo Tree Search is a best-first search technique based on simulations to sample the state space of a decision-making problem. In games, positions are evaluated based on estimates obtained from rewards of numerous randomized play-outs. Generally, rewards from play-outs are discrete values representing the outcome of the game (loss, draw, or win), e.g., r ∈ {-1, 0, 1}, which are backpropagated from expanded leaf nodes to the root node. However, a play-out may provide additional information. In this paper, we introduce new measures for assessing the a posteriori quality of a simulation. We show that altering the rewards of play-outs based on their assessed quality improves results in six distinct two-player games and in the General Game Playing agent CADIAPLAYER. We propose two specific enhancements, the Relative Bonus and Qualitative Bonus. Both are used as control variates, a variance reduction method for statistical simulation. The Relative Bonus is based on the number of moves made during a simulation, and the Qualitative Bonus relies on a domain-dependent assessment of the game's terminal state. We show that the proposed enhancements, both separately and combined, lead to significant performance increases in the domains discussed.
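The control-variate idea can be illustrated in a few lines: a quantity observed during the play-out (its length, say) that correlates with the outcome is used to adjust the backed-up reward, lowering variance without much bias. The running-statistics bookkeeping and the coefficient below are illustrative choices, not the paper's Relative or Qualitative Bonus formulas.

```python
# Sketch of a control-variate reward adjustment for MCTS play-outs: the
# covariate (e.g., play-out length) is centred by its running mean and
# subtracted from the reward with a small coefficient.
class ControlVariate:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def adjust(self, reward, covariate, coeff=0.05):
        """Return reward - coeff * (covariate - running mean of the covariate)."""
        self.n += 1
        self.mean += (covariate - self.mean) / self.n
        return reward - coeff * (covariate - self.mean)

# Example: a win (reward 1.0) from a play-out of 42 moves, adjusted relative
# to the average play-out length seen so far.
cv = ControlVariate()
adjusted = cv.adjust(reward=1.0, covariate=42)
```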