Gilles Stoltz
École Normale Supérieure
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Gilles Stoltz.
algorithmic learning theory | 2009
Sébastien Bubeck; Rémi Munos; Gilles Stoltz
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an online exploration of the arms. The strategies are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time.We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. The main result is that the required exploration-exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.
Annals of Statistics | 2013
Olivier Cappé; Aurélien Garivier; Odalric-Ambrym Maillard; Rémi Munos; Gilles Stoltz
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: The kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.
IEEE Transactions on Information Theory | 2005
Nicolò Cesa-Bianchi; Gábor Lugosi; Gilles Stoltz
We investigate label efficient prediction, a variant, proposed by Helmbold and Panizza, of the problem of prediction with expert advice. In this variant, the forecaster, after guessing the next element of the sequence to be predicted, does not observe its true value unless he asks for it, which he cannot do too often. We determine matching upper and lower bounds for the best possible excess prediction error, with respect to the best possible constant predictor, when the number of allowed queries is fixed. We also prove that Hannan consistency, a fundamental property in game-theoretic prediction models, can be achieved by a forecaster issuing a number of queries growing to infinity at a rate just slightly faster than logarithmic in the number of prediction rounds.
Theoretical Computer Science | 2011
Sébastien Bubeck; Rémi Munos; Gilles Stoltz
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
Games and Economic Behavior | 2007
Gilles Stoltz; Gábor Lugosi
Hart and Schmeidlers extension of correlated equilibrium to games with infinite sets of strategies is studied. General properties of the set of correlated equilibria are described. It is shown that, just like for finite games, if all players play according to an appropriate regret-minimizing strategy then the empirical frequencies of play converge to the set of correlated equilibria whenever the strategy sets are convex and compact.
Mathematics of Operations Research | 2006
Nicolò Cesa-Bianchi; Gábor Lugosi; Gilles Stoltz
We consider repeated games in which the player, instead of observing the action chosen by the opponent in each game round, receives a feedback generated by the combined choice of the two players. We study Hannan consistent players for these games, that is, randomized playing strategies whose per-round regret vanishes with probability one as the number of game rounds goes to infinity. We prove a general lower bound for the convergence rate of the regret, and exhibit a specific strategy that attains this rate for any game for which a Hannan consistent player exists.
Machine Learning | 2005
Gilles Stoltz; Gábor Lugosi
This paper extends the game-theoretic notion of internal regret to the case of on-line potfolio selection problems. New sequential investment strategies are designed to minimize the cumulative internal regret for all possible market behaviors. Some of the introduced strategies, apart from achieving a small internal regret, achieve an accumulated wealth almost as large as that of the best constantly rebalanced portfolio. It is argued that the low-internal-regret property is related to stability and experiments on real stock exchange data demonstrate that the new strategies achieve better returns compared to some known algorithms.
Mathematics of Operations Research | 2010
Shie Mannor; Gilles Stoltz
We provide yet another proof of the existence of calibrated forecasters; it has two merits. First, it is valid for an arbitrary finite number of outcomes. Second, it is short and simple and it follows from a direct application of Blackwells approachability theorem to a carefully chosen vector-valued payoff function and convex target set. Our proof captures the essence of existing proofs based on approachability (e.g., the proof by Foster [Foster, D. 1999. A proof of calibration via Blackwells approachability theorem. Games Econom. Behav.29 73--78] in the case of binary outcomes) and highlights the intrinsic connection between approachability and calibration.
algorithmic learning theory | 2011
Sébastien Bubeck; Gilles Stoltz; Jia Yuan Yu
We consider the setting of stochastic bandit problems with a continuum of arms indexed by [0, 1]d. We first point out that the strategies considered so far in the literature only provided theoretical guarantees of the form: given some tuning parameters, the regret is small with respect to a class of environments that depends on these parameters. This is however not the right perspective, as it is the strategy that should adapt to the specific bandit environment at hand, and not the other way round. Put differently, an adaptation issue is raised. We solve it for the special case of environments whose mean-payoff functions are globally Lipschitz. More precisely, we show that the minimax optimal orders of magnitude Ld/(d+2) T(d+1)/(d+2) of the regret bound over T time instances against an environment whose mean-payoff function f is Lipschitz with constant L can be achieved without knowing L or T in advance. This is in contrast to all previously known strategies, which require to some extent the knowledge of L to achieve this performance guarantee.
Mathematics of Operations Research | 2008
Gábor Lugosi; Shie Mannor; Gilles Stoltz
We propose simple randomized strategies for sequential prediction under imperfect monitoring, that is, when the forecaster does not have access to the past outcomes but rather to a feedback signal. The proposed strategies are consistent in the sense that they achieve, asymptotically, the best possible average reward. It was Rustichini (1999) who first proved the existence of such consistent predictors. The forecasters presented here offer the first constructive proof of consistency. Moreover, the proposed algorithms are computationally efficient. We also establish upper bounds for the rates of convergence. In the case of deterministic feedback, these rates are optimal up to logarithmic terms.