Publication


Featured research published by Saba Q. Yahyaa.


international conference on agents and artificial intelligence | 2014

Knowledge Gradient for Multi-objective Multi-armed Bandit Algorithms

Saba Q. Yahyaa; Madalina M. Drugan; Bernard Manderick

We extend the knowledge gradient (KG) policy to multi-objective multi-armed bandit problems in order to explore the Pareto optimal arms efficiently. We consider two ways to order the mean vectors: the Pareto partial order and scalarization functions. Pareto-KG finds the optimal arms using Pareto search, while scalarized-KG transforms the multi-objective arms into single-objective arms before finding the optimal arms. To measure the performance of the proposed algorithms, we propose three regret measures. We compare the performance of the knowledge gradient policy with UCB1 on a multi-objective multi-armed bandit problem, where KG outperforms UCB1.
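
The Pareto search step mentioned above relies on the Pareto dominance relation between empirical mean vectors. Below is a minimal Python sketch of that relation and of extracting the Pareto optimal arms; the function names and the NumPy-based layout are illustrative assumptions, not the paper's code.

import numpy as np

def dominates(u, v):
    # u Pareto-dominates v: at least as good in every objective, strictly better in one.
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(means):
    # Indices of arms whose mean vectors are not dominated by any other arm.
    return [i for i, m in enumerate(means)
            if not any(dominates(o, m) for j, o in enumerate(means) if j != i)]

# Example: arm 1 is dominated by arm 0, so the Pareto front is {0, 2}.
means = [[0.9, 0.5], [0.8, 0.4], [0.3, 0.9]]
print(pareto_front(means))  # -> [0, 2]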


international symposium on neural networks | 2014

The scalarized multi-objective multi-armed bandit problem: An empirical study of its exploration vs. exploitation tradeoff

Saba Q. Yahyaa; Madalina M. Drugan; Bernard Manderick

The multi-armed bandit (MAB) problem is the simplest sequential decision process with stochastic rewards, where an agent chooses repeatedly from different arms to identify as soon as possible the optimal arm, i.e. the one with the highest mean reward. Both the knowledge gradient (KG) policy and the upper confidence bound (UCB) policy work well in practice for the MAB problem because they strike a good balance between exploitation and exploration when choosing arms. In the multi-objective MAB (or MOMAB) problem, arms generate a vector of rewards, one per objective, instead of a single scalar reward. In this paper, we extend the KG policy to multi-objective problems using scalarization functions that transform reward vectors into a single scalar reward. We consider different scalarization functions and call the corresponding class of algorithms scalarized KG. We compare the resulting algorithms with the corresponding variants of the multi-objective UCB1 policy (MO-UCB1) on a number of MOMAB problems where the reward vectors are drawn from a multivariate normal distribution. We compare experimentally the exploration versus exploitation trade-off and conclude that scalarized KG outperforms MO-UCB1 on these test problems.
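
As a rough illustration of the scalarization step described above, the sketch below shows the linear (weighted-sum) form and one common non-linear (Chebyshev-style) form that map a reward vector to a scalar; the exact functions, names and reference-point handling used in the paper may differ.

import numpy as np

def linear_scalarize(reward_vec, weights):
    # Weighted sum of the objectives; weights are non-negative and sum to 1.
    return float(np.dot(weights, reward_vec))

def chebyshev_scalarize(reward_vec, weights, reference):
    # Chebyshev-style scalarization with respect to a reference point.
    return float(np.min(weights * (np.asarray(reward_vec) - reference)))

# The same reward vector ranks differently under different weight vectors,
# which is why the choice of weights matters.
r = np.array([0.7, 0.2])
print(linear_scalarize(r, np.array([0.5, 0.5])))  # ~0.45
print(linear_scalarize(r, np.array([0.1, 0.9])))  # ~0.25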


international conference on agents and artificial intelligence | 2015

Thompson Sampling in the Adaptive Linear Scalarized Multi Objective Multi Armed Bandit

Saba Q. Yahyaa; Madalina M. Drugan; Bernard Manderick

In the stochastic multi-objective multi-armed bandit (MOMAB) problem, arms generate a vector of stochastic normal rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) under the Pareto dominance relation, and the goal of the agent is to find this Pareto front. To find the optimal arms, the agent can use a linear scalarization function that transforms the multi-objective problem into a single-objective one by summing the weighted objectives. Selecting the weights is crucial, since different weights can result in selecting different optimal arms from the Pareto front. Usually a predefined weight set is used, which can be computationally inefficient when different weights optimize the same Pareto optimal arm while other arms in the Pareto front are never identified. In this paper, we propose a number of techniques that adapt the weights on the fly in order to improve the performance of the scalarized MOMAB. We use genetic and adaptive scalarization functions from multi-objective optimization to generate new weights, and we propose to use a Thompson sampling policy to select more frequently the weights that identify new arms on the Pareto front. We show experimentally that Thompson sampling improves the performance of the genetic and adaptive scalarization functions, and that all the proposed techniques improve on the standard scalarized MOMAB with a fixed set of weights.
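
A minimal sketch of the Thompson-sampling-over-weights idea described above: each candidate weight vector keeps a Beta posterior on the event "using this weight identified a new Pareto optimal arm", and the weight with the highest posterior sample is used next. The Beta/Bernoulli model, class name and update rule are assumptions made for illustration.

import numpy as np

class WeightSelector:
    # Thompson sampling over a pool of scalarization weight vectors.
    def __init__(self, weight_pool, seed=0):
        self.weights = weight_pool
        self.success = np.zeros(len(weight_pool))
        self.failure = np.zeros(len(weight_pool))
        self.rng = np.random.default_rng(seed)

    def select(self):
        # Sample one success probability per weight vector and pick the largest.
        samples = self.rng.beta(self.success + 1.0, self.failure + 1.0)
        return int(np.argmax(samples))

    def update(self, idx, found_new_pareto_arm):
        # Reward weights that revealed a new arm on the estimated Pareto front.
        if found_new_pareto_arm:
            self.success[idx] += 1.0
        else:
            self.failure[idx] += 1.0

# Usage: pick a weight, run the scalarized bandit with it for a while, then
# report whether the greedy arm it found was new on the estimated Pareto front.
pool = [np.array([0.1, 0.9]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
selector = WeightSelector(pool)
i = selector.select()
selector.update(i, found_new_pareto_arm=True)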


ieee symposium on adaptive dynamic programming and reinforcement learning | 2014

Annealing-Pareto multi-objective multi-armed bandit algorithm

Saba Q. Yahyaa; Madalina M. Drugan; Bernard Manderick

In the stochastic multi-objective multi-armed bandit (MOMAB) problem, arms generate a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) under the Pareto dominance relation, and there is a trade-off between finding the optimal arm set (exploration) and selecting the optimal arms fairly or evenly (exploitation). To trade off between exploration and exploitation, either Pareto knowledge gradient (Pareto-KG for short) or Pareto upper confidence bound (Pareto-UCB1 for short) can be used; they combine the KG policy and the UCB1 policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose the annealing-Pareto algorithm, which trades off between exploration and exploitation by using a decaying parameter ε_t in combination with the Pareto dominance relation: the decaying parameter is used to explore the Pareto optimal arms, and the Pareto dominance relation is used to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling and annealing-Pareto on multi-objective Bernoulli distribution problems and conclude that annealing-Pareto is the best performing algorithm.
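
A minimal sketch of the annealing idea: with a probability ε_t that decays over time, the agent explores a uniformly random arm; otherwise it plays uniformly among the arms currently estimated to be Pareto optimal. The exponential decay schedule and the helper names below are assumptions.

import numpy as np

rng = np.random.default_rng(1)

def pareto_front(means):
    # Arms whose estimated mean vectors are not Pareto-dominated by any other arm.
    means = np.asarray(means)
    dominated = lambda m: any(np.all(o >= m) and np.any(o > m) for o in means)
    return [i for i, m in enumerate(means) if not dominated(m)]

def annealing_pareto_choice(means, t, eps0=1.0, decay=0.01):
    eps_t = eps0 * np.exp(-decay * t)                # assumed decay schedule for ε_t
    if rng.random() < eps_t:
        return int(rng.integers(len(means)))         # explore: any arm
    return int(rng.choice(pareto_front(means)))      # exploit: play Pareto arms evenly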


trans. computational collective intelligence | 2015

Scalarized and Pareto knowledge gradient for multi-objective multi-armed bandits

Saba Q. Yahyaa; Madalina M. Drugan; Bernard Manderick

A multi-objective multi-armed bandit (MOMAB) problem is a sequential decision process with stochastic reward vectors. We extend the knowledge gradient (KG) policy to the MOMAB problem and propose the Pareto-KG and scalarized-KG algorithms. Pareto-KG trades off between exploration and exploitation by combining the KG policy with the Pareto dominance relation. Scalarized-KG makes use of a linear or non-linear scalarization function to convert the MOMAB problem into a single-objective multi-armed bandit problem and uses the KG policy to trade off between exploration and exploitation. To measure the performance of the proposed algorithms, we introduce three regret measures. We compare empirically the performance of the KG policy with the UCB1 policy on a test suite of MOMAB problems with normal distributions. Pareto-KG and scalarized-KG are the algorithms with the best empirical performance.
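
For a single-objective Gaussian bandit, the KG score that both Pareto-KG and scalarized-KG build on can be read as the current posterior mean plus the expected value of the information gained from one more pull. The sketch below follows the standard Gaussian KG formula; the exact variant, parameter names and the use of SciPy are assumptions.

import numpy as np
from scipy.stats import norm

def kg_index(mu, sigma, noise_std, horizon_left):
    # mu, sigma: posterior mean and standard deviation of each arm's expected reward.
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    # Reduction of the posterior std after one additional observation of each arm.
    sigma_tilde = sigma**2 / np.sqrt(sigma**2 + noise_std**2)
    best_other = np.array([np.max(np.delete(mu, i)) for i in range(len(mu))])
    z = -np.abs(mu - best_other) / sigma_tilde
    kg_bonus = sigma_tilde * (z * norm.cdf(z) + norm.pdf(z))
    # Online KG: current mean plus the remaining horizon times the value of information.
    return mu + horizon_left * kg_bonus

# The arm with the highest index is pulled next.
print(np.argmax(kg_index([0.40, 0.50, 0.45], [0.3, 0.1, 0.2], 1.0, 100)))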


ieee symposium series on computational intelligence | 2015

Correlated Gaussian Multi-Objective Multi-Armed Bandit Across Arms Algorithm

Saba Q. Yahyaa; Madalina M. Drugan

The stochastic multi-objective multi-armed bandit (MOMAB) problem is a stochastic multi-armed bandit problem where each arm generates a vector of rewards instead of a single scalar reward. The goal in the MOMAB is to minimize the regret of playing suboptimal arms while playing the Pareto optimal arms fairly. In this paper, we consider Gaussian correlation across arms in the MOMAB, meaning that the generated reward vector of an arm gives us information not only about that arm itself but also about all the other arms. We call this framework the correlated-MOMAB problem. We extend the Gittins index policy to the correlated MOMAB, because the Gittins index has been used before to model the correlation between arms. We empirically compare the Gittins index policy with the multi-objective upper confidence bound policy on a test suite of correlated-MOMAB problems and conclude that the performance of these policies depends on the number of arms and objectives.
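
The key point of the correlated setting is that one pull is informative about every arm. A compact way to see this, sketched per objective below, is a jointly Gaussian belief over all arm means with a non-diagonal covariance matrix: observing one arm's reward shifts the posterior means of the correlated arms too. This is the standard rank-one Gaussian (Kalman-style) update, not the Gittins index computation itself, and all names and numbers are illustrative.

import numpy as np

def correlated_update(mu, cov, arm, reward, noise_var):
    # Bayesian update of a jointly Gaussian belief over all arm means
    # after one noisy reward from a single arm.
    mu, cov = np.asarray(mu, float).copy(), np.asarray(cov, float).copy()
    gain = cov[:, arm] / (cov[arm, arm] + noise_var)   # Kalman-style gain vector
    mu += gain * (reward - mu[arm])                    # every correlated arm moves
    cov -= np.outer(gain, cov[arm, :])                 # uncertainty shrinks for all arms
    return mu, cov

# Two positively correlated arms: pulling arm 0 also updates the belief on arm 1.
mu0 = np.array([0.0, 0.0])
cov0 = np.array([[1.0, 0.8], [0.8, 1.0]])
mu1, cov1 = correlated_update(mu0, cov0, arm=0, reward=1.2, noise_var=0.5)
print(mu1)   # both entries move towards the observed reward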


congress on evolutionary computation | 2015

Annealing linear scalarized based multi-objective multi-armed bandit algorithm

Saba Q. Yahyaa; Madalina M. Drugan; Bernard Manderick

A stochastic multi-objective multi-armed bandit problem is a particular type of multi-objective (MO) optimization problem where the goal is to find the optimal arms and play them fairly. To solve it, we propose an annealing linear scalarized algorithm that transforms the MO optimization problem into a single-objective one by using a linear scalarization function, and that finds and plays the optimal arms fairly by using a decaying parameter ε_t. We compare empirically the linear scalarized-UCB1 algorithm with the annealing linear scalarized algorithm on a test suite of multi-objective multi-armed bandit problems with independent Bernoulli distributions, using different approaches to define the weight sets: the standard approach, the adaptive approach and the genetic approach. We conclude that the performance of the annealing scalarized and scalarized-UCB1 algorithms depends on the weight approach used.
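
The "standard approach" to the weight set mentioned above fixes an evenly spaced grid of weight vectors in advance; the adaptive and genetic approaches replace this fixed grid with weights generated on the fly. A bi-objective sketch of the fixed grid, with an assumed helper name:

import numpy as np

def standard_weight_set(num_weights):
    # Evenly spaced weight vectors (w, 1 - w) for a bi-objective problem.
    return [np.array([w, 1.0 - w]) for w in np.linspace(0.0, 1.0, num_weights)]

# 11 fixed weight vectors: (0.0, 1.0), (0.1, 0.9), ..., (1.0, 0.0).
# Adaptive and genetic variants would instead mutate or re-fit these weights
# based on which Pareto optimal arms they have identified so far.
print(standard_weight_set(11))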


international conference on agents and artificial intelligence | 2014

Online Knowledge Gradient Exploration in an Unknown Environment

Saba Q. Yahyaa; Bernard Manderick

We present online kernel-based LSPI (least squares policy iteration), an extension of offline kernel-based LSPI. Online kernel-based LSPI combines characteristics of both online LSPI and offline kernel-based LSPI to improve the convergence rate as well as the optimal-policy performance of online LSPI. It uses the knowledge gradient policy as an exploration policy and the approximate linear dependency based kernel sparsification method to select features automatically. We compare the optimal-policy performance of online kernel-based LSPI and online LSPI on 4 discrete Markov decision problems, where online kernel-based LSPI outperforms online LSPI.
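
The approximate-linear-dependency (ALD) sparsification mentioned above admits into the feature dictionary only those samples that cannot be approximated well, in the kernel's feature space, by the dictionary collected so far. A rough sketch under assumed parameter names, using an RBF kernel and tolerance nu:

import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def ald_test(dictionary, x, nu=0.1, gamma=1.0):
    # Add x as a new basis function only if the current dictionary cannot
    # approximate it (in feature space) within tolerance nu.
    if not dictionary:
        return True
    K = np.array([[rbf_kernel(a, b, gamma) for b in dictionary] for a in dictionary])
    k_vec = np.array([rbf_kernel(a, x, gamma) for a in dictionary])
    coeffs = np.linalg.solve(K, k_vec)
    delta = rbf_kernel(x, x, gamma) - k_vec @ coeffs
    return delta > nu

# Build the set of basis functions online from a stream of visited states.
dictionary = []
for state in np.random.default_rng(2).uniform(-1.0, 1.0, size=(50, 2)):
    if ald_test(dictionary, state):
        dictionary.append(state)
print(len(dictionary), "basis functions selected")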


international conference on agents and artificial intelligence | 2014

Knowledge Gradient for Online Reinforcement Learning

Saba Q. Yahyaa; Bernard Manderick

The most interesting challenge for a reinforcement learning agent is to learn online in an unknown, large discrete or continuous stochastic model. The agent not only has to trade off between exploration and exploitation, but also has to find a good set of basis functions to approximate the value function. We extend offline kernel-based LSPI (least squares policy iteration) to online learning. Online kernel-based LSPI combines features of offline kernel-based LSPI and online LSPI: it uses the knowledge gradient policy as an exploration policy to trade off between exploration and exploitation, and the approximate linear dependency based kernel sparsification method to select basis functions automatically. We compare online kernel-based LSPI and online LSPI on 5 discrete Markov decision problems, where online kernel-based LSPI outperforms online LSPI in terms of optimal-policy performance.


Archive | 2015

Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem

Saba Q. Yahyaa; Bernard Manderick

Collaboration


Dive into Saba Q. Yahyaa's collaborations.

Top Co-Authors

Bernard Manderick

Vrije Universiteit Brussel
