Shalabh Bhatnagar | Researchain

Publication

Featured researches published by Shalabh Bhatnagar.

Automatica | 2009

Natural actor-critic algorithms

Shalabh Bhatnagar; Richard S. Sutton; Mohammad Ghavamzadeh; Mark Lee

We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
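As a rough illustration of the ingredients named in this abstract, the sketch below performs one incremental actor-critic update in which a temporal-difference error drives both the critic and the actor. It assumes an average-reward formulation, a tabular softmax policy, and fixed step sizes; the name `actor_critic_step` and all parameter values are hypothetical choices for the example. The natural-gradient variants additionally precondition the actor step, which is not shown here.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, v, avg_reward, s, a, r, s_next,
                      alpha=0.1, beta=0.01, xi=0.1):
    """One incremental update (illustrative, not the paper's exact algorithm).

    theta[s][a] are tabular policy preferences and v[s] are state values;
    both are modified in place. beta < alpha mimics the slower actor timescale.
    """
    # Running estimate of the average reward.
    avg_reward = (1.0 - xi) * avg_reward + xi * r
    # Temporal-difference error in the average-reward setting.
    td_error = r - avg_reward + v[s_next] - v[s]
    # Action probabilities under the current policy, before any update.
    pi = softmax(theta[s])
    # Critic: TD(0) update of the value estimate.
    v[s] += alpha * td_error
    # Actor: stochastic gradient step along td_error * grad log pi(a|s).
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += beta * td_error * grad_log
    return avg_reward, td_error
```

Using the TD error in place of a full return in the actor step is what the abstract credits with reducing the variance of the gradient estimates.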

IEEE Transactions on Intelligent Transportation Systems | 2011

Reinforcement Learning With Function Approximation for Traffic Signal Control

L A Prashanth; Shalabh Bhatnagar

We propose, for the first time, a reinforcement learning (RL) algorithm with function approximation for traffic signal control. Our algorithm incorporates state-action features and is easily implementable in high-dimensional settings. Prior work, e.g., the work of Abdulhai, on the application of RL to traffic signal control requires full-state representations and cannot be implemented, even in moderate-sized road networks, because the computational complexity grows exponentially in the numbers of lanes and junctions. We tackle this curse of dimensionality by using feature-based state representations that characterize the level of congestion broadly as low, medium, or high. One advantage of our algorithm is that, unlike prior work based on RL, it does not require precise information on queue lengths and elapsed times at each lane but instead works with the aforementioned features. The number of features that our algorithm requires is linear in the number of signaled lanes, thereby leading to a reduction of several orders of magnitude in computational complexity. We implement our algorithm in various settings and compare its performance with other algorithms in the literature, including the works of Abdulhai and Cools, as well as the fixed-timing and the longest-queue algorithms. For comparison, we also develop an RL algorithm that uses full-state representation and incorporates prioritization of traffic, unlike the work of Abdulhai. We observe that our algorithm outperforms all the other algorithms on all the road network settings that we consider.
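The idea of coarse congestion features can be sketched as follows. This is a minimal illustration, not the paper's feature design: the thresholds, the rule "a lane contributes its congestion level only while it is held at red", and the function names are all assumptions. It shows one feature per lane and a linear Q-learning step for a cost-minimising agent.

```python
def congestion_level(queue_length, low_threshold=5, high_threshold=15):
    """Map a raw queue length to a coarse level: 0 = low, 1 = medium, 2 = high.

    The two thresholds are hypothetical values chosen for the example.
    """
    if queue_length < low_threshold:
        return 0
    if queue_length < high_threshold:
        return 1
    return 2

def lane_features(queue_lengths, green_lanes):
    """One feature per lane (assumed rule): congestion level if red, 0 if green."""
    return [0.0 if i in green_lanes else float(congestion_level(q))
            for i, q in enumerate(queue_lengths)]

def q_value(weights, phi):
    """Linear approximation Q(s, a) = weights . phi(s, a)."""
    return sum(w * x for w, x in zip(weights, phi))

def q_learning_update(weights, phi, cost, next_phis, alpha=0.05, gamma=0.9):
    """One Q-learning step with linear function approximation.

    next_phis holds the feature vector of every action available in the
    next state; the agent minimises cost, hence the min over next actions.
    """
    target = cost + gamma * min(q_value(weights, p) for p in next_phis)
    td_error = target - q_value(weights, phi)
    return [w + alpha * td_error * x for w, x in zip(weights, phi)]
```

Because the weight vector has one entry per lane, storage grows linearly with the number of lanes rather than exponentially as with a full-state table.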

ACM Transactions on Modeling and Computer Simulation | 2003

Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences

Shalabh Bhatnagar; Michael C. Fu; Steven I. Marcus; I-Jeng Wang

Simultaneous perturbation stochastic approximation (SPSA) algorithms have been found to be very effective for high-dimensional simulation optimization problems. The main idea is to estimate the gradient using simulation output performance measures at only two settings of the N-dimensional parameter vector being optimized rather than at the N + 1 or 2N settings required by the usual one-sided or symmetric difference estimates, respectively. The two settings of the parameter vector are obtained by simultaneously changing the parameter vector in each component direction using random perturbations. In this article, in order to enhance the convergence of these algorithms, we consider deterministic sequences of perturbations for two-timescale SPSA algorithms. Two constructions for the perturbation sequences are considered: complete lexicographical cycles and much shorter sequences based on normalized Hadamard matrices. Recently, one-simulation versions of SPSA have been proposed, and we also investigate these algorithms using deterministic sequences. Rigorous convergence analyses for all proposed algorithms are presented in detail. Extensive numerical experiments on a network of M/G/1 queues with feedback indicate that the deterministic sequence SPSA algorithms perform significantly better than the corresponding randomized algorithms.
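A minimal sketch of the two-measurement gradient estimate with a deterministic perturbation cycle follows. It assumes the Sylvester construction of a Hadamard matrix and takes columns 1 to N of that matrix as the perturbation directions; the function names and the perturbation size `c` are hypothetical, and the two-timescale averaging of noisy simulation output is omitted.

```python
def hadamard(order):
    """Sylvester construction of a +/-1 Hadamard matrix; order must be a power of two."""
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def perturbation_cycle(n):
    """Deterministic +/-1 perturbation vectors for an n-dimensional parameter.

    Uses the smallest power-of-two order >= n + 1 and drops the all-ones
    first column (an assumption made for this sketch).
    """
    order = 1
    while order < n + 1:
        order *= 2
    return [row[1:n + 1] for row in hadamard(order)]

def spsa_gradient(f, theta, delta_vec, c=1e-2):
    """Two-sided SPSA estimate: two evaluations of f regardless of dimension."""
    plus = [t + c * d for t, d in zip(theta, delta_vec)]
    minus = [t - c * d for t, d in zip(theta, delta_vec)]
    diff = f(plus) - f(minus)
    return [diff / (2.0 * c * d) for d in delta_vec]

def averaged_gradient(f, theta, c=1e-2):
    """Average the SPSA estimate over one full deterministic cycle."""
    cycle = perturbation_cycle(len(theta))
    total = [0.0] * len(theta)
    for delta_vec in cycle:
        g = spsa_gradient(f, theta, delta_vec, c)
        total = [a + b for a, b in zip(total, g)]
    return [a / len(cycle) for a in total]
```

For a noise-free quadratic objective, the cross terms cancel exactly over one cycle because the chosen columns are mutually orthogonal, so the averaged estimate equals the true gradient up to rounding. With random perturbations the same cancellation holds only in expectation.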

ACM Transactions on Modeling and Computer Simulation | 2005

Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization

Shalabh Bhatnagar

In this article, we develop four adaptive three-timescale stochastic approximation algorithms for simulation optimization that estimate both the gradient and Hessian of average cost at each update epoch. These algorithms use four, three, two, and one simulation(s), respectively, and update the values of the decision variable and Hessian matrix components simultaneously, with estimates based on the simultaneous perturbation methodology. Our algorithms use coupled stochastic recursions that proceed using three different timescales or step-size schedules. We present a detailed convergence analysis of the algorithms and show numerical experiments using all the developed algorithms on a two-node network of M/G/1 queues with feedback for a 50-dimensional parameter vector. We provide comparisons of the performance of these algorithms with two recently developed two-timescale steepest descent simultaneous perturbation analogs that use randomized and deterministic perturbation sequences, respectively. We also present experiments to explore the sensitivity of the algorithms to their associated parameters. The algorithms that use four and three simulations, respectively, perform significantly better than the rest of the algorithms.

ACM Transactions on Modeling and Computer Simulation | 2007

Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization

Shalabh Bhatnagar

In this article, we present three smoothed functional (SF) algorithms for simulation optimization. While one of these estimates only the gradient by using a finite difference approximation with two parallel simulations, the other two are adaptive Newton-based stochastic approximation algorithms that estimate both the gradient and Hessian. One of the Newton-based algorithms uses only one simulation and has a one-sided estimate in both the gradient and Hessian, while the other uses two-sided estimates in both quantities and requires two simulations. For obtaining gradient and Hessian estimates, we perturb each parameter component randomly using independent and identically distributed (i.i.d.) Gaussian random variates. The earlier SF algorithms in the literature only estimate the gradient of the objective function. Using similar techniques, we derive two unbiased SF-based estimators for the Hessian and develop suitable three-timescale stochastic approximation procedures for simulation optimization. We present a detailed convergence analysis of our algorithms and show numerical experiments with parameters of dimension 50 on a setting involving a network of M/G/1 queues with feedback. We compare the performance of our algorithms with related algorithms in the literature. While our two-simulation Newton-based algorithm shows the best results overall, our one-simulation algorithm shows better performance compared to other one-simulation algorithms.
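The two-simulation gradient estimate mentioned at the start of this abstract can be sketched as below. This is an illustrative batch version: it averages many Gaussian-perturbed finite differences at a fixed parameter, whereas a stochastic approximation scheme would use one perturbation per iteration. The function name, the smoothing parameter `beta`, and the sample count are assumptions, and the Hessian estimators are not shown.

```python
import random

def sf_gradient(f, theta, beta=0.1, num_samples=1000, rng=None):
    """Two-sided smoothed functional gradient estimate.

    Each sample perturbs every component of theta with an i.i.d. N(0, 1)
    variate eta and uses eta * (f(theta + beta*eta) - f(theta - beta*eta)) / (2*beta).
    """
    rng = rng or random.Random(0)  # fixed seed by default, for reproducibility
    n = len(theta)
    total = [0.0] * n
    for _ in range(num_samples):
        eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
        plus = [t + beta * e for t, e in zip(theta, eta)]
        minus = [t - beta * e for t, e in zip(theta, eta)]
        scale = (f(plus) - f(minus)) / (2.0 * beta)
        for i in range(n):
            total[i] += eta[i] * scale
    return [x / num_samples for x in total]
```

For a linear objective the estimator is unbiased for any `beta`; for general objectives it estimates the gradient of a Gaussian-smoothed version of `f`, so a smaller `beta` trades bias for variance.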

IIE Transactions | 2001

Two-timescale algorithms for simulation optimization of hidden Markov models

Shalabh Bhatnagar; Michael C. Fu; Steven I. Marcus; Shashank Bhatnagar

We propose two finite difference two-timescale Simultaneous Perturbation Stochastic Approximation (SPSA) algorithms for simulation optimization of hidden Markov models. Stability and convergence of both algorithms are proved. Numerical experiments on a queueing model with high-dimensional parameter vectors demonstrate orders of magnitude faster convergence using these algorithms over related (N + 1)-simulation finite difference analogues and another two-simulation finite difference algorithm that updates in cycles.

IEEE Transactions on Automatic Control | 2004

A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes

Shalabh Bhatnagar; Shishir Kumar

A two-timescale simulation-based actor-critic algorithm for solution of infinite horizon Markov decision processes with finite state and compact action spaces under the discounted cost criterion is proposed. The algorithm does gradient search on the slower timescale in the space of deterministic policies and uses simultaneous perturbation stochastic approximation-based estimates. On the faster scale, the value function corresponding to a given stationary policy is updated and averaged over a fixed number of epochs (for enhanced performance). The proof of convergence to a locally optimal policy is presented. Finally, numerical experiments using the proposed algorithm on flow control in a bottleneck link using a continuous time queueing model are shown.

Multimedia Systems | 2008

An efficient ad recommendation system for TV programs

Sudha Velusamy; Lakshmi Gopal; Shalabh Bhatnagar; Sridhar Varadarajan

With broadcast Television (TV) going digital, the number of channels and the programs aired have increased tremendously. Millions of viewers in various categories such as adults, children, youth and families watch these programs. Advertisements (ads) aired during these programs are targeted to reach these varied audiences and are the main revenue earners for TV broadcasters. While TV broadcasters have the task of scheduling hundreds of ads during the various ad breaks of programs, it is important that the ads shown during any ad break have a good impact on the viewers. An intelligent ad recommendation system that takes into account various factors such as ad/program content, viewers’ interests, sponsors’ preferences, program timing, program popularity and the available ad slots, and thereby helps increase ad revenue, would be useful for sponsors and broadcasters. We present in this paper a single end-to-end ad recommender system that considers all of these factors and recommends a set of well scheduled and sequenced ads that are the best suited for a given TV ad break. The proposed recommendation system captures the features of the ad video in terms of annotations derived from MPEG-7 descriptions and these annotation keywords are systematically grouped into a number of pre-defined semantic categories by using a categorization technique. A fuzzy categorical data clustering technique is then applied on the categorized data for grouping the best suited ads for a set of pre-defined program classes such as News, Sports, Cartoons etc. The program classes considered are selected to match with the TV program genres proposed in the TV-anytime standard. Since the same ad can be recommended to more than one program depending upon multiple parameters, fuzzy clustering acts as a well suited (and perhaps also the best suited) technique for ad recommendation. The relative fuzzy score called “degree of membership” calculated for each ad is an indicator of the number of program clusters to which the given ad belongs. The clustered ads are then scheduled using an algorithm that takes into consideration parameters such as program popularity, program timing and available ad slots, to provide the best possible package for sponsors to show their ads. The scheduled set of ads, if played randomly during an ad break, might make viewers (sponsors) unhappy, for instance, when similar (competing) product ads get played consecutively. Hence, the system employs a sequencing algorithm that takes into account the pre- and post-ad sequences in order to better order the scheduled set of ads in any ad break. We show that our proposed recommendation system provides an effective way of recommending the right ads for broadcast TV programs. We also demonstrate that this strategy does indeed help sponsors to attract viewers’ attention while playing their ads during ad breaks of TV programs. The proposed ad recommendation system is compared and evaluated subjectively with the current ad display system, by ten different people, and is rated with a high success score.

Simulation | 2003

Multiscale Chaotic SPSA and Smoothed Functional Algorithms for Simulation Optimization

Shalabh Bhatnagar; Vivek S. Borkar

The authors propose a two-timescale version of the one-simulation smoothed functional (SF) algorithm with extra averaging. They also propose the use of a simple chaotic deterministic iterative sequence for generating random samples for averaging. This sequence is used for generating the N independent and identically distributed (i.i.d.) Gaussian random variables in the SF algorithm. The convergence analysis of the algorithms is also briefly presented. The authors show numerical experiments on the chaotic sequence and compare performance with a good pseudo-random generator. Next they show experiments, using all the algorithms, in two different settings: a network of M/G/1 queues with feedback, and the problem of finding a closed-loop optimal policy (within a prespecified class) in the available bit rate (ABR) service in asynchronous transfer mode (ATM) networks. The authors observe that algorithms that use the chaotic sequence show better performance in most cases than those that use the pseudo-random generator.

IEEE Wireless Communications Letters | 2013

Q-Learning Based Energy Management Policies for a Single Sensor Node with Finite Buffer

K J Prabuchandran; Sunil Kumar Meena; Shalabh Bhatnagar

In this paper, we consider the problem of finding optimal energy management policies in the presence of energy harvesting sources to maximize network performance. We formulate this problem in the discounted cost Markov decision process framework and apply two reinforcement learning algorithms. Prior work obtains the optimal policy in the case when the conversion function mapping energy to data transmitted is linear and provides heuristic policies in the case when it is nonlinear. Our algorithms, however, provide optimal policies regardless of the form of the conversion function. In simulations, our policies are seen to outperform those of prior work in the nonlinear case.
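A tabular Q-learning update of the kind this abstract refers to might look like the sketch below. The state encoding (a tuple of energy level and data-buffer level), the action set (number of energy units to spend), the reward-maximising convention, and all names and constants are assumptions made for illustration. A discounted-cost formulation would replace `max` with `min` and rewards with costs.

```python
import random

def choose_energy(q_table, state, max_energy, epsilon, rng):
    """Epsilon-greedy choice of how many energy units to spend (0..max_energy)."""
    actions = list(range(max_energy + 1))
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Greedy action; unseen state-action pairs default to a value of 0.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def q_update(q_table, state, action, reward, next_state, next_max_energy,
             alpha=0.1, gamma=0.9):
    """Standard Q-learning step on a dictionary-backed table (modified in place)."""
    best_next = max(q_table.get((next_state, a), 0.0)
                    for a in range(next_max_energy + 1))
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]
```

The update only needs observed rewards, so it makes no assumption about whether the function converting energy into transmitted data is linear or nonlinear.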
