
Publications


Featured research published by Rahul Meshram.


Conference on Decision and Control | 2015

A restless bandit with no observable states for recommendation systems and communication link scheduling

Rahul Meshram; D. Manjunath; Aditya Gopalan

A restless bandit is used to model a user's interest in a topic or item. The interest evolves as a Markov chain whose transition probabilities depend on the action (display the ad or desist) taken in a time step. A unit reward is obtained if the ad is displayed and the user clicks on it; if no ad is displayed, a fixed reward is assumed. The probability of a click-through is determined by the state of the Markov chain. The recommender never observes the state, but in each time step it holds a belief π_t about the state of the Markov chain; π_t evolves as a function of the action and of the signal emitted from each state. For the one-armed restless bandit with two states, we characterize the policy that maximizes the infinite horizon discounted reward. We first characterize the value function as a function of the system parameters and then characterize the optimal policies for different ranges of the parameters. The Gilbert-Elliot channel, in which the two states have different success probabilities, emerges as a special case. For one special case, we argue that the optimal policy is of the threshold type with a single threshold; extensive numerical results indicate that this may be true in general.
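
As a reading aid, here is a minimal sketch of the belief update π_t described in the abstract, under assumed names: rho[s] for the click probability in state s, and P_play / P_rest for hypothetical 2x2 action-dependent transition matrices. It illustrates the model, not the paper's code.

    def belief_update(pi, action, clicked, rho, P_play, P_rest):
        """pi = P(state = 1) before acting; returns the next belief."""
        if action == "display":
            # Bayes step: condition on the observed click / no-click signal.
            if clicked:
                num = pi * rho[1]
                den = pi * rho[1] + (1 - pi) * rho[0]
            else:
                num = pi * (1 - rho[1])
                den = pi * (1 - rho[1]) + (1 - pi) * (1 - rho[0])
            post = num / den
            P = P_play
        else:
            # No signal when the ad is not displayed; the belief only drifts.
            post = pi
            P = P_rest
        # Propagate through the action-dependent Markov transition.
        return post * P[1][1] + (1 - post) * P[0][1]

A single-threshold policy of the kind the paper characterizes would then display the ad exactly when π_t exceeds a cutoff value.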


Conference on Decision and Control | 2016

Optimal recommendation to users that react: Online learning for a class of POMDPs

Rahul Meshram; Aditya Gopalan; D. Manjunath

We describe and study a model for an Automated Online Recommendation System (AORS) in which a user's preferences can be time-dependent and can also depend on the history of past recommendations and play-outs. The three key features of the model that make it more realistic than existing models for recommendation systems are: (1) user preference is inherently latent, (2) current recommendations can affect future preferences, and (3) it allows for the development of learning algorithms with provable performance guarantees. The problem is cast as an average-cost restless multi-armed bandit for a given user, with an independent partially observable Markov decision process (POMDP) for each item of content. We analyze the POMDP for a single arm, describe its structural properties, and characterize its optimal policy. We then develop a Thompson sampling-based online reinforcement learning algorithm that learns the parameters of the model and optimizes utility from the binary responses of users to continuous recommendations. We analyze the performance of the learning algorithm and characterize its regret. Illustrative numerical results and directions for extension to the restless hidden Markov multi-armed bandit problem are also presented.
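
To illustrate the Thompson-sampling loop at the heart of the learning algorithm, here is a deliberately simplified sketch that treats each item's like-probability as a fixed Bernoulli parameter with a Beta posterior; the paper's actual model has hidden Markov state dynamics, which this toy omits. All names are hypothetical.

    import random

    def thompson_recommend(alpha, beta_, n_items):
        """Sample one plausible like-probability per item; pick the best."""
        samples = [random.betavariate(alpha[i], beta_[i]) for i in range(n_items)]
        return max(range(n_items), key=lambda i: samples[i])

    def run(n_items, horizon, true_p):
        alpha = [1.0] * n_items            # Beta(1, 1) priors on each item
        beta_ = [1.0] * n_items
        for _ in range(horizon):
            i = thompson_recommend(alpha, beta_, n_items)
            liked = random.random() < true_p[i]   # binary user response
            alpha[i] += 1.0 if liked else 0.0     # conjugate posterior update
            beta_[i] += 0.0 if liked else 1.0
        return alpha, beta_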


Communication Systems and Networks | 2017

Restless bandits that hide their hand and recommendation systems

Rahul Meshram; Aditya Gopalan; D. Manjunath

We consider a restless multi-armed bandit (RMAB) in which each arm can be in one of two states, say 0 or 1. Playing an arm brings it to state 0 with probability one, while not playing it induces state transitions with arm-dependent probabilities. Playing an arm generates a unit reward with a probability that depends on the state of the arm. The belief about the state of an arm can be computed via a Bayesian update after every play. This RMAB is designed for use in recommendation systems, which in turn can serve applications such as creating playlists or placing advertisements. In this paper we analyse the RMAB by first showing that it is Whittle-indexable and then obtaining a closed-form expression for the Whittle index of each arm, computed from the belief about its state and the parameters that describe the arm. For an RMAB to be useful in practice, we need to be able to learn the parameters of the arms. We present an algorithm derived from the Thompson sampling scheme that learns the parameters of the arms, and we evaluate its performance numerically.
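
The following skeleton shows how a Whittle index policy of this kind would be used, with the paper's closed-form index left as a stub (whittle_index) and the belief dynamics following the stated model: a played arm moves to state 0 with probability one, while an unplayed arm evolves with arm-dependent probabilities. All names are illustrative.

    def step(beliefs, P_rest, whittle_index, params):
        """One decision epoch. beliefs[i] = P(arm i is in state 1);
        P_rest[i] = (p01, p11) governs an arm that is not played."""
        # Play the arm with the largest index at its current belief.
        idx = [whittle_index(b, params[i]) for i, b in enumerate(beliefs)]
        i_star = max(range(len(beliefs)), key=lambda i: idx[i])
        for i in range(len(beliefs)):
            if i == i_star:
                beliefs[i] = 0.0    # played arm moves to state 0 w.p. one
            else:
                p01, p11 = P_rest[i]
                beliefs[i] = beliefs[i] * p11 + (1 - beliefs[i]) * p01
        return i_star, beliefs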


Twenty Second National Conference on Communication (NCC) | 2016

Power control over Gilbert-Elliot channel with no observable states

Rahul Meshram

A dynamic communication channel is modeled as a Markov chain whose states describe the quality of the channel; the two-state Gilbert-Elliot channel is one such example. The state of the channel is never observed by the transmitter, but success and failure are observed with probabilities that depend on the state of the channel. The information available to the transmitter is the current belief about the state, which is updated based on the action and the observed signal. The transmitter wants to send a packet over the channel, choosing among different power control schemes in each slot to maximize the long-term discounted reward. We formulate this as an infinite horizon discounted reward problem, write a dynamic program, and derive properties of the value function. For a special case, we show that the optimal policy has a single threshold. Further, we present a few numerical examples to illustrate this.
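
A rough value-iteration sketch of the dynamic program described above, on a discretized belief grid. Here succ[a][s] is an assumed success probability of power level a in channel state s, cost[a] its power cost, and (p01, p11) the channel transition probabilities; these names are illustrative, not from the paper.

    import numpy as np

    def solve(succ, cost, p01, p11, gamma=0.9, n_grid=201, iters=500):
        grid = np.linspace(0.0, 1.0, n_grid)   # belief that channel is "good"
        V = np.zeros(n_grid)
        nearest = lambda b: int(round(b * (n_grid - 1)))
        for _ in range(iters):
            V_new = np.empty(n_grid)
            for k, b in enumerate(grid):
                q = []
                for a in range(len(cost)):
                    ps = b * succ[a][1] + (1 - b) * succ[a][0]  # P(success)
                    # Bayes update of the belief after success / failure,
                    # then one step of the channel Markov chain.
                    b_s = b * succ[a][1] / ps if ps > 0 else b
                    b_f = b * (1 - succ[a][1]) / (1 - ps) if ps < 1 else b
                    nb_s = b_s * p11 + (1 - b_s) * p01
                    nb_f = b_f * p11 + (1 - b_f) * p01
                    # Expected reward (1 per success) minus power cost.
                    q.append(ps - cost[a] + gamma * (ps * V[nearest(nb_s)]
                                                     + (1 - ps) * V[nearest(nb_f)]))
                V_new[k] = max(q)
            V = V_new
        return grid, V

With such a solver, the single-threshold structure shown in the paper's special case would appear as the chosen action changing exactly once as the belief sweeps from 0 to 1.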


International Conference on Communications | 2017

Relay employment problem for unacknowledged transmissions: Myopic policy and structure

Kesav Kaza; Rahul Meshram; S. N. Merchant

The idea of a D2D relay has received much interest recently as an essential ingredient of next-generation networks. Future networks with user relay assistance will face relay employment issues, given the trade-offs between throughput gain and cost incurred. In this work, we formulate the relay employment problem, in which a source assesses a candidate relay's "employability" by accounting for channel evolution over time in an unacknowledged transmission mode. We present a myopic policy that takes an initial belief about the channel states as input and outputs a recommended sequence of actions; the sequence specifies whether or not to use the relay at each time slot. The myopic policy has different structures that impact the gain obtained by the source. We analyze these policy structures and provide sufficient conditions for each of them. The myopic policy is compared with a one-step decision policy that does not account for channel evolution. Numerical results show that the myopic policy obtains relative gains of up to 30% over the one-step decision policy.
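
A hedged sketch of a myopic rule in the spirit of the policy above: with unacknowledged transmissions there is no feedback, so the beliefs evolve deterministically and the whole action sequence can be computed up front. The per-state success probabilities, relay cost, and two-channel structure below are assumptions for illustration only.

    def myopic_sequence(b_direct, b_relay, P_d, P_r, succ_d, succ_r,
                        c_relay, horizon):
        """b_* = P(channel is in the good state); P_* = (p01, p11)."""
        actions = []
        for _ in range(horizon):
            # Expected one-step reward of each action at the current beliefs.
            r_direct = b_direct * succ_d[1] + (1 - b_direct) * succ_d[0]
            r_relay = (b_relay * succ_r[1] + (1 - b_relay) * succ_r[0]) - c_relay
            actions.append("relay" if r_relay > r_direct else "direct")
            # No ACKs: beliefs simply follow the channel Markov chains.
            b_direct = b_direct * P_d[1] + (1 - b_direct) * P_d[0]
            b_relay = b_relay * P_r[1] + (1 - b_relay) * P_r[0]
        return actions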


Communication Systems and Networks | 2017

A Hidden Markov Restless Multi-armed Bandit Model for Playout Recommendation Systems

Rahul Meshram; Aditya Gopalan; D. Manjunath

We consider a restless multi-armed bandit (RMAB) in which there are two types of arms, say A and B. Each arm can be in one of two states, say 0 or 1. Playing a type A arm brings it to state …


National Conference on Communications | 2014

In-network online asynchronous regression over a wireless network

Rahul Meshram



International Conference on Signal Processing | 2012

Problem-aware scheduling of in-network computation over multi-hop wireless networks

Rahul Meshram; Vivek S. Borkar; D. Manjunath



IEEE Transactions on Automatic Control | 2018

On the Whittle Index for Restless Multiarmed Hidden Markov Bandits

Rahul Meshram; D. Manjunath; Aditya Gopalan



Wireless Communications and Networking Conference | 2018

Restless bandits with cumulative feedback: Applications in wireless networks

Kesav Kaza; Varun Mehta; Rahul Meshram; S. N. Merchant


Collaboration


Dive into Rahul Meshram's collaborations.

Top Co-Authors

Kesav Kaza, Indian Institute of Technology Bombay
S. N. Merchant, Indian Institute of Technology Bombay
D. Manjunath, Indian Institute of Technology Bombay
Varun Mehta, Indian Institute of Technology Bombay
Aditya Gopalan, Indian Institute of Science
Vivek S. Borkar, Indian Institute of Technology Bombay