Monte Carlo Rollout Policy for Recommendation Systems with Dynamic User Behavior
Rahul Meshram and Kesav Kaza
Email: [email protected], [email protected]
Abstract— We model online recommendation systems using the hidden Markov multi-state restless multi-armed bandit problem. To solve it we present the Monte Carlo rollout policy. We illustrate numerically that the Monte Carlo rollout policy performs better than the myopic policy for arbitrary transition dynamics with no specific structure. But, when some structure is imposed on the transition dynamics, the myopic policy performs better than the Monte Carlo rollout policy.
I. INTRODUCTION
Online recommendation systems (RS) are extensively used by multimedia hosting platforms, e.g., YouTube and Spotify, and entertainment services, e.g., Netflix, Amazon Prime, etc. These systems create personalized playlists for users based on behavioral information from individual watch history and also by harvesting information from social networking sites. In this paper we provide new models for user behavior and algorithms for recommendation.

Most often, playlists are generated by solving a “matrix completion” problem, and items are recommended to users based on their past preferences. It is implicitly assumed that user interest is static and that the current recommendation does not influence the future behavior of user interest. So a playlist generated this way does not take into account the dynamic behavior of, or changes in, user interest triggered by the current recommendation. In this paper we study a playlist generation system as a recommendation system where the playlist is generated using the immediate dynamic behavior of user interest. The user responds to different items differently, and this behavior depends on the play history along with some element of randomness in the preferences.

We consider a Markov model for user interest or preferences, where a state describes the intensity level of the preference. (The Markov model is an approximation of the dynamic behavior of user interest made to keep the analysis tractable; in general, user behavior can be more complex and requires further investigation.) A higher state means a higher level of interest in an item. The user behavior for an item is determined by the transition dynamics for that item. We assume that the user provides binary feedback upon the play of an item, and no feedback when it is not played; the likelihood of observing feedback is state dependent. (In general there can be other forms of feedback, such as the user stopping a video partway through, etc.)

The user interest moves to different states with different probabilities after an item is played. User interest for an item returns to a fixed state whenever it is not played. An item for which user interest stays in a higher state with high probability after its play is referred to as a viral item.
For certain items, user interest drops immediately after playing them; these are referred to as normal items. Our objective is to model and analyze the diverse behavior of user interest for different items, and to generate a dynamic playlist using binary feedback. Note that the state of user interest is not observable by the recommendation system; this is an example of a multi-state hidden Markov model. Our model is a generalization of the two-state hidden Markov model in [1]. This paper studies a playlist generation system using a multi-state hidden Markov model.

We make the following contributions in this paper.
1) We model a playlist generation (recommendation) system as a hidden Markov multi-state restless multi-armed bandit problem. We present a four-state model, and an item in an RS is modeled using a POMDP. This is given in Section II.
2) We present the following solution approaches in Section III: the myopic policy, the Monte Carlo rollout policy, and the Whittle-index policy. The Whittle-index policy has limited applicability due to the lack of an explicit index formula for multi-state hidden Markov bandits.
3) We discuss numerical examples in Section IV, presenting numerical results for the myopic and Monte Carlo rollout policies. Our first numerical example illustrates that the myopic policy performs better than the Monte Carlo rollout policy whenever the transition probabilities of the interest states have a specific structure such as stochastic dominance. But the myopic policy performs poorly compared to the Monte Carlo rollout policy whenever no such structure is imposed on the model; this is demonstrated in the second numerical example. In the third example, we compare the Monte Carlo rollout policy with the Whittle index policy and observe that the Monte Carlo policy performs better, due to the approximations involved in the index calculations.
A. Related Work
Recommendation systems are often studied using collaborative filtering methods [2]–[4]. Matrix factorization (MF) is one such method employed in collaborative filtering [5], [6]. The idea is to represent a matrix whose rows and columns correspond to users and items, where each entry describes the user rating for an item; the MF method then transforms a large dimensional matrix into a lower dimensional one. Machine learning techniques are used in MF and collaborative filtering [7]. Recommendation system ideas are also inspired by work on the matrix completion problem [8]. All of these models are based on data obtained from previous recommendations, i.e., historical data. These works assume that user preferences are static and do not take into account the dynamic behavior of the user driven by feedback from preceding recommendations.

Recently, there is another body of work on modeling online recommendation systems. This work is inspired from online learning with bandit algorithms [9]–[11]; it uses contextual epsilon-greedy algorithms for news recommendation. Another way to model online recommendation systems such as playlist generation systems is restless multi-armed bandits [1]. In all these systems, user interest evolves dynamically and this evolution depends on whether an item is recommended or not.

We now describe some related work on RMAB, hidden Markov RMAB and their solution methodologies. RMAB is extensively studied for various applications in communication systems, queueing networks and resource allocation problems [12], [13]. The RMAB problem is NP-hard [14], but heuristic index based policies are used. To use such an index based policy, one requires structure on the dynamics of the restless bandits. This can be a limitation for hidden Markov RMAB, where each restless bandit is modeled using a POMDP and it is very difficult to obtain structural results. This motivates us to look for an alternative policy, and the Monte Carlo rollout policy is studied in this work. Monte Carlo rollout policies have been developed for complex Markov decision processes in [15]–[18].
II. ONLINE RECOMMENDATION SYSTEM AS RESTLESS MULTI-ARMED BANDITS
We present models of online recommendation systems (RS). There are different types of items to be recommended, and a model for each type describes the specific user behavior for that type of item. We consider a four state model where a state represents the user interest for an item. The states are called Low interest (L), Medium interest (M), High interest (H) and Very high interest (V). Thus, the state space is S = {L, M, H, V}. The RS can play an item or not play that item, and the state evolution of user interest for an item depends on this action. There are two actions for each item, play or not play, i.e., A = {0, 1}, where 0 corresponds to not playing and 1 corresponds to playing the item.

We suppose that the RS gets a binary observation signal, i.e., 1 for like and 0 for dislike. In general, the RS can have more than two signals as observations, but for simplicity we consider only two. (These observations reflect the actions of the user based on their interest: for example, the user may skip an item when he dislikes it, or watch it completely when he likes it. More signals correspond to more possible actions from the user.) The RS cannot directly observe user interest for items, and hence the state of each item is not observable. When an item is played, the user clicks on either the like or the dislike (skip) button with probability ρ_i, and this click-through probability depends on the current state i of user interest for that item but not on the state of any other item. Whenever the user clicks, the RS accrues a unit reward with probability ρ_i for i ∈ S. Further, we assume ρ_L < ρ_M < ρ_H < ρ_V. Thus, each item can be modeled as a partially observable Markov decision process (POMDP) with finitely many states and two actions. Following the literature on POMDPs [19]–[21], a belief vector π = (π(1), π(2), π(3), π(4)) is maintained for each item, where states 1, ..., 4 correspond to L, M, H, V, π(i) is the probability that user interest for the item is in state i, and Σ_{i∈S} π(i) = 1. The immediate expected reward to the RS from play of an item with belief π is ρ(π) = ρ_L π(L) + ρ_M π(M) + ρ_H π(H) + ρ_V π(V).

When an item is not played, some other item is played to the user; in this way the items compete at the RS for each time slot. The user interest state evolution of each item depends on whether that item is played or not. The RS is therefore an example of a restless multi-armed bandit problem (RMAB) [12]. Suppose there are N independent items, each with the same number of states. After each play of an item, a unit reward is obtained by the RS based on the user click. Further, the RS can play only one arm (item) at each time instant. The objective of the RS is to maximize the long term discounted cumulative reward (the sum of cumulative rewards from plays of all items over the long term) subject to the constraint that only one item is played at a time. Because the RS does not observe the state of user interest for each item at each time step, we refer to this as a hidden Markov RMAB [22]. This is a constrained optimization problem and the items are coupled due to the integer constraint on the RS; it is also called a weakly coupled POMDP.
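To make the belief-state model concrete, the following minimal Python sketch (not from the paper; all numerical values are illustrative placeholders) represents a single item as a four-state arm with a play and a not-play transition matrix, state-dependent click probabilities ρ, and the immediate expected reward ρ(π).

```python
import numpy as np

# Illustrative four-state item model: states L, M, H, V (indices 0..3).
# All numbers below are placeholders, not the paper's parameters.
rho = np.array([0.1, 0.3, 0.6, 0.9])   # click (like) probability per state

P_play = np.array([                     # transition matrix when the item is played
    [0.50, 0.30, 0.15, 0.05],
    [0.20, 0.40, 0.30, 0.10],
    [0.10, 0.20, 0.40, 0.30],
    [0.05, 0.15, 0.30, 0.50],
])
P_rest = np.array([                     # common transition matrix when not played
    [0.70, 0.20, 0.07, 0.03],
    [0.40, 0.40, 0.15, 0.05],
    [0.30, 0.40, 0.20, 0.10],
    [0.20, 0.40, 0.25, 0.15],
])

def expected_reward(belief: np.ndarray) -> float:
    """Immediate expected reward rho(pi) = sum_i pi(i) * rho_i from playing the item."""
    return float(belief @ rho)

# Example: uniform belief over the four interest states.
pi0 = np.full(4, 0.25)
print(expected_reward(pi0))
```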
III. SOLUTION APPROACH
We discuss the following solution approaches: the myopic policy, the Monte Carlo rollout policy and the Whittle index policy. We first describe the belief update rule for an item. After a play of an item (when a like is observed), the belief is

π_{t+1}(l) = ( Σ_{i∈S} π_t(i) p_{i,l} ρ_i ) / ( Σ_{j∈S} Σ_{i∈S} π_t(i) p_{i,j} ρ_i ),

and π_{t+1} = (π_{t+1}(L), π_{t+1}(M), π_{t+1}(H), π_{t+1}(V)); an analogous update with 1 − ρ_i in place of ρ_i applies when a dislike is observed. Here, π_t is the belief vector at time t and P = [[p_{i,j}]] is the transition probability matrix of the item under the corresponding action. When the item is not played, no signal is observed and hence the posterior belief is π_{t+1} = π_t P.
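Continuing the illustrative sketch above (an assumption-laden sketch, not the paper's code), the update after a play is a Bayes step that weights the current belief by the observation likelihood and then propagates it through the play matrix, while a not-played item simply follows π_{t+1} = π_t P:

```python
def belief_update_play(belief, P_play, rho, liked: bool):
    """Bayes update of the belief after playing the item and observing like/dislike."""
    obs_prob = rho if liked else (1.0 - rho)   # P(observation | state)
    joint = (belief * obs_prob) @ P_play       # weight by likelihood, then transition
    return joint / joint.sum()                 # normalize to a probability vector

def belief_update_rest(belief, P_rest):
    """No observation when the item is not played: just propagate through P_rest."""
    return belief @ P_rest
```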
A. Myopic Policy

This is the simplest policy for RMAB with hidden states. In any given slot the item with the highest immediate expected payoff is played. Let π_{j,t} be the belief vector for item j at time t. A unit reward is obtained from playing item j depending on its state, with probability ρ_j = [ρ_{L,j}, ρ_{M,j}, ρ_{H,j}, ρ_{V,j}]. Thus, the immediate expected payoff from play of item j is Σ_{i∈S} π_{j,t}(i) ρ_{i,j}. The myopic policy plays the item

j* = arg max_{1 ≤ j ≤ N} Σ_{i∈S} π_{j,t}(i) ρ_{i,j}.
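A minimal sketch of the myopic rule over N items, reusing the illustrative helpers above; `beliefs` and `rhos` are hypothetical lists of per-item belief vectors and click-probability vectors:

```python
def myopic_arm(beliefs, rhos):
    """Play the item with the largest immediate expected payoff sum_i pi_j(i) * rho_{i,j}."""
    payoffs = [b @ r for b, r in zip(beliefs, rhos)]
    return int(np.argmax(payoffs))
```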
B. Look-Ahead Policy Using the Monte Carlo Method

We study the Monte Carlo rollout policy. L trajectories are simulated for a fixed horizon length H, using a known transition and reward model. Along each trajectory, a fixed policy φ is employed, according to which one item is played at each time step. The information obtained from a single trajectory l up to horizon length H is

{π_{j,t,l}, a_{j,t,l}, R^φ_{j,t,l}} for j = 1, ..., N and t = 1, ..., H,    (1)

under policy φ. The value estimate of trajectory l starting from belief state π = (π_1, ..., π_N) for the N items, with initial action a ∈ {1, 2, ..., N}, is

Q^φ_{H,l}(π, a) = Σ_{h=1}^{H} β^{h−1} R^φ_{h,l} = Σ_{h=1}^{H} β^{h−1} r(π_{h,l}, a_{h,l}, φ).

Then, the value estimate for state π and action a over L trajectories under policy φ is

Q̃^φ_{H,L}(π, a) = (1/L) Σ_{l=1}^{L} Q^φ_{H,l}(π, a).

Here, the policy φ can be a uniform random policy or the myopic (greedy) policy, implemented along each trajectory. Next, a one-step policy improvement is performed, and the action is selected according to the following rule:

j*(π) = arg max_{1 ≤ j ≤ N} [ r(π, a = j) + β Q̃^φ_{H,L}(π, a = j) ].    (2)

In each time step, an item is played based on the above rule. A detailed discussion of the rollout policy for RMAB is given in [18].
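The sketch below illustrates the rollout idea under the assumptions of the toy model above: estimate Q̃ by averaging L simulated trajectories that play a candidate arm first and then follow the myopic base policy, and pick the arm with the best estimate. For simplicity this sketch folds the immediate reward into the trajectory sum rather than adding the r(π, a = j) term of (2) separately; it is an illustration, not the implementation used in the paper.

```python
def rollout_value(beliefs, rhos, P_plays, P_rest, first_arm, base_policy,
                  horizon=5, n_traj=100, beta=0.95, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the rollout value: play `first_arm` now, then follow
    `base_policy` (e.g. myopic_arm) for `horizon` steps, averaged over `n_traj` trajectories."""
    total = 0.0
    for _ in range(n_traj):
        bel = [b.copy() for b in beliefs]
        arm, value = first_arm, 0.0
        for h in range(horizon):
            # expected one-step reward from playing `arm`, then simulate the click
            p_like = float(bel[arm] @ rhos[arm])
            value += (beta ** h) * p_like
            liked = rng.random() < p_like
            # update all beliefs: Bayes update for the played arm, passive update otherwise
            for j in range(len(bel)):
                if j == arm:
                    bel[j] = belief_update_play(bel[j], P_plays[j], rhos[j], liked)
                else:
                    bel[j] = belief_update_rest(bel[j], P_rest)
            arm = base_policy(bel, rhos)
        total += value
    return total / n_traj

def rollout_arm(beliefs, rhos, P_plays, P_rest, **kw):
    """One-step policy improvement: play the arm with the best rollout value estimate."""
    scores = [rollout_value(beliefs, rhos, P_plays, P_rest, a, myopic_arm, **kw)
              for a in range(len(beliefs))]
    return int(np.argmax(scores))
```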
C. Whittle-Index Policy

Another popular approach for RMAB (and weakly coupled POMDPs) is the Whittle-index policy [12], where the constrained optimization problem is solved by relaxing the integer constraint. The problem is transformed into an optimization problem with a discounted constraint. Then, using a Lagrangian technique, one decouples the relaxed constrained RMAB problem into N single-armed restless bandit problems. In a single-armed restless bandit (SARB) problem, a subsidy W for not playing the item is introduced. A SARB with hidden states is an example of a POMDP with a two-action model [22]. To use the Whittle index policy, one has to study the structural properties of the SARB, show the existence of a threshold policy and indexability for each item, and compute the indices of all items. In each time step, the item with the highest index is played.

In our model, it is very difficult to claim indexability and obtain a closed-form index formula. The indexability condition requires us to show threshold-policy behavior for each item. In [21], the authors have shown the existence of a threshold policy for a specialized POMDP model. For such a specialized model, it is possible to show indexability (details omitted) and to use a Monte Carlo based index computation algorithm, see [18, Section IV]. Note that this algorithm is computationally expensive and time consuming, because the Monte Carlo algorithm has to run for each restless bandit until its value function converges.
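As a rough illustration only (this is not the algorithm of [18]; it assumes indexability and a simple one-step-greedy policy inside the single-armed problem), the Whittle index at a given belief can be approximated by bisecting on the subsidy W until playing and resting look equally attractive under a Monte Carlo evaluation:

```python
def sarb_value(belief, P_play, P_rest, rho, W, play_first,
               horizon=20, n_traj=50, beta=0.95, rng=np.random.default_rng(2)):
    """Monte Carlo value estimate of a single-armed restless bandit with subsidy W for
    not playing, under a one-step-greedy policy after the (forced) first action."""
    total = 0.0
    for _ in range(n_traj):
        b, value = belief.copy(), 0.0
        for h in range(horizon):
            play = play_first if h == 0 else (float(b @ rho) >= W)
            if play:
                p_like = float(b @ rho)
                value += (beta ** h) * p_like
                liked = rng.random() < p_like
                b = belief_update_play(b, P_play, rho, liked)
            else:
                value += (beta ** h) * W
                b = belief_update_rest(b, P_rest)
        total += value
    return total / n_traj

def whittle_index(belief, P_play, P_rest, rho, lo=0.0, hi=1.0, iters=20):
    """Bisect on the subsidy W until 'play now' and 'rest now' are roughly indifferent.
    Assumes indexability, i.e. the preference for playing is monotone in W."""
    for _ in range(iters):
        W = 0.5 * (lo + hi)
        play_val = sarb_value(belief, P_play, P_rest, rho, W, True)
        rest_val = sarb_value(belief, P_play, P_rest, rho, W, False)
        if play_val > rest_val:
            lo = W   # playing still preferred, so the index exceeds W
        else:
            hi = W
    return 0.5 * (lo + hi)
```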
IV. NUMERICAL RESULTS FOR THE MODEL

We now present numerical examples that illustrate the performance of the myopic policy and the Monte Carlo rollout policy. In the first example we observe that the myopic policy performs better than the MC rollout policy under some structural assumptions on the transition and reward probabilities. In the second example there are no structural assumptions on the transition probabilities; here the MC rollout policy performs better than the myopic policy.

1) Example 1: In this example, we introduce structure on the transition probability matrices of the items, as in [21]. When an item is played, the user interest evolves according to a different transition matrix for each item. For items that are not played, the user interest evolves according to a common transition matrix. We use the following parameter set: the number of items is N = 5, the number of states is S = 4, and the discount factor is β. The transition probability matrix of item j when it is played is denoted by P_j, j = 1, ..., N; when an item is not played, the transition probability matrix is P_0 and it is the same for all items. (The numerical entries of P_0, P_1, ..., P_5, the reward vector ρ, the initial belief π_0, and the initial item states X are not reproduced here.) From Fig. 1, we find that the myopic policy performs better than the Monte Carlo rollout policy. In the Monte Carlo rollout policy, we used H = 5 and L = 100.
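For reference, a minimal experiment-loop sketch in the same illustrative setting (parameters, helper names and the random seed are assumptions, not the paper's): the hidden states evolve while each policy sees only the beliefs, and the discounted cumulative reward of the kind plotted in Fig. 1 is accumulated.

```python
def run_policy(policy, P_plays, P_rest, rhos, pi0, x0, steps=200, beta=0.95,
               rng=np.random.default_rng(1)):
    """Simulate one run: the hidden states evolve, but the policy sees only beliefs."""
    beliefs = [b.copy() for b in pi0]
    states = list(x0)
    discounted = 0.0
    for t in range(steps):
        arm = policy(beliefs, rhos)
        liked = rng.random() < rhos[arm][states[arm]]   # user clicks like w.p. rho_i
        discounted += (beta ** t) * float(liked)
        for j in range(len(states)):
            P = P_plays[j] if j == arm else P_rest
            states[j] = int(rng.choice(len(rhos[j]), p=P[states[j]]))
            beliefs[j] = (belief_update_play(beliefs[j], P_plays[j], rhos[j], liked)
                          if j == arm else belief_update_rest(beliefs[j], P_rest))
    return discounted
```

For example, `run_policy(myopic_arm, ...)` can be compared against `run_policy(lambda b, r: rollout_arm(b, r, P_plays, P_rest), ...)` over several runs to reproduce this kind of comparison.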
2) Example 2: We consider general transition probability matrices for each action, with no structural assumptions. Hence, we do not have a stochastic dominance condition on the transition probability matrix of each item. When an item is played, the user interest evolves according to a different transition matrix for each item, but for items that are not played, the user interest evolves according to a common transition matrix. We use the following parameter set: the number of items is N = 5, the number of states is S = 4, and the discount factor is β. The transition probability matrix of item j when it is played is denoted by P_j, j = 1, ..., N.
Fig. 1. Comparison of myopic policy and Monte Carlo rollout policy for Example 1 with H = 5.

When an item is not played, the transition probability matrix is P_0 and it is the same for all items. (The numerical entries of P_0, P_1, ..., P_5 and of the reward vector ρ are not reproduced here.) The initial belief vector and initial states are the same as in Example 1. We compare the expected discounted cumulative reward under the myopic and Monte Carlo rollout policies in Fig. 2. For the Monte Carlo rollout policy we use trajectory length H = 5 and number of trajectories L = 100. With the myopic policy, two particular items are played most frequently, whereas with the MC rollout policy a single item is played most frequently.
3) Example 3: In this example we use the same transition probability matrices as in Example 1 when an item is played, but when an item is not played, the user interest transitions to a fixed state. This is different from Example 1, and the reward vector is also different. We use the same initial belief and initial states as in Example 1, and discount parameter β. (The numerical entries of the transition matrices and the reward vector are not reproduced here.)

Fig. 2. Comparison of myopic policy and Monte Carlo rollout policy for Example 2 with H = 5.

Fig. 3. Comparison of myopic policy, Monte Carlo rollout policy and Whittle index policy for Example 3 with H = 5.

We observe from Fig. 3 that the myopic policy and the Monte Carlo rollout policy perform better than the Whittle index policy. This may be due to the approximation used for index computation, the lack of an explicit index formula, or the structure of the problem. As we stated earlier, the index policy is more computationally
expensive than the Monte Carlo rollout policy and the myopic policy when there is no explicit closed-form index formula, as in the case of hidden Markov bandits. In such examples, the Monte Carlo rollout policy is a good alternative.

A. Example 4

In this example we consider a different number of items and states. We compare only the myopic policy and the Monte Carlo rollout policy, and we do not assume a monotonicity structure on the transition matrices. The comparison is illustrated in Fig. 4: we observe that the Monte Carlo rollout policy performs better than the myopic policy. In the Monte Carlo rollout policy, we use H = 5 and L = 100.

Fig. 4. Comparison of myopic policy and Monte Carlo rollout policy for Example 4 with H = 5.
V. CONCLUSIONS
We have studied an online recommendation system problem using hidden Markov RMAB and provided numerical results for the Monte Carlo rollout policy and the myopic policy. We observed that the Monte Carlo rollout policy performs better for arbitrary transition dynamics, and that the myopic policy performs better than Monte Carlo rollout whenever structure is imposed on the state transition dynamics. We also presented the performance of the Whittle index policy and compared it with the Monte Carlo rollout policy for a specialized model.

The objective of this paper was to describe a new Monte Carlo rollout algorithm for RS with a Markov model. We have demonstrated the performance of the algorithm on small scale examples. This study can be extended to larger scale examples, e.g., with the number of items up to a few hundred. Regarding scalability, even though an RS might have millions of items in its database, it may only recommend items from a small subset, considering the cognitive limitations of humans and the problem of information overload.
REFERENCES

[1] R. Meshram, A. Gopalan, and D. Manjunath, “Restless bandits that hide their hand and recommendation systems,” in Proc. IEEE COMSNETS, 2017.
[2] C. C. Aggarwal, Recommender Systems, Springer, 2016.
[3] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl, “GroupLens: Applying collaborative filtering to Usenet news,” Communications of the ACM, 1997.
[4] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proc. WWW, 2001, pp. 285–295.
[5] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” IEEE Computer, 2009.
[6] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, “Collaborative filtering recommender systems,” The Adaptive Web, Lecture Notes in Computer Science, 2007.
[7] X. He, L. Liao, H. Zhang, L. Nie, X. Liu, and T. Chua, “Neural collaborative filtering,” arXiv, 2017.
[8] E. Candes and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, May 2010.
[9] J. Langford and T. Zhang, “The epoch-greedy algorithm for contextual multi-armed bandits,” in Proc. NIPS, 2007.
[10] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proc. ACM WWW, 2010.
[11] D. Glowacka, Bandit Algorithms in Information Retrieval, Foundations and Trends in Information Retrieval, NOW, 2019.
[12] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988.
[13] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed Bandit Allocation Indices, John Wiley and Sons, New York, 2nd edition, 2011.
[14] C. H. Papadimitriou and J. H. Tsitsiklis, “The complexity of optimal queueing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, May 1999.
[15] G. Tesauro and G. R. Galperin, “On-line policy improvement using Monte-Carlo search,” in Proc. NIPS, 1996, pp. 1–7.
[16] H. S. Chang, R. Givan, and E. K. P. Chong, “Parallel rollout for online solution of partially observable Markov decision processes,” Discrete Event Dynamic Systems, vol. 14, pp. 309–341, 2004.
[17] D. P. Bertsekas, Distributed Reinforcement Learning, Rollout, and Approximate Policy Iteration, Athena Scientific, 2020.
[18] R. Meshram and K. Kaza, “Simulation based algorithms for Markov decision processes and multi-action restless bandits,” arXiv, 2020.
[19] R. D. Smallwood and E. J. Sondik, “The optimal control of partially observable processes over a finite horizon,” Operations Research, vol. 21, no. 5, pp. 1019–1175, Sept.–Oct. 1973.
[20] E. J. Sondik, “The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs,” Operations Research, vol. 26, no. 2, pp. 282–304, March–April 1978.
[21] W. S. Lovejoy, “Some monotonicity results for partially observed Markov decision processes,” Operations Research, vol. 35, no. 5, pp. 736–743, October 1987.
[22] R. Meshram, D. Manjunath, and A. Gopalan, “On the Whittle index for restless multi-armed hidden Markov bandits.”