Reinforcement Learning for Strategic Recommendations
Georgios Theocharous, Yash Chandak, Philip S. Thomas, Frits de Nijs
Georgios Theocharous, Adobe Research ([email protected])
Yash Chandak, University of Massachusetts Amherst ([email protected])
Philip S. Thomas, University of Massachusetts Amherst ([email protected])
Frits de Nijs, Monash University ([email protected])
September 17, 2020

Abstract
Strategic recommendations (SR) refer to the problem where an intelligent agent observes the sequential behaviors and activities of users and decides when and how to interact with them to optimize some long-term objectives, both for the user and the business. These systems are in their infancy in the industry and in need of practical solutions to some fundamental research challenges. At Adobe Research, we have been implementing such systems for various use-cases, including points of interest recommendations, tutorial recommendations, next step guidance in multi-media editing software, and ad recommendation for optimizing lifetime value. There are many research challenges when building these systems, such as modeling the sequential behavior of users, deciding when to intervene and offer recommendations without annoying the user, evaluating policies offline with high confidence, safe deployment, non-stationarity, building systems from passive data that do not contain past recommendations, resource constraint optimization in multi-user systems, scaling to large and dynamic action spaces, and handling and incorporating human cognitive biases. In this paper we cover various use-cases and research challenges we solved to make these systems practical.

Keywords: Reinforcement Learning · Recommendations
In strategic recommendation (SR) systems, the goal is to learn a strategy that sequentially selects recommendations with the highest long-term acceptance by each visiting user of a retail website, a business, or a user-interactive system in general. These systems are in their infancy in the industry and in need of practical solutions to some fundamental research challenges. At Adobe Research, we have been implementing such SR systems for various use-cases, including points of interest recommendations, tutorial recommendations, next step guidance in multi-media editing software, and ad recommendation for optimizing lifetime value.

Most recommendation systems today use supervised learning or contextual bandit algorithms. These algorithms assume that visits are i.i.d. and do not discriminate between a visit and a user, i.e., each visit is treated as a new user that has been sampled i.i.d. from the population of the business's users. As a result, these algorithms are myopic and do not try to optimize the long-term effect of the recommendations on the users. Click through rate (CTR) is a suitable metric for evaluating the performance of such greedy algorithms. Despite their success, these methods are becoming insufficient as users establish longer and longer-term relationships with their websites (by going back to them). This increase in returning users further violates the main assumption underlying supervised learning and bandit algorithms, namely that there is no difference between a visit and a user. This is the main motivation for the SR systems that we examine in this paper.
Reinforcement learning (RL) algorithms that aim to optimize the long-term performance of the system (often formulated as the expected sum of rewards/costs) seem to be suitable candidates for SR systems. The nature of these algorithms allows them to take into account all the available knowledge about the user in order to select an offer or recommendation that maximizes the total number of times she will click on or accept recommendations over multiple visits, also known as the user's life-time value (LTV). Unlike myopic approaches, RL algorithms differentiate between a visit and a user, and consider all the visits of a user (in chronological order) as a system trajectory. Thus, they model the users, and not their visits, as i.i.d. samples from the population of the users of the website. This means that although we may evaluate the performance of RL algorithms using CTR, this is not the metric that they optimize; it would therefore be more appropriate to evaluate them based on the expected total number of clicks per user (over the user's trajectory), a metric we call LTV. This long-term approach to SR systems allows us to make decisions that are better than the short-sighted decisions made by greedy algorithms. Such decisions might propose an offer that is considered a loss to the business in the short term, but increases user loyalty and engagement in the long term.

Using RL for LTV optimization is still in its infancy. Related work has experimented with toy examples and has appeared mostly in marketing venues [50, 30, 73]. An approach directly related to ours first appeared in [49], where the authors used public data from an email charity campaign, batch RL algorithms, and heuristic simulators for evaluation, and showed that RL policies produce better results than myopic ones. Another is [57], which proposed an online RL system that learns concurrently from multiple customers; the system was trained and tested on a simulator. A recent approach uses RL to optimize LTV for slate recommendations [29]. It addresses the problem of how to decompose the LTV of a slate into a tractable function of its component item-wise LTVs. Unlike most previous work, we address many more of the challenges that arise when dealing with real data. These challenges, which hinder the widespread application of RL technology to SR systems, include:

• High confidence off-policy evaluation refers to the problem of evaluating the performance of an SR system with high confidence before costly A/B testing or deployment.

• Safe deployment refers to the problem of deploying a policy without creating disruption relative to the previously running policy. For example, we should never deploy a policy that has a worse LTV than the previous one.

• Non-stationarity refers to the fact that the real world is non-stationary. RL and Markov decision processes usually assume that the transition dynamics and rewards are stationary over time. This assumption is often violated in the marketing world, where trends and seasonality are always at play.

• Learning from passive data refers to the fact that there is usually an abundance of sequential data or events that have been collected without a recommendation system in place. For example, websites record the sequence of products and pages a user views. In RL, data is usually in the form of sequences of states, actions and rewards. The question is how we can leverage passive data that do not contain actions to create a recommendation system that recommends the next page or product.
• Recommendation acceptance factors refers to the problem of understanding recommendation acceptance more deeply than simply predicting clicks. For example, a person might have a low propensity to listen due to various reasons of inattentive disposition. A classic problem is 'recommendation fatigue', where people may quickly stop paying attention to recommendations such as ads and promotional emails if they are presented too often.

• Resource constraints in multi-user systems refers to the problem of constraints created in multi-user recommendation systems. For example, if multiple users in a theme park are offered the same strategy for touring the park, it could overcrowd various attractions. Or, if a store offers the same deal to all users, it might deplete a specific resource.

• Large action spaces refers to the problem of having too many recommendations. Netflix, for example, employs thousands of movie recommendations. This is particularly challenging for SR systems that make a sequence of decisions, since the search space grows exponentially with the planning horizon (the number of decisions made in a sequence).

• Dynamic actions refers to the problem where the set of recommendations may change over time. This is a classic problem in marketing, where the offers made at a retail shop may well be slowly changing over time. Another example is movie recommendation in businesses such as Netflix, where the catalogue of movies evolves over time.

In this paper we address all of the above research challenges. We summarize, in chronological order, our work on making SR systems practical for the real world. In Section 3 we present a method for evaluating SR systems offline with high confidence. In Section 4 we present a practical reinforcement learning (RL) algorithm for implementing an SR system, with an application to ad offers. In Section 5 we present an algorithm for safely deploying an SR system.
In Section 6 we tackle the problem of non-stationarity. The technologies in Sections 3, 4, 5 and 6 were built chronologically, and high confidence off-policy evaluation is leveraged across all of them.

In Section 7 we address the problem of bootstrapping an SR system from passive sequential data that do not contain past recommendations. In Section 8 we examine recommendation acceptance factors, such as the 'propensity to listen' and 'recommendation fatigue'. In Section 9 we describe a solution that can optimize for resource constraints in multi-user SR systems. The technologies in Sections 7, 8 and 9 were built chronologically, and the bootstrapping from passive data is used across all of them.

In Section 10 we describe a solution to the large action space problem in SR systems. In Section 11 we describe a solution to the dynamic action problem, where the available actions can vary over time. Sections 10 and 11 were built chronologically and use the same action embedding technology.

Finally, in Section 12 we argue that the next generation of recommendation systems needs to incorporate human cognitive biases.
In this section, we present the general set of notations, which will be useful throughout the paper. In cases where problem-specific notations are required, we introduce them in the respective section.

We model SR systems as
Markov decision processes (MDPs) [60]. An MDP is represented by a tuple, M = (S, A, P, R, γ, d), where S is the set of all possible states, called the state set, and A is a finite set of actions, called the action set. The random variables S_t ∈ S, A_t ∈ A, and R_t denote the state, action, and reward at time t. The first state comes from an initial distribution, d. The reward discounting parameter is given by γ ∈ [0, 1]. P is the state transition function. We denote by s_t the feature vector describing a user's t-th visit to the system and by a_t the t-th recommendation shown to the user, and refer to them as a state and an action. The rewards are assumed to be non-negative: the reward r_t is 1 if the user accepts the recommendation a_t and 0 otherwise. We assume that the users interact at most T times. We write τ := {s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T} to denote the history of visits of one user, and we call τ a trajectory. The return of a trajectory is the discounted sum of rewards, R(τ) := \sum_{t=1}^{T} \gamma^{t-1} r_t. A policy π is used to determine the probability of showing each recommendation. Let π(a|s) denote the probability of taking action a in state s, regardless of the time step t. The goal is to find a policy that maximizes the expected total number of recommendation acceptances per user: ρ(π) := E[R(τ) | π]. Our historical data is a set of trajectories, one per user. Formally, D is the historical data containing n trajectories {τ_i}_{i=1}^{n}, each labeled with the behavior policy π_i that produced it.

One of the first challenges in building SR systems is evaluating their performance before costly A/B testing and deployment. Unlike classical machine learning systems, an SR system is more complicated to evaluate because recommendations can affect how a user responds to all future recommendations. In this section we summarize a high confidence off-policy evaluation (HCOPE) method [68], which can inform the business manager of the performance of the SR system with some guarantee, before the system is deployed. We denote the policy to be evaluated as the evaluation policy π_e.

HCOPE is a family of methods that use the historical data D in order to compute a 1 − δ confidence lower bound on the expected performance of the evaluation policy π_e [68]. In this section, we explain three different approaches to HCOPE. All of these approaches are based on importance sampling. The importance sampling estimator

\hat{\rho}(\pi_e \mid \tau_i, \pi_i) := \underbrace{R(\tau_i)}_{\text{return}} \; \underbrace{\prod_{t=1}^{T} \frac{\pi_e(a^{\tau_i}_t \mid s^{\tau_i}_t)}{\pi_i(a^{\tau_i}_t \mid s^{\tau_i}_t)}}_{\text{importance weight}},    (1)

is an unbiased estimator of ρ(π_e) if τ_i is generated using policy π_i [53] and the support of π_e is a subset of the support of π_i, where a^{τ_i}_t and s^{τ_i}_t denote the action and state at step t of trajectory τ_i. Although the importance sampling estimator is conceptually easier to understand, in most of our applications we use the per-step importance sampling estimator

\hat{\rho}(\pi_e \mid \tau_i, \pi_i) := \sum_{t=1}^{T} \gamma^{t-1} r_t \left( \prod_{j=1}^{t} \frac{\pi_e(a^{\tau_i}_j \mid s^{\tau_i}_j)}{\pi_i(a^{\tau_i}_j \mid s^{\tau_i}_j)} \right),    (2)

where the term in the parentheses is the importance weight for the reward generated at time t. This estimator has lower variance than (1), and remains unbiased.
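Concretely, the per-step estimator in (2) can be computed directly from logged trajectories. The following is a minimal sketch, assuming each trajectory stores, for every step, the reward and the action probabilities under the evaluation and behavior policies; the data layout and variable names are ours, not those of the original system.

import numpy as np

def per_step_is_estimate(trajectory, gamma=1.0):
    """Per-step importance sampling estimate of rho(pi_e) from one trajectory.

    `trajectory` is a list of steps; each step is a dict with keys
    'r' (reward), 'p_e' (pi_e(a|s)) and 'p_b' (behavior probability pi_i(a|s)).
    """
    estimate, weight = 0.0, 1.0
    for t, step in enumerate(trajectory):
        weight *= step['p_e'] / step['p_b']    # cumulative importance weight up to step t
        estimate += (gamma ** t) * step['r'] * weight
    return estimate

def importance_weighted_returns(trajectories, gamma=1.0):
    """One non-negative estimate per trajectory; these are the X_i fed to the HCOPE bounds."""
    return np.array([per_step_is_estimate(tau, gamma) for tau in trajectories])

Each trajectory yields one importance weighted return, and the collection of these values is what the three lower-bound approaches described next operate on.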
Figure 1: Empirical error rates when estimating a confidence lower bound on the mean of a gamma distribution (shape parameter k = 2 and scale parameter θ = 50) using ρ^†_-, where the legend specifies the value of †. This gamma distribution has a heavy upper tail, similar to that of importance weighted returns. The logarithmically scaled horizontal axis is the number of samples used to compute the lower bound and the vertical axis is the mean empirical error rate across trials. Note that CI is overly conservative, with zero error in all the trials (it is on the x-axis). The t-test is initially conservative, but approaches the allowed error rate as the number of samples increases. BCa remains around the correct error rate regardless of the number of samples.

For brevity, we describe the approaches to HCOPE in terms of a set of non-negative independent random variables, X = {X_i}_{i=1}^{n} (note that the importance weighted returns are non-negative because the rewards are never negative, since in our applications the reward is 1 when the user accepts a recommendation and 0 otherwise). For our applications, we will use X_i = \hat{ρ}(π_e | τ_i, π_i), where \hat{ρ}(π_e | τ_i, π_i) is computed by either (1) or (2). The three approaches that we will use are:
1. Concentration Inequality: Here we use the concentration inequality (CI) in [68] and call it the CI approach. We write ρ^CI_-(X, δ) to denote the 1 − δ confidence lower bound produced by their method. The benefit of this method is that it provides a true high-confidence lower bound, i.e., it makes no false assumption or approximation, and so we refer to it as safe. However, because it makes no assumptions, bounds obtained using CI tend to be overly conservative, as shown in Figure 1.
2. Student's t-test: One way to tighten the lower bound produced by the CI approach is to introduce a false but reasonable assumption. Specifically, we leverage the central limit theorem, which says that \hat{X} := \frac{1}{n}\sum_{i=1}^{n} X_i is approximately normally distributed if n is large. Under the assumption that \hat{X} is normally distributed, we may apply the one-tailed Student's t-test to produce ρ^TT_-(X, δ), a 1 − δ confidence lower bound on E[\hat{X}], which in our application is a 1 − δ confidence lower bound on ρ(π_e). Unlike the other two approaches, this approach, which we call TT, requires little space to be formally defined, and so we present its formal specification:

\hat{X} := \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \hat{\sigma} := \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \hat{X}\right)^2},    (3)

\rho^{\text{TT}}_{-}(X, \delta) := \hat{X} - \frac{\hat{\sigma}}{\sqrt{n}} \, t_{1-\delta, n-1},    (4)

where t_{1−δ,ν} denotes the inverse of the cumulative distribution function of the Student's t distribution with ν degrees of freedom, evaluated at probability 1 − δ (i.e., the function tinv(1 − δ, ν) in MATLAB). Because ρ^TT_- is based on a false (albeit reasonable) assumption, we refer to it as semi-safe. Although the TT approach produces tighter lower bounds than CI's, it still tends to be overly conservative for our application, as shown in Figure 1. More discussion can be found in the work by [68].
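The TT bound in (3)–(4) takes only a few lines of code. The sketch below is ours, assuming the importance weighted returns have already been collected into an array (for example with the importance_weighted_returns helper above); SciPy's Student's t quantile plays the role of MATLAB's tinv.

import numpy as np
from scipy import stats

def ttest_lower_bound(X, delta=0.05):
    """1 - delta confidence lower bound on E[X] under a normality assumption (Eqs. 3-4)."""
    X = np.asarray(X, dtype=float)
    n = X.size
    mean = X.mean()
    sigma = X.std(ddof=1)                             # sample standard deviation, Eq. (3)
    t_quantile = stats.t.ppf(1.0 - delta, df=n - 1)   # inverse CDF of Student's t
    return mean - sigma / np.sqrt(n) * t_quantile     # Eq. (4)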
3. Bias Corrected and Accelerated Bootstrap: One way to correct for the overly conservative nature of TT is to use bootstrapping to estimate the true distribution of \hat{X}, and to then assume that this estimate is the true distribution of \hat{X}. The most popular such approach is the bias corrected and accelerated (BCa) bootstrap [18]. We write ρ^BCa_-(X, δ) to denote the lower bound produced by BCa, whose pseudocode can be found in [69]. While the bounds produced by BCa are reliable, like the t-test it may have error rates larger than δ and is thus semi-safe. An illustrative example is provided in Figure 1.

For SR systems, where ensuring the quality of a system before deployment is critical, these three approaches provide several viable ways of obtaining performance guarantees using only historical data.
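A BCa lower bound can be obtained with an off-the-shelf bootstrap routine. The sketch below is an assumption-laden stand-in for the pseudocode of [69]: it leans on SciPy's bootstrap (which implements BCa) and extracts a one-sided 1 − δ lower bound by taking the lower endpoint of a two-sided 1 − 2δ interval.

import numpy as np
from scipy import stats

def bca_lower_bound(X, delta=0.05, n_resamples=2000):
    """Approximate 1 - delta lower bound on E[X] via the BCa bootstrap."""
    X = np.asarray(X, dtype=float)
    res = stats.bootstrap(
        (X,), np.mean,
        confidence_level=1.0 - 2.0 * delta,   # two-sided interval; keep only its lower end
        n_resamples=n_resamples,
        method='BCa',
    )
    return res.confidence_interval.low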
The next question is how to compute a good SR policy. In this section we demonstrate how to compute an SR policy for personalized ad recommendation (PAR) systems using reinforcement learning (RL). RL algorithms take into account the long-term effect of actions, and thus are more suitable than myopic techniques, such as contextual bandits, for modern PAR systems in which the number of returning visitors is rapidly growing. However, while myopic techniques have been well-studied in PAR systems, the RL approach is still in its infancy, mainly due to two fundamental challenges: how to compute a good RL strategy and how to evaluate a solution using historical data to ensure its 'safety' before deployment. In this section, we use the family of off-policy evaluation techniques with statistical guarantees presented in Section 3 to tackle both of these challenges. We apply these methods to a real PAR problem, both for evaluating the final performance and for optimizing the parameters of the RL algorithm. Our results show that an RL algorithm equipped with these off-policy evaluation techniques outperforms the myopic approaches. Our results give fundamental insights into the difference between the click through rate (CTR) and life-time value (LTV) metrics for evaluating the performance of a PAR algorithm [64].
Any personalized ad recommendation (PAR) policy can be evaluated for its greedy/myopic or its long-term performance. For greedy performance, click through rate (CTR) is a reasonable metric, while life-time value (LTV) seems to be the right choice for long-term performance. These two metrics are formally defined as

CTR = (Total # of Clicks / Total # of Visits) × 100,    (5)

LTV = (Total # of Clicks / Total # of Visitors) × 100.    (6)

CTR is a well-established metric in digital advertising and can be estimated from historical data (off-policy) in unbiased (inverse propensity scoring; [38]) and biased (see e.g., [58]) ways. The reason that we use LTV is that CTR is not a good metric for evaluating long-term performance and can lead to misleading conclusions. Imagine a greedy advertising strategy at a website that directly displays an ad related to the final product that a user could buy. For example, it could be the BMW website and an ad that offers a discount to the user if she buys a car. Users who are presented such an offer would either take it right away or move away from the website. Now imagine another marketing strategy that aims to transition the user down a sales funnel before presenting her with the discount. For example, at the BMW website one could first be presented with an attractive finance offer and a great service department deal before the final discount is presented. Such a long-term strategy would incur more visits from the user and would eventually produce more clicks per user and more purchases. The crucial insight here is that the policy can change the number of times that a user will be shown an advertisement; the length of a trajectory depends on the actions that are chosen. A visualization of this concept is presented in Figure 2.

For greedy optimization, we used a random forest (RF) algorithm [11] to learn a mapping from features to actions. RF is a state-of-the-art ensemble learning method for regression and classification, which is relatively robust to overfitting and is often used in industry for big data problems. The system is trained using one RF per offer/action to predict the immediate reward. During execution, we use an ε-greedy strategy, where we choose the offer whose RF has the highest predicted value with probability 1 − ε, and each of the remaining offers with probability ε/(|A| − 1).

For LTV optimization, we used the Fitted Q Iteration (FQI) [20] algorithm with an RF function approximator, which allows us to handle high-dimensional continuous and discrete variables. When an arbitrary function approximator is used in the FQI algorithm, it does not converge monotonically, but rather oscillates during training iterations. To alleviate the oscillation problem of FQI and for better feature selection, we used our high confidence off-policy evaluation (HCOPE) framework within the training loop. The loop keeps track of the best FQI result according to a validation data set (see Algorithm 1).

For both algorithms we start with three data sets: X_train, X_val and X_test. Each one is made of complete user trajectories, and a user appears in only one of these files. X_val and X_test contain users that have been served by the random policy. The greedy approach proceeds by first doing feature selection on X_train, training a random forest, turning the policy into ε-greedy on X_test, and then evaluating that policy using the off-policy evaluation techniques. The LTV approach starts from the random forest model of the greedy approach.
It then computes labels as shown in step 6 of the LTV optimization algorithm (Algorithm 1).
Figure 2: The circles indicate user visits and the black circles indicate clicks (Policy 1: CTR = 0.5, LTV = 0.5; Policy 2: CTR = 6/17 ≈ 0.35, LTV = 6/4 = 1.5). Policy 1 is greedy and users do not return. Policy 2 optimizes for the long run; users come back multiple times and click towards the end. Even though Policy 2 has a lower CTR than Policy 1, it results in more revenue, as captured by the higher LTV. Hence, LTV is potentially a better metric than CTR for evaluating ad recommendation policies.

It does feature selection, trains a random forest model, and then turns the policy into ε-greedy on the X_val data set. The policy is tested using the importance weighted returns according to Equation 2. LTV optimization loops over a fixed number of iterations and keeps track of the best performing policy, which is finally evaluated on X_test. The final outputs are 'risk plots', which are graphs that show the lower bound of the expected sum of discounted rewards of the policy for different confidence values.

Algorithm 1 LTV OPTIMIZATION(X_train, X_val, X_test, δ, K, γ, ε): compute an LTV strategy using X_train, and predict the 1 − δ lower bound on the test data X_test.
1: π_b = randomPolicy
2: Q = RF.GREEDY(X_train, X_test, δ)   {start with the greedy value function}
3: for i = 1 to K do
4:   r = X_train(reward)   {use recurrent visits}
5:   x = X_train(features)
6:   y = r_t + γ max_{a∈A} Q_a(x_{t+1})
7:   x̄ = informationGain(x, y)   {feature selection}
8:   Q_a = randomForest(x̄, y)   {for each action}
9:   π_e = epsilonGreedy(Q, X_val)
10:  W = ρ̂(π_e | X_val, π_b)   {importance weighted returns}
11:  currBound = ρ^†_-(W, δ)
12:  if currBound > prevBound then
13:    prevBound = currBound
14:    Q_best = Q
15:  end if
16: end for
17: π_e = epsilonGreedy(Q_best, X_test)
18: W = ρ̂(π_e | X_test, π_b)
19: return ρ^†_-(W, δ)   {lower bound}

For our experiments we used 2 data sets from the banking industry. On the bank website, when users visit, they are shown one of a finite number of offers. The reward is 1 when a user clicks on the offer and 0 otherwise. For data set 1, we collected data from a particular campaign of a bank for a month that had 7 offers and approximately , visits. About , of the visits were produced by a random strategy. For data set 2 we collected data from a different bank for a campaign that had 12 offers and , , visits, out of which , were produced by a random strategy. When a user visits the bank website for the first time, she is assigned either to a random strategy or to a targeting strategy for the rest of the campaign's life-time. We split the random strategy data into a test set and a validation set. We used the targeting data for training to optimize the greedy and LTV strategies.
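A compact way to view Algorithm 1 is as Fitted Q Iteration with one random forest per offer, wrapped in an off-policy safety check. The sketch below is our simplified, hypothetical rendering of that loop using scikit-learn regressors; the transition layout, helper names and defaults are ours, feature selection is omitted, and the caller is expected to score each yielded iterate with one of the HCOPE bounds and keep the best, as Algorithm 1 does.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fqi_random_forest(visits, n_actions, K=10, gamma=0.9):
    """Fitted Q Iteration with one random forest per action (core of Algorithm 1).

    `visits` is a list of transitions (x, a, r, x_next, is_last_visit) built from the
    recurrent visits of each user; assumes every action occurs at least once in the data.
    """
    X = np.array([v[0] for v in visits], dtype=float)
    A = np.array([v[1] for v in visits])
    R = np.array([v[2] for v in visits], dtype=float)
    X_next = np.array([v[3] for v in visits], dtype=float)
    last = np.array([v[4] for v in visits], dtype=float)

    Q = None
    for _ in range(K):
        if Q is None:
            y = R.copy()                                        # first pass: immediate reward (greedy)
        else:
            q_next = np.column_stack([m.predict(X_next) for m in Q])
            y = R + gamma * (1.0 - last) * q_next.max(axis=1)   # label computation, step 6 of Algorithm 1
        Q = [RandomForestRegressor(n_estimators=100).fit(X[A == a], y[A == a])
             for a in range(n_actions)]
        yield Q                                                 # caller evaluates each iterate off-policy

def epsilon_greedy_probs(Q, x, epsilon):
    """Action probabilities of the epsilon-greedy policy induced by the per-action models."""
    q_values = np.array([m.predict(x.reshape(1, -1))[0] for m in Q])
    probs = np.full(len(Q), epsilon / (len(Q) - 1))
    probs[int(np.argmax(q_values))] = 1.0 - epsilon
    return probs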
We used aggressive feature selection for the greedy strategy and selected of the features. For LTV, the feature selection had to be even more aggressive, due to the fact that the number of recurring visits is approximately . We used information gain for the feature selection module [72]. With our algorithms we produce performance results both for the CTR and LTV metrics. To produce results for CTR we assumed that each visit is a unique visitor. We performed various experiments to understand the different elements and parameters of our algorithms. For all experiments we set γ = 0. and ε = 0. .

Experiment 1: How do LTV and CTR compare?
For this experiment we show that every strategy has both a CTR and an LTV metric, as shown in Figure 3 (Left). In general the LTV metric gives higher numbers than the CTR metric. Estimating the LTV metric, however, gets harder as the trajectories get longer and as the mismatch with the behavior policy gets larger. In this experiment the policy we evaluated was the random policy, which is the same as the behavior policy, so in effect we eliminated the importance weighting factor.
Figure 3: (Left) This figure shows the bounds and empirical importance weighted returns for the random strategy. It shows that every strategy has both a CTR and an LTV metric. This was done for data set 1. (Right) This figure shows a comparison between the 3 different bounds. It was done for data set 2.
Experiment 2: How do the three bounds differ?
In this experiment we compared the 3 different lower-bound estimation methods, as shown in Figure 3 (Right). We observed that the bound for the t-test is tighter than that for CI, but it makes the false assumption that the importance weighted returns are normally distributed. We observed that the bound for BCa has higher confidence than the t-test approach for the same performance. The BCa bound does not make a Gaussian assumption, but it still makes the false assumption that the distribution of future empirical returns will be the same as what has been observed in the past.

Experiment 3: When should each of the two optimization algorithms be used?
In this experiment we observed that the GREEDY OPTIMIZATION algorithm performs the best under the CTR metric and the LTV OPTIMIZATION algorithm performs the best under the LTV metric, as expected; see Figure 4. The same claim holds for data set 2.
Experiment 4: What is the effect of ε? One of the limitations of our algorithm is that it requires stochastic policies. The closer the new policy is to the behavior policy, the easier it is to estimate its performance. Therefore, we approximate our policies with ε-greedy policies and use the random data for the behavior policy. The larger the ε, the easier it is to get an accurate estimate of the performance of a new policy, but at the same time we would be estimating the performance of a sub-optimal policy, which has moved closer to the random policy; see Figure 5. Therefore, when using the bounds to compare two policies, such as Greedy vs. LTV, one should use the same ε.

In the previous sections we described how to compute an SR policy in combination with high confidence off-policy evaluation for deployment with some guarantees. In the real world, deployment may need to happen incrementally, where at fixed intervals of time we would like to update the current SR policy in a safe manner. In this section we present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require expert tuning. Specifically, the user may select any performance lower bound, ρ−, and confidence level, δ, and our algorithm will ensure that the probability that it returns a policy with performance below ρ− is at most δ. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a digital marketing application that uses real world data [69].
Figure 4: (Left) This figure compares the CTR bounds of the Greedy versus the LTV optimization. It was done for data set 1, but similar graphs exist for data set 2. (Right) This figure compares the LTV bounds of the Greedy versus the LTV optimization. It was done for data set 1, but similar graphs exist for data set 2.
Figure 5: This figure shows that as epsilon gets larger the policy moves towards the random policy. The performance of random policies is easy to estimate, since they match the behavior policy exactly. Thus epsilon should be kept the same when comparing two policies. This experiment was done on data set 2 and shows the bounds and empirical mean importance weighted returns (vertical line) for the LTV policy. The bound used here was the CI.
Given a (user specified) lower bound, ρ−, on the performance and a confidence level, δ, we call an RL algorithm safe if it ensures that the probability that a policy with performance less than ρ− will be proposed is at most δ. The only assumption that a safe algorithm may make is that the underlying environment is a POMDP. Moreover, we require that the safety guarantee must hold regardless of how any hyperparameters are tuned.

We call an RL algorithm semi-safe if it would be safe, except that it makes a false but reasonable assumption. Semi-safe algorithms are of particular interest when the assumption that the environment is a POMDP is significantly stronger than any (other) false assumption made by the algorithm, e.g., that the sample mean of the importance weighted returns is normally distributed when using many trajectories.

We call a policy, π, (as opposed to an algorithm) safe if we can ensure that ρ(π) ≥ ρ− with confidence 1 − δ. Note that "a policy is safe" is a statement about our belief concerning that policy given the observed data, and not a statement about the policy itself.

If there are many policies that might be deemed safe, then the policy improvement mechanism should return the one that is expected to perform the best, i.e.,

\pi' \in \arg\max_{\text{safe } \pi} g(\pi \mid \mathcal{D}),    (7)
where g(π|D) ∈ ℝ is a prediction of ρ(π) computed from D. We use a lower-variance, but biased, alternative to ordinary importance sampling, called weighted importance sampling [53], for g, i.e.,

g(\pi \mid \mathcal{D}) := \frac{\sum_{i=1}^{|\mathcal{D}|} \hat{\rho}(\pi \mid \tau^{\mathcal{D}}_i, \pi^{\mathcal{D}}_i)}{\sum_{i=1}^{|\mathcal{D}|} \hat{w}(\tau^{\mathcal{D}}_i, \pi, \pi^{\mathcal{D}}_i)},

where \hat{w}(\tau^{\mathcal{D}}_i, \pi, \pi^{\mathcal{D}}_i) is the importance weight of trajectory τ^D_i. Note that even though Eq. (7) uses g, our safety guarantee is uncompromising: it uses the true (unknown and often unknowable) expected return, ρ(π).

In the following sections, we present batch and incremental policy improvement algorithms that are safe when they use the CI approach to HCOPE and semi-safe when they use the t-test or BCa approaches. Our algorithms have no hyperparameters that require expert tuning.

In the following, we use the † symbol as a placeholder for either CI, TT, or BCa. We also overload the symbol ρ^†_- so that it can take as input a policy, π, and a set of trajectories, D, in place of X, as follows:

\rho^{\dagger}_{-}(\pi, \mathcal{D}, \delta, m) := \rho^{\dagger}_{-}\bigg( \underbrace{\bigcup_{i=1}^{|\mathcal{D}|} \left\{ \hat{\rho}\left(\pi \mid \tau^{\mathcal{D}}_i, \pi^{\mathcal{D}}_i\right) \right\}}_{\mathcal{X}}, \; \delta, \; m \bigg).    (8)

For example, ρ^BCa_-(π, D, δ, m) is a prediction made using the data set D of what the 1 − δ confidence lower bound on ρ(π) would be, if computed from m trajectories by BCa.
Our proposed batch (semi-)safe policy improvement algorithm, POLICYIMPROVEMENT†‡, takes as input a set of trajectories labeled with the policies that generated them, D, a performance lower bound, ρ−, and a confidence level, δ, and outputs either a new policy or NO SOLUTION FOUND (NSF). The meaning of the ‡ subscript will be described later.

When we use D to both search the space of policies and perform safety tests, we must be careful to avoid the multiple comparisons problem [7]. To make this important problem clear, consider what would happen if our search of policy space included only two policies, and we used all of D to test both of them for safety. If at least one is deemed safe, then we return it. HCOPE methods can incorrectly label a policy as safe with probability at most δ. However, the system we have described will make an error whenever either policy is incorrectly labeled as safe, which means its error rate can be as large as 2δ. In practice the search of policy space should include many more than just two policies, which would further increase the error rate.

We avoid the multiple comparisons problem by setting aside data that is only used for a single safety test that determines whether or not a policy will be returned. Specifically, we first partition the data into a small training set, D_train, and a larger test set, D_test. The training set is used to search for the single policy, called the candidate policy, π_c, that should be tested for safety using the test set. This policy improvement method, POLICYIMPROVEMENT†‡, is reported in Algorithm 2. To simplify later pseudocode, POLICYIMPROVEMENT†‡ assumes that the trajectories have already been partitioned into D_train and D_test. In practice, we place a fixed fraction of the trajectories in the training set and the remainder in the (larger) test set. Also, note that POLICYIMPROVEMENT†‡ can use the safe concentration inequality approach, † = CI, or the semi-safe t-test or BCa approaches, † ∈ {TT, BCa}.

POLICYIMPROVEMENT†‡ is presented in a top-down manner in Algorithm 2, and makes use of the GETCANDIDATEPOLICY†‡(D, δ, ρ−, m) method, which searches for a candidate policy. The input m specifies the number of trajectories that will be used during the subsequent safety test. Although GETCANDIDATEPOLICY†‡ could be any batch RL algorithm, like LSPI or FQI [37, 20], we propose an approach that leverages our knowledge that the candidate policy must pass a safety test. We will present two versions of GETCANDIDATEPOLICY†‡, which we differentiate between using the subscript ‡, which may stand for None or k-fold.

Before presenting these two methods, we define an objective function f† as:

f^{\dagger}(\pi, \mathcal{D}, \delta, \rho_-, m) := \begin{cases} g(\pi \mid \mathcal{D}) & \text{if } \rho^{\dagger}_{-}(\pi, \mathcal{D}, \delta, m) \ge \rho_-, \\ \rho^{\dagger}_{-}(\pi, \mathcal{D}, \delta, m) & \text{otherwise.} \end{cases}

Intuitively, f† returns the predicted performance of π if the predicted lower bound on ρ(π) is at least ρ−, and the predicted lower bound on ρ(π) otherwise.
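A minimal sketch of g and of the objective f†, assuming the per-trajectory importance weighted returns and the corresponding full-trajectory importance weights have been precomputed (e.g., with the per-step estimator sketched in Section 3), and taking one of the lower-bound routines above as rho_lower. The m argument of ρ^†_-, which asks for the bound to be predicted as if it were computed from m test trajectories, is omitted here for simplicity; all names are ours.

import numpy as np

def weighted_is_estimate(returns, weights):
    """Weighted importance sampling estimate g(pi|D): importance weighted returns
    normalized by the sum of the trajectories' importance weights."""
    return np.sum(returns) / np.sum(weights)

def f_dagger(returns, weights, rho_lower, delta, rho_minus):
    """Objective for the candidate search: predicted performance if the policy is
    predicted to pass the safety test, otherwise the predicted lower bound."""
    bound = rho_lower(returns, delta)     # e.g., ttest_lower_bound or bca_lower_bound
    if bound >= rho_minus:
        return weighted_is_estimate(returns, weights)
    return bound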
Consider GETCANDIDATEPOLICY†None, which is presented in Algorithm 3. This method uses all of the available training data to search for the policy that is predicted to perform the best, subject to it also being predicted to pass the safety test. That is, if no policy is found that is predicted to pass the safety test, it returns the policy, π, that it predicts will have the highest lower bound on ρ(π). If policies are found that are predicted to pass the safety test, it returns one that is predicted to perform the best (according to g).

The benefits of this approach are its simplicity and that it works well when there is an abundance of data. However, when there are few trajectories in D (e.g., a cold start), this approach has a tendency to overfit: it finds a policy that it predicts will perform exceptionally well and which will easily pass the safety test, but which actually fails the subsequent safety test in POLICYIMPROVEMENT†None. We call this method ‡ = None because it does not use any methods to avoid overfitting.

Algorithm 2 POLICYIMPROVEMENT†‡(D_train, D_test, δ, ρ−): Either returns NO SOLUTION FOUND (NSF) or a (semi-)safe policy. Here † can denote either CI, TT, or BCa.
1: π_c ← GETCANDIDATEPOLICY†‡(D_train, δ, ρ−, |D_test|)
2: if ρ^†_-(π_c, D_test, δ, |D_test|) ≥ ρ− then return π_c
3: return NSF
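Putting the pieces together, Algorithm 2 is a candidate search on the training split followed by one safety test on the held-out split. The sketch below keeps the candidate search abstract (the paper's GETCANDIDATEPOLICY variants appear in Algorithms 3 and 4); all function names are placeholders of ours.

def policy_improvement(d_train, d_test, delta, rho_minus,
                       get_candidate_policy, evaluate_returns, rho_lower):
    """Batch (semi-)safe policy improvement (a simplified view of Algorithm 2).

    get_candidate_policy(d_train, delta, rho_minus, m) -> candidate policy pi_c
    evaluate_returns(policy, data)  -> per-trajectory importance weighted returns
    rho_lower(returns, delta)       -> HCOPE lower bound (CI, TT, or BCa)
    """
    pi_c = get_candidate_policy(d_train, delta, rho_minus, len(d_test))
    held_out_returns = evaluate_returns(pi_c, d_test)
    if rho_lower(held_out_returns, delta) >= rho_minus:   # the single safety test
        return pi_c
    return None   # corresponds to "NO SOLUTION FOUND"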
Algorithm 3 GETCANDIDATEPOLICY†None(D, δ, ρ−, m): Searches for the candidate policy, but does nothing to mitigate overfitting.
1: return arg max_π f†(π, D, δ, ρ−, m)

In machine learning, it is common to introduce a regularization term, α‖w‖, into the objective function in order to prevent overfitting. Here w is the model's weight vector, ‖·‖ is some measure of the complexity of the model (often the L1 or squared L2 norm), and α is a parameter that is tuned using a model selection method like cross-validation. This term penalizes solutions that are too complex, since they are likely to be overfitting the training data.

Here we use the same intuition, and control the complexity of the solution policy using a regularization parameter, α, that is optimized using k-fold cross-validation. Just as the squared L2 norm relates the complexity of a weight vector to its squared distance from the zero vector, we define the complexity of a policy to be some notion of its distance from the initial policy, π_0. In order to allow for an intuitive meaning of α, rather than adding a regularization term to our objective function, f†(·, D_train, δ, ρ−, |D_test|), we directly constrain the set of policies that we search over to have limited complexity.

We achieve this by only searching the space of mixed policies µ_{α,π_0,π}, where

\mu_{\alpha, \pi_0, \pi}(a \mid s) := \alpha \pi(a \mid s) + (1 - \alpha) \pi_0(a \mid s).

Here, α is the fixed regularization parameter, π_0(a|s) is the fixed initial policy, and we search the space of all possible π. Consider, for example, what happens to the probability of action a in state s when α = 0. . If π_0(a|s) = 0. , then for any π, we have that µ_{α,π_0,π}(a|s) ∈ [0. , 0. ]. That is, the mixed policy can only move part of the way towards being deterministic (in either direction). In general, the mixed policy can change the probability of an action by no more than 100α% towards being deterministic. So, using mixed policies results in our searches of policy space being constrained to a feasible set centered around the initial policy, where α scales the size of this feasible set.

While small values of α can effectively eliminate overfitting by precluding the mixed policy from moving far away from the initial policy, they also limit the quality of the best mixed policy in the feasible set. It is therefore important that α is chosen to balance the tradeoff between overfitting and limiting the quality of the solutions that remain in the feasible set. Just as in machine learning, we use k-fold cross-validation to automatically select α.

This approach is provided in Algorithm 4, where CROSSVALIDATE†(α, D, δ, ρ−, m) uses k-fold cross-validation to predict the value of f†(π, D_test, δ, ρ−, |D_test|) if π were to be optimized using D_train with regularization parameter α. CROSSVALIDATE† is reported in Algorithm 5. In our implementations we use k = min{ , |D|} folds.
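Returning to the mixed policies defined above, the construction itself is a one-liner. The sketch below treats π and π_0 as arrays of action probabilities for a given state; the representation and the example values are ours, chosen only to illustrate how α bounds the move towards determinism.

import numpy as np

def mixed_policy_probs(alpha, pi_probs, pi0_probs):
    """mu_{alpha, pi0, pi}(.|s): convex combination of the search policy and the initial policy."""
    return alpha * np.asarray(pi_probs) + (1.0 - alpha) * np.asarray(pi0_probs)

# Illustration: with alpha = 0.5 the mixture can move at most half-way from pi0 towards
# a deterministic choice, which is how alpha limits how far the search can stray from pi0.
print(mixed_policy_probs(0.5, [1.0, 0.0], [0.5, 0.5]))   # -> [0.75, 0.25]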
The POLICYIMPROVEMENT†‡ algorithm is a batch method that can be applied to an existing data set, D. However, it can also be used in an incremental manner by executing new safe policies whenever they are found. The user might choose to change ρ− at each iteration, e.g., to reflect an estimate of the performance of the best policy found so far or of the most recently proposed policy.
Algorithm 4 GETCANDIDATEPOLICY†k-fold(D, δ, ρ−, m): Searches for the candidate policy using k-fold cross-validation to avoid overfitting.
1: α* ← arg max_{α∈[0,1]} CROSSVALIDATE†(α, D, δ, ρ−, m)
2: π* ← arg max_π f†(µ_{α*,π_0,π}, D, δ, ρ−, m)
3: return µ_{α*,π_0,π*}

Algorithm 5 CROSSVALIDATE†(α, D, δ, ρ−, m)
1: Partition D into k subsets, D_1, ..., D_k, of approximately the same size.
2: result ← 0
3: for i = 1 to k do
4:   D̂ ← ∪_{j≠i} D_j
5:   π* ← arg max_π f†(µ_{α,π_0,π}, D̂, δ, ρ−, m)
6:   result ← result + f†(µ_{α,π_0,π*}, D_i, δ, ρ−, m)
7: end for
8: return result/k

However, for simplicity in our pseudocode and experiments, we assume that the user fixes ρ− as an estimate of the performance of the initial policy. This scheme for selecting ρ− is appropriate when trying to convince a user to deploy an RL algorithm to tune a currently fixed initial policy, since it guarantees with high confidence that it will not decrease performance.

Our algorithm maintains a list, C, of the policies that it has deemed safe. When generating new trajectories, it always uses the policy in C that is expected to perform best. C is initialized to include a single initial policy, π_0, which is the same as the baseline policy used by GETCANDIDATEPOLICY†k-fold. This online safe learning algorithm is presented in Algorithm 6. It takes as input an additional constant, β, which denotes the number of trajectories to be generated by each policy. If β is not already specified by the application, it should be selected to be as small as possible, while still allowing DAEDALUS†‡ to execute within the available time. We name this algorithm DAEDALUS†‡ after the mythological character who promoted safety when he encouraged Icarus to use caution.

Algorithm 6 DAEDALUS†‡(π_0, δ, ρ−, β): Incremental policy improvement algorithm.
1: C ← {π_0}
2: D_train ← D_test ← ∅
3: while true do
4:   D̂ ← D_train
5:   π* ← arg max_{π∈C} g(π | D̂)
6:   Generate β trajectories using π* and append ⌈β/ ⌉ of them to D_train and the rest to D_test
7:   π_c ← POLICYIMPROVEMENT†‡(D_train, D_test, δ, ρ−)
8:   D̂ ← D_train
9:   if π_c ≠ NSF and g(π_c | D̂) > max_{π∈C} g(π | D̂) then
10:    C ← C ∪ {π_c}
11:    D_test ← ∅
12:  end if
13: end while

The benefits of ‡ = k-fold are biggest when only a few trajectories are available, since then GETCANDIDATEPOLICY†None is prone to overfitting. When there is a lot of data, overfitting is not a big problem, and so the additional computational complexity of k-fold cross-validation is not justified. In our implementations of DAEDALUS†k-fold, we therefore only use ‡ = k-fold until the first policy is successfully added to C, and ‡ = None thereafter. This provides the early benefits of k-fold cross-validation without incurring its full computational complexity.

The DAEDALUS†‡ algorithm ensures safety with each newly proposed policy. That is, during each iteration of the while-loop, the probability that a new policy, π, with ρ(π) < ρ−, is added to C is at most δ. The multiple comparisons problem is not relevant here because this guarantee is per-iteration. However, if we consider the safety guarantee over multiple iterations of the while-loop, it applies, and means that the probability that at least one policy, π, with ρ(π) < ρ−, is added to C over k iterations is at most min{1, kδ}. If trajectories are available a priori, then D_train, D_test, and C can be initialized accordingly.
We define D̂AEDALUS†‡ to be DAEDALUS†‡ with line 11 removed. The multiple hypothesis testing problem does not affect D̂AEDALUS†‡ more than DAEDALUS†‡, since the safety guarantee is per-iteration. However, a more subtle problem is introduced: the importance weighted returns from the trajectories in the testing set, ρ̂(π_c | τ_i^{D_test}, π_i^{D_test}), are not necessarily unbiased estimates of ρ(π_c). This happens because the policy, π_c, is computed in part from the trajectories in D_test that are used to test it for safety. This dependence is depicted in Figure 6. We also modify D̂AEDALUS†‡ by changing lines 4 and 8 to D̂ ← D_train ∪ D_test, which introduces an additional minor dependence of π_c on the trajectories in D_test^j.
Figure 6: This diagram depicts influences as D̂AEDALUS†‡ runs. First, π_0 is used to generate sets of trajectories, D_train^1 and D_test^1, where superscripts denote the iteration. Next, D_train^1 is used to select the candidate policy, π_c^1. Next, π_c^1 is tested for safety using the trajectories in D_test^1 (this safety test occurs on line 2 of POLICYIMPROVEMENT†‡). The result of the safety test influences which policy, π_1, will be executed next. These policies are then used to produce D_train^2 and D_test^2 as before. Next, both D_train^1 and D_train^2 are used to select the candidate policy, π_c^2. This policy is then tested for safety using the trajectories in D_test^1 and D_test^2. The result of this test influences which policy, π_2, will be executed next, and the process continues. Notice that D_test^1 is used when testing π_c^2 for safety (as indicated by the dashed blue line) even though it also influences π_c^2 (as indicated by the dotted red path). This is akin to performing an experiment, using the collected data (D_test^1) to select a hypothesis (that π_c^2 is safe), and then using that same data to test the hypothesis. DAEDALUS†‡ does not have this problem because the dashed blue line is not present.

Although our theoretical analysis applies to DAEDALUS†‡, we propose the use of D̂AEDALUS†‡ because the ability of the trajectories, D_test^i, to bias the choice of which policy to test for safety in the future (π_c^j, where j > i) towards a policy that D_test^i will deem safe is small. However, the benefits of D̂AEDALUS†‡ over DAEDALUS†‡ are significant: the set of trajectories used in the safety tests increases in size with each iteration, as opposed to always being of size β. So, in practice, we expect the over-conservativeness of ρ^CI_- to far outweigh the error introduced by D̂AEDALUS†‡. Notice that D̂AEDALUS^CI_‡ is safe (not just semi-safe) if we consider its execution up until the first change of the policy, since until then the trajectories are always generated by π_0, which is not influenced by any of the testing data.

For our case study we used real data, captured with permission from the website of a Fortune 50 company that receives hundreds of thousands of visitors per day and which uses Adobe Target, to train a simulator using a proprietary in-house system identification tool at Adobe. The simulator produces a vector of real-valued features that provide a compressed representation of all of the available information about a user. The advertisements are clustered into two high-level classes that the agent must select between. After the agent selects an advertisement, the user either clicks (reward of +1) or does not click (reward of 0), and the feature vector describing the user is updated. Although this greedy approach has been successful, as we discussed in Section 4, it does not necessarily also maximize the total number of clicks from each user over his or her lifetime. Therefore, we consider a full reinforcement learning solution for this problem. We selected T = 20 and γ = 1. This is a particularly challenging problem because the reward signal is sparse. If each action is selected with probability 0.5 always, only about . of the transitions are rewarding, since users usually do not click on the advertisements. This means that most trajectories provide no feedback. Also, whether a user clicks or not is close to random, so returns have relatively high variance.
We generated data using an initial baseline policy and then evaluated a new policy proposed by an in-house reinforcement learning algorithm. In order to avoid the large costs associated with the deployment of a bad policy, in this application it is imperative that new policies proposed by RL algorithms are ensured to be safe before deployment.
Figure 7: Performance of DAEDALUS†‡ on the digital marketing domain. The legend specifies ‡, †.

Results:
In our experiments, we selected ρ− to be an empirical estimate of the performance of the initial policy and δ = 0. . We used CMA-ES [24] to solve all of the arg max_π searches, where π was parameterized by a vector of policy parameters using linear softmax action selection [60] with the Fourier basis [35]. For our problem domain, we executed DAEDALUS†‡ with † ∈ {CI, TT, BCa} and ‡ ∈ {None, k-fold}. Ideally, we would use β = 1 for all domains. However, as β decreases, the runtime increases. We selected β ∈ [50, ] for the digital marketing domain; β increases with the number of trajectories so that the plot can span the number of trajectories required by the CI approach without requiring too many calls to the computationally demanding POLICYIMPROVEMENT^BCa_k-fold method.
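For context, a linear softmax policy over Fourier basis features (the parameterization named above) can be sketched as follows; the basis order, dimensions and scaling are placeholders of ours, and in the paper's setting CMA-ES would search the flat parameter vector theta.

import numpy as np

def fourier_features(state, order=3):
    """Fourier basis features cos(pi * c . s) for all integer coefficient vectors c
    with entries in {0, ..., order}; `state` is assumed to be scaled to [0, 1]^d."""
    d = len(state)
    coeffs = np.array(np.meshgrid(*[range(order + 1)] * d)).T.reshape(-1, d)
    return np.cos(np.pi * coeffs.dot(state))

def softmax_policy_probs(theta, state, n_actions, order=3):
    """Linear softmax action selection: one weight vector per action over the features."""
    phi = fourier_features(state, order)
    theta = np.asarray(theta).reshape(n_actions, phi.size)
    prefs = theta.dot(phi)
    prefs = prefs - prefs.max()            # numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()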
We did not tune β for these experiments; it was set solely to limit the runtime.

The performance of DAEDALUS†‡ on the digital marketing domain is provided in Figure 7. The expected normalized returns in Figure 7 are computed using , Monte Carlo rollouts. The curves are averaged over trials, with standard error bars provided when they do not cause too much clutter.

First, consider the different values for †. As expected, the CI approaches (solid curves) are the most conservative, and therefore require the most trajectories in order to guarantee improvement. The BCa approaches (dashed lines) perform the best, and are able to provide high-confidence guarantees of improvement with as few as trajectories. The TT approaches (dotted lines) perform in between the CI and BCa approaches, as expected (since the t-test tends to produce overly conservative lower bounds for distributions with heavy upper tails).

Next, consider the different values of ‡. Using k-fold cross-validation provides an early boost in performance by limiting overfitting when there are few trajectories in the training set. Although the results are not shown, we experimented with using ‡ = k-fold for the entire runtime (rather than just until the first policy improvement), but found that while it increased the runtime significantly, it did not produce much improvement.

In the previous sections we made the critical assumption that the domain can be modeled as a POMDP. However, real-world problems are often non-stationary. In this section we consider the problem of evaluating an SR policy offline without assuming stationary transitions and rewards. We argue that off-policy policy evaluation for non-stationary MDPs can be phrased as a time series prediction problem, which results in predictive methods that can anticipate changes before they happen. We therefore propose a synthesis of existing off-policy policy evaluation methods with existing time series prediction methods, which we show results in a drastic reduction of mean squared error when evaluating policies using a real digital marketing data set [70].
In digital marketing applications, when a person visits the website of a company, she is often shown a list of current promotions. In order for the display of these promotions to be effective, it must be properly targeted based on the known information about the person (e.g., her interests, past travel behavior, or income). The problem now reduces to automatically deciding which promotion (sometimes called a campaign) to show to the visitor of a website.
Figure 8: Plot of OPE(π_e, ι | D) for various ι, on real-world digital marketing data from a large company in the hotel and entertainment industry. This data spans several days. Since the raw data has high variance, we bin the data into bins that each span one hour. Notice that the performance of the policy drops from an initial CTR of . down to a near-zero CTR near the middle of the data set.

As we described in Section 4, the system's goal is to determine how to select actions (select promotions to display) based on the available observations (the known information about the visitor) such that the reward is maximized (the number of clicks is maximized). Let ρ(π_e, ι) be the performance of the policy π_e in episode ι. In the bandit setting ρ(π_e, ι) is the expected number of clicks per visit, called the click through rate (CTR), while in the reinforcement learning setting it is the expected number of clicks per user, called the life-time value (LTV).

In order to determine how much of a problem non-stationarity really is, we collected data from the website of one of Adobe's Test and Target customers: the website of a large company in the hotel and entertainment industry. We then used a proprietary policy search algorithm custom designed for digital marketing to generate a new policy for the customer. We then collected n ≈ , new episodes of data, which we used as D, to compute OPE(π_e, ι | D) for all ι ∈ {0, ..., n − 1} using ordinary importance sampling. Figure 8 summarizes the resulting data.

In this data it is evident that there is significant non-stationarity: the CTR varied drastically over the span of the plot. This is also not just an artifact of high variance: using Student's t-test we can conclude that the expected return during the first , and subsequent , episodes was different with p = 1. × 10^{−}. This is compelling evidence that we cannot ignore non-stationarity in our users' data when providing predictions of the expected future performance of our digital marketing algorithms, and is compelling real-world motivation for developing non-stationary off-policy policy evaluation algorithms.

Non-stationary Off-Policy Policy Evaluation (NOPE) is simply OPE for non-stationary MDPs. In this setting, the goal is to use the available data D to estimate ρ(π_e, n), the performance of π_e during the next episode (the n-th episode). Notice that we have not made assumptions about how the transition and reward functions of the non-stationary MDP change. For some applications, they may drift slowly, making ρ(π_e, ι) change slowly with ι. For example, this sort of drift may occur due to mechanical wear in a robot. For other applications, ρ(π_e, ι) may be fixed for some number of episodes, and then make a large jump. For example, this sort of jump may occur in digital marketing applications [64] due to media coverage of a relevant topic rapidly changing public opinion of a product. In yet other applications, the environment may include both large jumps and smooth drift.

Notice that NOPE can range from trivial to completely intractable. If the MDP has few states and actions, changes slowly between episodes, and the evaluation policy is similar to the behavior policy, then we should be able to obtain accurate off-policy estimates. On the other extreme, if for each episode the MDP's transition and reward functions are drawn randomly (or adversarially) from a wide distribution, then producing accurate estimates of ρ(π_e, n) may be intractable.
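A sketch of the kind of check described above, assuming the per-episode importance sampling estimates have already been computed (e.g., with the per-step estimator from Section 3); the hourly binning and the split point are illustrative choices of ours, not the paper's exact settings.

import numpy as np
from scipy import stats

def hourly_bins(ope_values, episodes_per_hour):
    """Average noisy per-episode OPE estimates into one value per hour."""
    n = (len(ope_values) // episodes_per_hour) * episodes_per_hour
    return np.asarray(ope_values[:n]).reshape(-1, episodes_per_hour).mean(axis=1)

def nonstationarity_check(ope_values, split):
    """Welch's t-test comparing early vs. late per-episode OPE estimates.

    A small p-value is evidence that the evaluation policy's expected return
    changed over time, i.e., that the process is non-stationary.
    """
    early = np.asarray(ope_values[:split], dtype=float)
    late = np.asarray(ope_values[split:], dtype=float)
    return stats.ttest_ind(early, late, equal_var=False)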
Figure 9: This illustration depicts an example of how the existing standard OPE methods produce reactive behavior, and is hand-drawn to provide intuition. Here the dotted blue line depicts ρ(π_e, ι) for various ι. The black dots denote OPE(π_e, ι | D) for various ι. Notice that each OPE(π_e, ι | D) is a decent estimate of ρ(π_e, ι), which changes with ι. Our goal is to estimate ρ(π_e, n), the performance of the policy during the next episode. That is, our goal is to predict the vertical position of the green circle. However, by averaging the OPE estimates, we get the red circle, which is a reasonable prediction of performance in the past. As more data arrives (n increases) the predictions will change, but will always remain behind the target value of ρ(π_e, n).

The primary insight in this section, in retrospect, is obvious:
NOPE is a time series prediction problem.
Figure 9 provides an illustration of the idea. Let x_ι = ι and y_ι = OPE(π_e, ι | D) for ι ∈ {0, . . . , n − 1}. This makes x an array of n times (each episode corresponds to one unit of time) and y an array of the corresponding n observations. Our goal is to predict the expected value of the next point in this time series, which will occur at x_n = n. Pseudocode for this time series prediction (TSP) approach is given in Algorithm 7.

Algorithm 7 Time Series Prediction (TSP)
Input: Evaluation policy, π_e, historical data, D := (H_ι, π_ι)_{ι=0}^{n−1}, and a time-series prediction algorithm (and its hyper-parameters).
  Create arrays x and y, both of length n.
  for ι = 0 to n − 1 do
    x_ι ← ι
    y_ι ← OPE(π_e, ι | D)
  end for
  Train the time-series prediction algorithm on (x, y).
  return the time-series prediction algorithm's prediction for time n.

When considering using time-series prediction methods for off-policy policy evaluation, it is important to establish that the underlying process is actually nonstationary. One popular method for determining whether a process is stationary or nonstationary is to report the sample autocorrelation function (ACF):

ACF_h := E[(X_{t+h} − µ)(X_t − µ)] / E[(X_t − µ)^2],

where h is a parameter called the lag (which is selected by the researcher), X_t is the time series, and µ is the mean of the time series. For a stationary time series, the ACF drops to zero relatively quickly, while the ACF of nonstationary data decreases slowly.
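To make Algorithm 7 concrete, the following is a minimal sketch of the TSP estimator in Python, assuming the per-episode OPE estimates have already been computed (for example, with ordinary importance sampling). The function name, the fixed ARIMA order, and the use of statsmodels are illustrative choices; the experiments reported below instead rely on the R forecast package, which selects an ARIMA model automatically.

```python
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.arima.model import ARIMA

def predict_next_performance(ope_estimates, order=(1, 1, 1), lags=40):
    """Forecast rho(pi_e, n) from the series OPE(pi_e, iota | D), iota < n."""
    y = np.asarray(ope_estimates, dtype=float)

    # Diagnostic: a slowly decaying sample ACF suggests non-stationarity.
    sample_acf = acf(y, nlags=min(lags, len(y) - 1))

    # Standard (stationary) estimate: the sample mean of all past OPE values.
    standard_estimate = y.mean()

    # TSP estimate: fit a time-series model and forecast one step ahead.
    tsp_estimate = ARIMA(y, order=order).fit().forecast(steps=1)[0]

    return tsp_estimate, standard_estimate, sample_acf
```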
ARIMA models are models of time series data that can capture many different sources of non-stationarity. The time series prediction algorithm that we use in our experiments is the R forecast package for fitting ARIMA models [28].

In this section we show that, despite the lack of theoretical results about using TSP for NOPE, it performs remarkably well on real data. Because our experiments use real-world data, we do not know ground truth—we have
OPE(π_e, ι | D) for a series of ι, but we do not know ρ(π_e, ι) for any ι. This makes evaluating our methods challenging—we cannot, for example, compute the true error or mean squared error of estimates. We therefore estimate the mean error and mean squared error directly from the data as follows.

For each ι ∈ {0, . . . , n − 1} we compute each method's output, ŷ_ι, given all of the previous data, D_{ι−1} := (H_{ι̂}, π_{ι̂})_{ι̂=0}^{ι−1}. We then compute the observed next value, y_ι = OPE(π_e, ι | D_ι). From these, we compute the squared error, (ŷ_ι − y_ι)^2, and we report the mean squared error over all ι. We perform this experiment using both the current standard OPE approach, which computes the sample mean of performance over all the available data, and using our new time series prediction approach.

Notice that this scheme is not perfect. Even if an estimator perfectly predicts ρ(π_e, ι) for every ι, it will be reported as having non-zero mean squared error. This is due to the high variance of OPE, which gets conflated with the variance of ŷ in our estimate of mean squared error. Although this means that the mean squared errors that we report are not good estimates of the mean squared error of the estimators, ŷ, the variance-conflation problem impacts all methods nearly equally. So, in the absence of ground-truth knowledge, the reported mean squared error values are a reasonable measure of how accurate the methods are relative to each other.

The domain we consider is digital marketing using the data from the large company in the hotel and entertainment industry described in Section 6.1. We refer to this domain as the Hotel domain. For this domain, and all others, we used ordinary importance sampling for
OPE. Recall that the performance of the evaluation policy appears to drop initially—the probability of a user clicking decays from a remarkably high value down to a near-zero probability—before it rises back to close to its starting level. Recall also that using a two-sided Student's t-test we found that the true mean during the initial trajectories was different from the true mean during the subsequent trajectories, so the non-stationarity that we see is likely not noise.

We collected additional data from the website of a large company in the financial industry, and used the same proprietary policy improvement algorithm to find a new policy that we might consider deploying for the user. There appears to be less long-term non-stationarity in this data, and a two-sided Student's t-test did not detect a difference between the early and late performance of the evaluation policy. We refer to this data as the Bank domain.
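The evaluation scheme described above can be sketched as a rolling one-step-ahead comparison. The snippet below is a hypothetical illustration, assuming ope_estimates[ι] holds OPE(π_e, ι | D_ι) and reusing the predict_next_performance sketch from earlier; the warm-up length is an arbitrary choice.

```python
import numpy as np

def rolling_rmse(ope_estimates, warmup=10):
    """Compare the TSP and standard estimators by their one-step-ahead RMSE."""
    tsp_errors, standard_errors = [], []
    for iota in range(warmup, len(ope_estimates)):
        history = ope_estimates[:iota]        # all data observed before episode iota
        target = ope_estimates[iota]          # the observed next value, y_iota
        tsp_pred, standard_pred, _ = predict_next_performance(history)
        tsp_errors.append((tsp_pred - target) ** 2)
        standard_errors.append((standard_pred - target) ** 2)
    return np.sqrt(np.mean(tsp_errors)), np.sqrt(np.mean(standard_errors))
```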
We applied our TSP algorithm for NOPE, described in Algorithm 7, to the nonstationary hotel and bank data sets. The plots in Figure 10 all take the same form: the plots on the left are autocorrelation plots that show whether or not there appears to be non-stationarity in the data. As a rule of thumb, if the ACF values are within the dotted blue lines, then there is not sufficient evidence to conclude that there is non-stationarity. However, if the ACF values lie outside the dotted blue lines, it suggests that there is non-stationarity.

The plots on the right depict the expected return (which is the expected CTR for the hotel and bank data sets) as predicted by several different methods. The black curves are the target values—the observed mean OPE estimate over a small time interval. For each episode number, our goal is to compute the value of the black curve given all of the previous values of the black curve. The blue curve does this using the standard method, which simply averages the previous black points. The red curve is our newly proposed method, which uses ARIMA to predict the next point on the black curve—to predict the performance of the evaluation policy during the next episode. Above the plots we report the sample root mean squared error (RMSE) for our method, tsp, and the standard method, standard.

Consider the results on the hotel data set, which are depicted in Figure 10 (Top). The red curve (our method) tracks the binned values (black curve) much better than the blue curve (standard method). Also, the sample RMSE of our method is 0.025, which is lower than the standard method's RMSE of 0.036. This suggests that treating the problem as a time series prediction problem results in more accurate estimates.

Finally, consider the results on the bank data set, which are depicted in Figure 10 (Bottom). The autocorrelation plot suggests that there is not much non-stationarity in this data set. This validates another interesting use case for our method: does it break down when the environment happens to be (approximately) stationary?
Figure 10: (Top) Hotel domain. The left plot shows the autocorrelation for the time series, where it is obvious that the signal is nonstationary. The right plot compares the tsp approach with the standard approach (hotel RMSE: tsp = 0.025, standard = 0.036); tsp outperforms the standard approach, since the series is nonstationary. The time series was aggregated at the hour level. (Bottom) Bank domain. The left plot shows the autocorrelation for the time series, where it is obvious that the signal is stationary. The right plot compares the tsp approach with the standard approach (bank RMSE: tsp = 0.012, standard = 0.012); they both perform the same, since the series is stationary. The time series was aggregated at the hour level.
The results suggest that it does not—our method achieves the same RMSE as the standard method, and the blue and red curves are visually quite similar.

An interesting research question is whether our high-confidence policy evaluation and improvement algorithms can be extended to non-stationary MDPs. Following the TSP approach, however, estimating the performance of a policy with high confidence in a non-stationary MDP reduces to time-series forecasting with high confidence, which in complete generality is infeasible. An open research direction is to leverage domain-specific structure and identify conditions under which this problem becomes feasible.
Constructing SR systems is particularly challenging due to the cold start problem. Fortunately, in many real-world problems there is an abundance of sequential data, which is usually 'passive' in that it does not include past recommendations. In this section we propose a practical approach that learns from passive data. We use a scalar parameterization that turns a passive model into an active one, and posterior sampling for reinforcement learning (PSRL) to learn the correct parameter value. In this section we summarize our work from [65, 66].

The idea is to first learn a model from passive data that predicts the next activity given the history of activities. This can be thought of as the no-recommendation or passive model. To create actions for recommending the various activities, we can perturb the passive model. Each perturbed model increases the probability of following the recommendations by a different amount. This leads to a set of models, each one with a different 'propensity to listen'. In effect, the single 'propensity to listen' parameter is used to turn a passive model into a set of active models. When there are multiple models one can use online algorithms, such as posterior sampling for reinforcement learning (PSRL), to identify the best model for a new user [59, 47]. In fact, we used a deterministic schedule PSRL (DS-PSRL) algorithm, and we have shown how our parameterization satisfies its assumptions in [66]. The overall solution is shown in Figure 11.
Figure 11: The first 4 steps are done offline and are used to create and solve a discrete set of MDPs, one for each value of θ. Step 5 implements the DS-PSRL algorithm.

The first step in the solution is to model past sequences of activities. Because the number of activities is usually finite and discrete, and because the activity a person may do next depends on the history of
activities done so far, we chose to model activity sequences using probabilistic suffix trees (PST). PSTs are a compact way of modeling the most frequent suffixes, or histories, of a discrete alphabet S (e.g., a set of activities). The nodes of a PST represent suffixes (or histories). Each node is associated with a probability distribution for observing every symbol in the alphabet, given the node suffix [21]. Given a PST model one can easily estimate the probability of the next symbol s = s_{t+1} given the history of symbols X = (s_1, s_2, . . . , s_t) as P(s | X). An example PST is shown in Figure 12.

Figure 12: An example probabilistic suffix tree. The circles represent suffixes. In this tree, the suffixes are {(1), (3,1), (4,1), (2,4,1), (4)}. The rectangles show the probability of observing the next symbol given the suffix.

The log likelihood of a set of sequences can easily be computed as log(L) = Σ_{s ∈ S} log(P(s | X)), where s ranges over all the symbols appearing in the data and X is the longest suffix (node) available in the tree for each symbol. For our implementation we learned PSTs using the pstree algorithm from [21]. The pstree algorithm can take as input multiple parameters, such as the depth of the tree, the minimum number of occurrences of a suffix, and parametrized tree pruning methods. To select the best set of parameters we perform model selection using the modified Akaike Information Criterion (AICc),

AICc = 2k − 2 log(L) + 2k(k + 1)/(n − k − 1),

where log(L) is the log likelihood as defined earlier and k is the number of parameters [2].

The second step involves the creation of action models for representing various personas. An easy way to create such a parameterization is to perturb the passive dynamics of the crowd PST (a global PST learned from all the data). Each perturbed model increases the probability of listening to the recommendation by a different amount. While there could be many functions to increase the transition probabilities, in our implementation we did it as follows:

P(s | X, a, θ) = P(s | X)^{1/θ} if a = s, and P(s | X)/z(θ) otherwise,   (9)

where s is an activity, X = (s_1, s_2, . . . , s_t) a history of activities, and z(θ) = (Σ_{s′ ≠ a} P(s′ | X)) / (1 − P(s = a | X)^{1/θ}) is a normalizing factor.

The third step is to create MDPs from P(s | X, a, θ) and compute their policies in the fourth step. It is straightforward to use the PST to compute an MDP model, where the states/contexts are all the nodes of the PST. If we denote by x a suffix available in the tree, then we can compute the probability of transitioning from every node to every other node by finding the resulting suffixes in the tree for every additional symbol that an action can produce:

p(x′ | x, a, θ) = Σ_{s ∈ S} 1{x′ = pst.suffix(x, s)} p(s | x, a, θ),

where pst.suffix(x, s) is the longest suffix in the PST of suffix x concatenated with symbol s. We set the reward r(x, a) = f(x, a), where f is a function of the suffix history and the recommendation. This gives us a finite and practically small state space. We can use the classic policy iteration algorithm to compute the optimal policies and value functions V*_θ(x).
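The following is a minimal sketch of the 'propensity to listen' perturbation in Equation (9), assuming the passive PST prediction P(· | X) is given as a dictionary over next activities. Function and variable names are illustrative; θ ≥ 1 controls how strongly the recommended activity is boosted.

```python
def perturb_passive_model(passive_probs, recommended, theta):
    """Return P(. | X, a, theta) from the passive distribution P(. | X)."""
    boosted = passive_probs[recommended] ** (1.0 / theta)   # P(s|X)^(1/theta) for s = a
    # The remaining mass is spread over the other activities proportionally to P(s|X).
    rest = sum(p for s, p in passive_probs.items() if s != recommended)
    z = rest / (1.0 - boosted)
    return {s: (boosted if s == recommended else p / z)
            for s, p in passive_probs.items()}

# Example: recommending a rarely chosen activity 'b' with theta = 4 boosts its
# probability while keeping a valid distribution.
print(perturb_passive_model({"a": 0.7, "b": 0.1, "c": 0.2}, "b", theta=4.0))
```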
The fifth step is to use online learning to compute the true user parameters. For this we used a posterior sampling for reinforcement learning (PSRL) algorithm called deterministic schedule PSRL (DS-PSRL) [66]. The DS-PSRL algorithm, shown in Figure 13, changes the policy in an exponentially rare fashion; if the length of the current episode is L, the next episode has length 2L. This switching schedule ensures that the total number of switches is O(log T).

Inputs: the prior distribution of θ*.
  L ← 1
  for t ← 1, 2, . . . do
    if t = L then
      Sample θ̃_t ∼ P_t.
      L ← 2L
    else
      θ̃_t ← θ̃_{t−1}
    end if
    Calculate near-optimal action a_t ← π*(x_t, θ̃_t).
    Execute action a_t and observe the new state x_{t+1}.
    Update P_t with (x_t, a_t, x_{t+1}) to obtain P_{t+1}.
  end for

Figure 13: The DS-PSRL algorithm with a deterministic schedule of policy updates.

The algorithm makes three assumptions. First, it assumes that the MDP is weakly communicating. This is a standard assumption, and under it the optimal average loss satisfies the Bellman equation. Second, it assumes that the dynamics are parametrized by a scalar parameter and satisfy a smoothness condition.
Assumption 7.1 (Lipschitz Dynamics). There exists a constant C such that for any state x, action a, and parameters θ, θ′ ∈ Θ ⊆ ℝ,

‖P(· | x, a, θ) − P(· | x, a, θ′)‖ ≤ C |θ − θ′|.

Third, it makes a concentrating posterior assumption, which states that the variance of the difference between the true parameter and the sampled parameter gets smaller as more samples are gathered.
Assumption 7.2 (Concentrating Posterior). Let N_j be one plus the number of steps in the first j episodes. Let θ̃_j be sampled from the posterior at the current episode j. Then there exists a constant C′ such that

max_j E[ N_{j−1} |θ* − θ̃_j|^2 ] ≤ C′ log T.
Assumption 7.2 simply says that the variance of the posterior decreases as more data are gathered. In other words, we assume that the problem is learnable and not a degenerate case. Assumption 7.2 was shown to hold for two general categories of problems: finite MDPs and linearly parametrized problems with Gaussian noise [1]. Under these assumptions the following theorem can be proven [66].
Theorem 7.3.
Under Assumptions 7.1 and 7.2, the regret of the DS-PSRL algorithm is bounded as R_T = Õ(C √(C′ T)), where the Õ notation hides logarithmic factors.

Notice that the regret bound in Theorem 7.3 does not directly depend on S or A. Moreover, the regret bound is smaller if the Lipschitz constant C is smaller or if the posterior concentrates faster (i.e., C′ is smaller). Here we summarize how the parameterization in Equation 9 satisfies Assumptions 7.1 and 7.2.
Lipschitz Dynamics
We can show that the dynamics are Lipschitz continuous:
Lemma 7.4 (Lipschitz Continuity). Assume the dynamics are given by Equation 9. Then for all θ, θ′ ≥ 1 and all X and a, we have

‖P(· | X, a, θ) − P(· | X, a, θ′)‖ ≤ e |θ − θ′|.
Concentrating Posterior We can also show that Assumption 7.2 holds. Specifically, under mild technical conditions, we have

max_j E[ N_{j−1} |θ* − θ̃_j|^2 ] = O(1).

Please refer to [66] for the proofs.
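For illustration, the DS-PSRL loop of Figure 13 can be sketched as follows, assuming helper functions sample_posterior, solve_mdp (e.g., policy iteration on the PST-derived MDP for a fixed θ), environment_step, and update_posterior are available. All names are placeholders; the essential point is the deterministic doubling schedule, which re-samples θ only O(log T) times over T steps.

```python
def ds_psrl(prior, initial_state, horizon_T):
    posterior = prior
    state = initial_state
    next_switch = 1                         # L: re-sample theta when t == L
    theta, policy = None, None
    for t in range(1, horizon_T + 1):
        if t == next_switch:
            theta = sample_posterior(posterior)    # theta_t ~ P_t
            policy = solve_mdp(theta)              # near-optimal policy for this theta
            next_switch *= 2                       # L <- 2L (doubling schedule)
        action = policy(state)                     # a_t = pi*(x_t, theta_t)
        next_state, reward = environment_step(state, action)
        posterior = update_posterior(posterior, state, action, next_state)
        state = next_state
    return posterior
```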
Accepting recommendations needs deeper consideration than simply predicting the click-through probability of an offer. In this section we examine two acceptance factors, the 'propensity to listen' and 'recommendation fatigue'. The 'propensity to listen' is a byproduct of the passive data solution shown in Section 7. 'Recommendation fatigue' is the problem where people may quickly stop paying attention to recommendations, such as ads, if they are presented too often. The ability of RL algorithms to solve delayed-reward problems gives a natural solution to this fundamental marketing problem. For example, if the decision is whether or not to recommend some product every day, where the final goal is a purchase at some point in time, then RL naturally optimizes the right sending schedule and thus avoids fatigue. In this section we present experimental results for a Point-of-Interest (POI) recommendation system that addresses both the 'propensity to listen' and the 'recommendation fatigue' problems [65].

We experimented with a points of interest domain. For experiments we used the Yahoo! Flickr Creative Commons 100M (YFCC100M) dataset [71], which consists of 100M Flickr photos and videos. This dataset also comprises meta information about the photos, such as the date/time taken, geo-location coordinates, and the accuracy of these geo-location coordinates. The geo-location accuracy ranges from world level (least accurate) to street level (most accurate). We used location sequences that were mapped to POIs near Melbourne, Australia. After preprocessing and removing loops, we had 7246 trajectories and 88 POIs. (The data and pre-processing algorithms are publicly available at https://github.com/arongdari/flickr-photo.)

We trained a PST using the data and performed various experiments to test the ability of our algorithm to quickly optimize the cumulative reward for a given user. We used a small discrete set of θ values and did experiments assuming the true user to be any of those θ. For the reward, we used a signal indicating the frequency/desirability of the POIs, where the frequency was computed from the data. The action space was a recommendation for each POI (88 POIs), plus a null action. All actions but the null action incurred a small cost, expressed as a fraction of the reward. Recommending a POI that was already seen (e.g., in the current suffix) incurred an additional cost. This was done in order to reduce the number of recommendations, i.e., to account for the fatigue factor. We compared DS-PSRL with greedy policies. Greedy policies do not solve the underlying MDP but rather choose the action with maximum immediate reward, which is equivalent to classic Thompson sampling for contextual bandits. PSRL can also be thought of as Thompson sampling for MDPs. We also compared with the optimal solution, which is the one that knows the true model from the beginning. Our experiments are shown in Tables 1 and 2 and Figure 14. DS-PSRL quickly learns to optimize the average reward. At the same time it produces more reward than the greedy approach, while minimizing the fatigue factor.

So far we have considered recommendation systems that consider each user individually, ignoring the collective effects of recommendations. However, ignoring the collective effects could result in diminished utility for the user, for example through overcrowding at the high-value points of interest (POI) considered in Section 8.
In this section we summarize a solution that can optimize for both latent factors and resource constraints in SR systems [45].

To incorporate collective effects in recommendation systems, we extend our model to a multi-agent system with global capacity constraints, representing, for example, the maximum capacity for visitors at a POI. Furthermore, to handle latent factors, we model each user as a partially observable decision problem, where the hidden state factor represents the user's latent interests.
Figure 14: DS-PSRL, denoted with solid lines, learns quickly for different true θ.

An optimal decision policy for this partially observable problem chooses recommendations that find the best possible trade-off between exploration and exploitation. Unfortunately, both global constraints and partial observability make finding the optimal policy intractable in general. However, we show that the structure of this problem can be exploited, through a novel belief-space sampling algorithm which bounds the size of the state space by a limit on the regret incurred from switching from the partially observable model to the most likely fully observable model. We show how to decouple constraint satisfaction from sequential recommendation policies, resulting in algorithms which issue recommendations to thousands of agents while respecting constraints.

While PSRL (Section 7.4) eventually converges to the optimal policy, it will never select actions which are not part of the optimal policy for any MDP θ, even if such an action would immediately reveal the true parameters θ* to the learner. In order to reason about such information-gathering actions, a recommender should explicitly consider the decision-theoretic value of information [27]. To do so, we follow [12] in modeling such a hidden-model MDP as a Mixed-Observability MDP (MOMDP).

The state space of a MOMDP model factors into a fully observable factor x ∈ X and a partially observable factor y ∈ Y, each with their own transition functions, T_X(x′ | x, y, a) and T_Y(y′ | x, y, a, x′). An observation function Ω(o | a, y′) exists to inform the decision maker about transitions of the hidden factor. However, in addition
to the observations, the decision maker also conditions his policy π(t, x, o) on the observable factor x. Given a parametric MDP ⟨Θ, S, A, R̄, T̄, h⟩ over a finite set of user types Θ, for example as generated in Section 7.3, we derive an equivalent MOMDP ⟨X, Y, A, O, T_X, T_Y, R, Ω, h⟩ having elements

X = S, Y = Θ, R(s, θ, a) = R_θ(s, a), T_X(s′ | s, θ, a) = T_θ(s′ | s, a), O = {o_NULL}, Ω(o_NULL | a, θ′) = 1,
T_Y(θ′ | s, θ, a, s′) = 1 if θ = θ′, and 0 otherwise.   (10)

The model uses the latent factor Y to represent the agent's type, selecting the type-specific transition and reward functions based on its instantiation. Because a user's type does not change over the plan horizon, the model is a 'stationary' Mixed-Observability MDP [41]. The observation function Ω is uninformative, meaning that there is no direct way to infer a user's type. Intuitively, this means the recommender can only learn a user's type by observing state transitions.

This gives a recommender model for a single user i, out of a total of n users. To model the global capacities at different points of interest, we employ a consumption function C and a limit vector L defined over m POIs. The consumption of resource type r is defined using a function C_r : S × A → {0, 1}, where 1 indicates that the user is present at r. The limit L_r gives POI r's peak capacity. The optimal (joint) recommender policy satisfies the constraints in expectation, optimizing

max_π E[V^π], subject to E[C^π_{r,t}] ≤ L_r for all t, r.   (11)

For multi-agent problems of reasonable size, directly optimizing this joint policy is infeasible. For such models Column Generation (CG; [23]) has proven to be an effective algorithm [46, 75, 77]. Agent planning problems are decoupled by augmenting the optimality criterion of the (single-agent) planning problem with a Lagrangian term pricing the expected resource consumption cost E[C^{π_i}_{r,t}], i.e.,

arg max_{π_i} ( E[V^{π_i}] − Σ_{t,r} λ_{t,r} E[C^{π_i}_{r,t}] ) for all i.   (12)

This routine is used to compute a new policy π_i to be added to a set Z_i of potential policies of agent i. These sets form the search space of the CG LP, which optimizes the current best joint mix of policies subject to constraints, by solving:

max_{x_{i,j}} Σ_{i=1}^{n} Σ_{π_{i,j} ∈ Z_i} x_{i,j} E[V^{π_{i,j}}],
s.t. Σ_{i=1}^{n} Σ_{π_{i,j} ∈ Z_i} x_{i,j} E[C^{π_{i,j}}_{r,t}] ≤ L_r for all r, t,
Σ_{π_{i,j} ∈ Z_i} x_{i,j} = 1, and x_{i,j} ≥ 0 for all i, j.   (13)

Solving this LP results in: 1) a probability distribution over policies, having agents follow a policy with Pr(π_i = π_{i,j}) = x_{i,j}, and 2) a new set of dual prices λ′_{t,r} to use in the subsequent iteration. This routine stops once λ = λ′, at which point a global optimum is found.
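As an illustration of the master LP in Equation (13), the following sketch uses scipy's linear-programming solver, assuming the candidate policies in each Z_i have been evaluated offline: values[i][j] holds E[V^{π_{i,j}}] and consumption[i][j][(r, t)] holds E[C^{π_{i,j}}_{r,t}]. The data layout and names are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def solve_master_lp(values, consumption, limits, resources, times):
    # Flatten the decision variables x_{i,j} into one vector.
    index = [(i, j) for i in range(len(values)) for j in range(len(values[i]))]
    c = np.array([-values[i][j] for (i, j) in index])          # maximize => negate

    # Resource constraints: sum_{i,j} x_{i,j} E[C_{r,t}] <= L_r for every (r, t).
    A_ub = [[consumption[i][j].get((r, t), 0.0) for (i, j) in index]
            for r in resources for t in times]
    b_ub = [limits[r] for r in resources for t in times]

    # Each agent's policy probabilities must sum to one.
    A_eq = [[1.0 if i == agent else 0.0 for (i, j) in index]
            for agent in range(len(values))]
    b_eq = [1.0] * len(values)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0))
    # res.x gives Pr(pi_i = pi_{i,j}); the duals of the resource constraints
    # (exposed by the HiGHS backend as marginals) play the role of lambda'_{t,r}.
    return res
```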
Unfortunately, in every iteration of column generation, we need to find n optimal policies satisfying Equation (12), which in itself has PSPACE complexity for MOMDPs [48]. Therefore, we propose a heuristic algorithm exploiting the structure of our problems: bounded belief state space planning (Algorithm 8).

To plan for partially observable MDP models it is convenient to reason over belief states [31]. In our case, a belief state b records a probability distribution over the possible types Θ, with b(θ) indicating how likely the agent is to be of type θ. Given a belief state b, the action taken a, and the observation received o, the subsequent belief state b′(θ) can be derived by applying Bayes' theorem. In principle, this belief-state model can be used to compute the optimal policy, but the exponential size of the belief space B prohibits this. Therefore, approximation algorithms generally focus on a subset of the space, B′.

When computing a policy π for a truncated belief space B′ we have to be careful to compute unbiased consumption expectations E[C^π], to guarantee feasibility of the Column Generation solution. This can be achieved if we know the exact expected consumption of the policy at each 'missing' belief point not in B′.
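Because the observation function in Equation (10) is uninformative, the belief over user types is updated only from observed state transitions. A minimal sketch of this Bayes update, with transition[θ] assumed to return T_θ(s′ | s, a), is given below; names are illustrative.

```python
def update_belief(belief, s, a, s_next, transition):
    """Bayes update: b'(theta) is proportional to b(theta) * T_theta(s'|s,a)."""
    unnormalized = {theta: b * transition[theta](s, a, s_next)
                    for theta, b in belief.items()}
    total = sum(unnormalized.values())
    return {theta: w / total for theta, w in unnormalized.items()}
```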
Algorithm 8 Bounded belief state space planning [45]
Given parametric MDP ⟨Θ, S, A, R̄, T̄, h⟩ and approximate belief space B′
  Plan π*_j for all j
  Compute V^{θ_i}_{π*_j} for all i, j
  Create policy π[b]
  for time t = h → 1 do
    for belief point b ∈ B′(t) do
      V[b] = −∞
      for action a ∈ A do
        Q[b, a] = R(b, a)
        for observed next state s′ ∈ S do
          b′ = updateBelief(b, a, s′)
          if b′ ∈ B′ then
            Q[b, a] = Q[b, a] + Pr(s′ | b, a) · V[b′]
          else
            j = arg max_j Q[b′, π*_j]
            π[b′] = π*_j
            Q[b, a] = Q[b, a] + Pr(s′ | b, a) · V̄[b′]
          end if
        end for
        if Q[b, a] > V[b] then
          V[b] = Q[b, a]
          π[b] = a
        end if
      end for
    end for
  end for
  return ⟨π, V[b]⟩

For corners of the belief space, where b(θ_i) = 1 (and b(θ_j) = 0 for i ≠ j), the fact that agent types are stationary ensures that the optimal continuation is the optimal policy for the MDP θ_i. If we use the same policy in a non-corner belief, policy π*_i may instead be applied on a different MDP θ_j, with probability b(θ_j). In general, the expected value of choosing policy π*_i in belief point ⟨t, s, b⟩ is

Q[⟨t, s, b⟩, π*_i] = Σ_{j=1}^{|Θ|} b(θ_j) · V^{θ_j}_{π*_i}[t, s].   (14)

For belief points close to corner i, policy π*_i will be the optimal policy with high probability. If we take care to construct B′ such that truncated points are close to corners, we can limit our search to the optimal policies of each type,

V̄[⟨t, s, b⟩] = max_{θ_i ∈ Θ} Q[⟨t, s, b⟩, π*_i].   (15)

When we apply policy π*_i in a belief point that is not a corner, we incur regret proportional to the amount of value lost from getting the type wrong. Policy π*_i applied to MDP θ_j obtains expected value V^{θ_j}_{π*_i} ≤ V^{θ_j}_{π*_j} by definition of optimality. Thus, the use of policy π*_i in belief point b incurs a regret of

REGRET(b) = min_i REGRET(b, i) = min_i Σ_{j=1}^{|Θ|} b(θ_j) · ( V^{θ_j}_{π*_j} − V^{θ_j}_{π*_i} ).   (16)

This regret function can serve as a scoring rule for belief points worth considering in belief space B′. Let P(b) stand for the probability of belief point b; then we generate all subsequent belief points from the initial belief b_0 that meet a threshold (for hyper-parameters minimum probability p and shape α):

b ∈ B′ if REGRET(b) > ( e^{−α(P(b) − p)} − e^{−α(1 − p)} ) · REGRET(b_0).   (17)

Algorithm 8 starts by computing the optimal MDP policy π*_j for each type θ_j, followed by determining the exact expected values V^{θ_i}_{π*_j} of applying these policies to all different user types θ_i. The remainder of the algorithm computes
expected values at each belief point in the regret-truncated space B′, according to the typical dynamic programming recursion. However, in case of a missing point b′, the best policy π*_j is instead selected (the else branch in Algorithm 8), and the expected value of using this MDP policy is computed according to the belief state. The resulting policy π thus consists of two stages: the maximally valued action stored in π[b] is selected, unless b ∉ B′, at which point MDP policy π*_j replaces π for the remaining steps.

By bounding the exponential growth of the state space, Algorithm 8 trades off solution quality for scalability. To assess this trade-off, we perform an experiment on the POI recommendation problem introduced in Section 8. We compare with the highly scalable PSRL on the one hand, and the state-of-the-art mixed-observability MDP planner SARSOP [36] on the other. We consider a problem consisting of 5 POIs, 3 user types, 50 users, and PST depth 1. For this experiment we measure the quality of the computed policy as the mean over 1,000 simulations per instance, solving several instances per setting of the horizon. We consider two settings: the regular single recommendation case, and a dual recommendation case where the recommender is allowed to give an alternative to the main recommendation, which may provide more opportunities to gather information in each step.
Figure 15: Solution quality and planning time of the different sequential recommendation planners (PSRL, our bounded-regret algorithm, and SARSOP), as a function of the horizon, for the single recommendation and dual recommendations settings.

Figure 15 presents the results. The top row presents the observed mean reward, while the bottom row presents the required planning time in minutes. We observe that for our constrained finite-horizon problems, SARSOP quickly becomes intractable, even when the discount factor is set very low. However, by not optimizing for information value, PSRL obtains significantly lower lifetime value. Our algorithm finds policies which do maximize information value, while at the same time remaining tractable through its effective bounding condition on the state space growth. We note that its runtime stops increasing significantly beyond h = 20, as a result of the bounded growth of the state space.
10 Large Action Spaces
In many real-world recommendation systems the number of actions can be prohibitively large. Netflix, for example, chooses among a few thousand movies when making recommendations. For SR systems the difficulty is even more severe, since the search space grows exponentially with the planning horizon. In this section we show how to learn action embeddings for action generalization. Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori. We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions. These representations improve generalization over large, finite action sets by allowing the agent to infer the outcomes of actions similar to actions already taken. We provide an algorithm to both learn and use action representations and provide conditions for its convergence. The efficacy of the proposed method is demonstrated on large-scale real-world problems [13].
Figure 16: (Left) The structure of the proposed overall policy, π_o, consisting of f and π_i, that learns action representations to generalize over large action sets. (Right) Illustration of the probability induced for three actions by the probability density of π_i(e | s) on a one-dimensional embedding space. The x-axis represents the embedding, e, and the y-axis represents the probability. The colored regions represent the mapping a = f(e), where each color is associated with a specific action.

The benefit of capturing the structure in the underlying state space of MDPs is a well-understood and widely used concept in RL. State representations allow the policy to generalize across states. Similarly, there often exists additional structure in the space of actions that can be leveraged. We hypothesize that exploiting this structure can enable quick generalization across actions, thereby making learning with large action sets feasible. To bridge the gap, we introduce an action representation space,
E ⊆ R^d, and consider a factorized policy, π_o, parameterized by an embedding-to-action mapping function, f : E → A, and an internal policy, π_i : S × E → [0, 1], such that the distribution of A_t given S_t is characterized by:

E_t ∼ π_i(· | S_t),   A_t = f(E_t).   (18)

Here, π_i is used to sample E_t ∈ E, and the function f deterministically maps this representation to an action in the set A. Both these components together form an overall policy, π_o. Figure 16 (Right) illustrates the probability of each action under such a parameterization. With a slight abuse of notation, we use f^{-1}(a) as a one-to-many function that denotes the set of representations that are mapped to the action a by the function f, i.e., f^{-1}(a) := {e ∈ E : f(e) = a}.
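Executing the factorized policy of Equation (18) can be sketched as follows, assuming internal_policy(state) returns the mean and standard deviation of a Gaussian over the embedding space and f maps an embedding to a discrete action, for example by nearest neighbour among learned action embeddings. Both choices are illustrative assumptions, not the exact parameterization used in [13].

```python
import numpy as np

def sample_action(state, internal_policy, f):
    mean, std = internal_policy(state)        # pi_i(. | s): a density over E
    embedding = np.random.normal(mean, std)   # E_t ~ pi_i(. | S_t)
    return f(embedding), embedding            # A_t = f(E_t)

def nearest_action(embedding, action_embeddings):
    """One possible f: pick the action whose learned representation is closest."""
    distances = np.linalg.norm(action_embeddings - embedding, axis=1)
    return int(np.argmin(distances))
```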
In the following sections we discuss the existence of an optimal policy π*_o and the learning procedure for π_o. To elucidate the steps involved, we split the discussion into four parts. First, we show that there exist f and π_i such that π_o is an optimal policy. Then we present the supervised learning process for the function f when π_i is fixed. Next we give the policy gradient learning process for π_i when f is fixed. Finally, we combine these methods to learn f and π_i simultaneously.

π_i and f to Represent an Optimal Policy

In this section, we aim to establish a condition under which π_o can represent an optimal policy. Consequently, we then define the optimal set of π_o and π_i using the proposed parameterization. To establish the main results we begin with the necessary assumptions.

The characteristics of the actions can be naturally associated with how they influence state transitions. In order to learn a representation for actions that captures this structure, we consider a standard Markov property, often used for learning probabilistic graphical models [22], and make the following assumption, namely that the transition information can be sufficiently encoded to infer the action that was executed.

Assumption 10.1. Given an embedding E_t, A_t is conditionally independent of S_t and S_{t+1}:

P(A_t | S_t, S_{t+1}) = ∫_E P(A_t | E_t = e) P(E_t = e | S_t, S_{t+1}) de.

Assumption 10.2.
Given the embedding E_t, the action A_t is deterministic and is represented by a function f : E → A, i.e., ∃ a such that P(A_t = a | E_t = e) = 1.

We now establish a necessary condition under which our proposed policy can represent an optimal policy. This condition will also be useful later when deriving learning rules.
Lemma 10.3.
Under Assumptions (10.1)–(10.2), which define a function f, for all π, there exists a π_i such that

v^π(s) = Σ_{a ∈ A} ∫_{f^{-1}(a)} π_i(e | s) q^π(s, a) de.   (19)

The proof is available in [13]. Following Lemma (10.3), we use π_i and f to define the overall policy as

π_o(a | s) := ∫_{f^{-1}(a)} π_i(e | s) de.   (20)

Theorem 10.4.
Under Assumptions (10.1)–(10.2), which define a function f, there exists an overall policy, π_o, such that v^{π_o} = v*.

Proof. This follows directly from Lemma 10.3. Because the state and action sets are finite, the rewards are bounded, and γ ∈ [0, 1), there exists at least one optimal policy. For any optimal policy π*, the corresponding state-value and state-action-value functions are the unique v* and q*, respectively. By Lemma 10.3 there exist f and π_i such that

v*(s) = Σ_{a ∈ A} ∫_{f^{-1}(a)} π_i(e | s) q*(s, a) de.   (21)

Therefore, there exist π_i and f such that the resulting π_o has the state-value function v^{π_o} = v*, and hence it represents an optimal policy.

Note that Theorem 10.4 establishes existence of an optimal overall policy based on equivalence of the state-value function, but does not ensure that all optimal policies can be represented by an overall policy. Using (21), we define Π*_o := {π_o : v^{π_o} = v*}. Correspondingly, we define the set of optimal internal policies as Π*_i := {π_i : ∃ π*_o ∈ Π*_o, ∃ f, π*_o(a | s) = ∫_{f^{-1}(a)} π_i(e | s) de}.

f for a Fixed π_i

Theorem 10.4 shows that there exist π_i and a function f, which helps in predicting the action responsible for the transition from S_t to S_{t+1}, such that the corresponding overall policy is optimal. However, such a function, f, may not be known a priori. In this section, we present a method to estimate f using data collected from interactions with the environment.

By Assumptions (10.1)–(10.2), P(A_t | S_t, S_{t+1}) can be written in terms of f and P(E_t | S_t, S_{t+1}). We propose searching for an estimator, f̂, of f and an estimator, ĝ(E_t | S_t, S_{t+1}), of P(E_t | S_t, S_{t+1}) such that a reconstruction of P(A_t | S_t, S_{t+1}) is accurate. Let this estimate of P(A_t | S_t, S_{t+1}) based on f̂ and ĝ be

P̂(A_t | S_t, S_{t+1}) = ∫_E f̂(A_t | E_t = e) ĝ(E_t = e | S_t, S_{t+1}) de.   (22)

One way to measure the difference between P(A_t | S_t, S_{t+1}) and P̂(A_t | S_t, S_{t+1}) is the expected (over states coming from the on-policy distribution) Kullback-Leibler (KL) divergence

KL(P(A_t | S_t, S_{t+1}) || P̂(A_t | S_t, S_{t+1})) = −E[ Σ_{a ∈ A} P(a | S_t, S_{t+1}) ln( P̂(a | S_t, S_{t+1}) / P(a | S_t, S_{t+1}) ) ]   (23)
= −E[ ln( P̂(A_t | S_t, S_{t+1}) / P(A_t | S_t, S_{t+1}) ) ].   (24)

Since the observed transition tuples, (S_t, A_t, S_{t+1}), contain the action responsible for the given S_t to S_{t+1} transition, an on-policy sample estimate of the KL divergence can be computed readily using (24). We adopt the following loss function based on the KL divergence between P(A_t | S_t, S_{t+1}) and P̂(A_t | S_t, S_{t+1}):

L(f̂, ĝ) = −E[ ln( P̂(A_t | S_t, S_{t+1}) ) ],   (25)

where the denominator in (24) is not included in (25) because it does not depend on f̂ or ĝ. If f̂ and ĝ are parameterized, their parameters can be learned by minimizing the loss function, L, using a supervised learning procedure.
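A minimal sketch of this supervised step is given below, assuming a deterministic estimator ĝ (a linear map from the concatenated state features to an embedding) and a softmax form for f̂ over learnable action embeddings. Plain numpy keeps the example self-contained; the names, shapes, and architecture are illustrative rather than the exact networks used in [13].

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def action_nll(action, state_vec, next_state_vec, W_g, action_embeddings):
    """Per-transition loss -ln P_hat(A_t | S_t, S_t+1) from Equation (25)."""
    # g-hat: map the (s, s') pair to a single embedding (a point estimate).
    e = W_g @ np.concatenate([state_vec, next_state_vec])
    # f-hat: probability of each action given the embedding (dot-product softmax).
    probs = softmax(action_embeddings @ e)
    return -np.log(probs[action] + 1e-12)
```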
Figure 17: (a) Given a state transition tuple, functions g and f are used to estimate the action taken. The red arrow denotes the gradients of the supervised loss (25) for learning the parameters of these functions. (b) During execution, an internal policy, π_i, can be used to first select an action representation, e. The function f, obtained from the previous learning procedure, then transforms this representation to an action. The blue arrow represents the internal policy gradients (27) obtained using Lemma 10.5 to update π_i.

A computational graph for this model is shown in Figure 17. Note that, while f̂ will be used for f in an overall policy, ĝ is only used to find f̂, and will not serve an additional purpose.

As this supervised learning process only requires estimating P(A_t | S_t, S_{t+1}), it does not require (or depend on) the rewards. This partially mitigates the problems due to sparse and stochastic rewards, since an alternative informative supervised signal is always available. This is advantageous for making the action representation component of the overall policy learn quickly and with low variance updates.

π_i for a Fixed f

A common method for learning a policy parameterized with weights θ is to optimize the discounted start-state objective function, J(θ) := Σ_{s ∈ S} d_0(s) v^π(s). For a policy with weights θ, the expected performance of the policy can be improved by ascending the policy gradient, ∂J(θ)/∂θ.

Let the state-value function associated with the internal policy, π_i, be v^{π_i}(s) = E[Σ_{t=0}^{∞} γ^t R_t | s, π_i, f], and the state-action value function q^{π_i}(s, e) = E[Σ_{t=0}^{∞} γ^t R_t | s, e, π_i, f]. We then define the performance function for π_i as:

J_i(θ) := Σ_{s ∈ S} d_0(s) v^{π_i}(s).   (26)

Viewing the embeddings as the actions for the agent with policy π_i, the policy gradient theorem [61] states that the unbiased [67] gradient of (26) is

∂J_i(θ)/∂θ = Σ_{t=0}^{∞} E[ γ^t ∫_E q^{π_i}(S_t, e) (∂/∂θ) π_i(e | S_t) de ],   (27)

where the expectation is over states from d^π, as defined in [61] (which is not a true distribution, since it is not normalized). The parameters of the internal policy can be learned by iteratively updating its parameters in the direction of ∂J_i(θ)/∂θ. Since there are no special constraints on the policy π_i, any policy gradient algorithm designed for continuous control, like DPG [56], PPO [54], or NAC [8], can be used out-of-the-box.

However, note that the performance function associated with the overall policy, π_o (consisting of the function f and the internal policy parameterized with weights θ), is:

J_o(θ, f) = Σ_{s ∈ S} d_0(s) v^{π_o}(s).   (28)

The ultimate requirement is the improvement of this overall performance function, J_o(θ, f), and not just J_i(θ). So, how useful is it to update the internal policy, π_i, by following the gradient of its own performance function? The following lemma answers this question.

Lemma 10.5.
For all deterministic functions, f, which map each point, e ∈ R^d, in the representation space to an action, a ∈ A, the expected updates to θ based on ∂J_i(θ)/∂θ are equivalent to updates based on ∂J_o(θ, f)/∂θ. That is,

∂J_o(θ, f)/∂θ = ∂J_i(θ)/∂θ.
Algorithm 1: Policy Gradient with Representations for Action (PG-RA)

  Initialize action representations
  for episode = 0, 1, ... do
    for t = 0, 1, ... do
      Sample action embedding, E_t, from π_i(· | S_t)
      A_t = f̂(E_t)
      Execute A_t and observe S_{t+1}, R_t
      Update π_i using any policy gradient algorithm
      Update critic (if any) to minimize TD error
      Update f̂ and ĝ to minimize L defined in (25)
    end for
  end for

The proof is available in [13]. The chosen parameterization for the policy has this special property, which allows π_i to be learned using its internal policy gradient. Since this gradient update does not require computing the value of any π_o(a | s) explicitly, the potentially intractable computation of f^{-1} in (20) required for π_o can be avoided. Instead, ∂J_i(θ)/∂θ can be used directly to update the parameters of the internal policy while still optimizing the overall policy's performance, J_o(θ, f).

π_i and f Simultaneously
Since the supervised learning procedure for f does not require rewards, a few initial trajectories can contain enough information to begin learning a useful action representation. As more data becomes available it can be used for fine-tuning and improving the action representations.

We call our algorithm policy gradients with representations for actions (PG-RA). PG-RA first initializes the parameters in the action representation component by sampling a few trajectories using a random policy and using the supervised loss defined in (25). If additional information is known about the actions, as assumed in prior work [17], it can also be considered when initializing the action representations. Optionally, once these action representations are initialized, they can be kept fixed.

In Algorithm 1, the loop body illustrates the online update procedure for all of the parameters involved. Each time step in the episode is represented by t. For each step, an action representation is sampled and is then mapped to an action by f̂. Having executed this action in the environment, the observed reward is then used to update the internal policy, π_i, using any policy gradient algorithm. Depending on the policy gradient algorithm, if a critic is used then semi-gradients of the TD-error are used to update the parameters of the critic. In other cases, like in REINFORCE [76] where there is no critic, this step can be ignored. The observed transition is then used to update the parameters of f̂ and ĝ so as to minimize the supervised learning loss (25). In our experiments, this last update is a stochastic gradient step.

If the action representations are held fixed while learning the internal policy, then as a consequence of Lemma 10.5, convergence of our algorithm directly follows from previous two-timescale results [10, 8]. Here we show that learning both π_i and f simultaneously using our PG-RA algorithm can also be shown to converge by using a three-timescale analysis.

Similar to prior work [8, 16, 34], for the analysis of the updates to the parameters, θ ∈ R^{d_θ}, of the internal policy, π_i, we use a projection operator Γ : R^{d_θ} → R^{d_θ} that projects any x ∈ R^{d_θ} to a compact set C ⊂ R^{d_θ}. We then define an associated vector field operator, Γ̂, that projects any gradients leading outside the compact region, C, back to C. Practically, however, we do not project the iterates to a constraint region as they are seen to remain bounded (without projection). Formally, we make the following assumptions.

Assumption 10.6.
For any state action-representation pair (s, e), the internal policy, π_i(e | s), is continuously differentiable in the parameter θ.

Assumption 10.7.
The updates to the parameters, θ ∈ R^{d_θ}, of the internal policy, π_i, include a projection operator Γ : R^{d_θ} → R^{d_θ} that projects any x ∈ R^{d_θ} to a compact set C = {x | c_i(x) ≤ 0, i = 1, ..., n} ⊂ R^{d_θ}, where c_i(·), i = 1, ..., n are real-valued, continuously differentiable functions on R^{d_θ} that represent the constraints specifying
the compact region. For each x on the boundary of C, the gradients of the active c_i are considered to be linearly independent.

Assumption 10.8.
The iterates ω_t and φ_t satisfy sup_t(‖ω_t‖) < ∞ and sup_t(‖φ_t‖) < ∞.

Theorem 10.9.
Under Assumptions (10.1)–(10.8), the internal policy parameters θ_t converge to Ẑ = {x ∈ C | Γ̂(∂J_i(x)/∂θ) = 0} as t → ∞, with probability one.

Proof. (Outline) We consider three learning rate sequences, such that the update recursion for the internal policy is on the slowest timescale, the critic's update recursion is on the fastest, and the action representation module's is on an intermediate timescale. With this construction, we leverage the three-timescale analysis technique [9] and prove convergence. The complete proof is available in [13].

We evaluate our proposed algorithms on the following domains.
Maze:
As a proof-of-concept, we constructed a continuous-state maze environment where the state comprises the coordinates of the agent's current location. The agent has n equally spaced actuators around it (each actuator moves the agent in the direction the actuator is pointing towards), and it can choose whether each actuator should be on or off. Therefore, the size of the action set is exponential in the number of actuators, that is, |A| = 2^n. The net outcome of an action is the vectorial summation of the displacements associated with the selected actuators. The agent receives a small penalty for each time step, and a large reward is given upon reaching the goal position. To make the problem more challenging, random noise was added to the action a fraction of the time, and the maximum episode length was capped.

This environment is a useful test bed as it requires solving a long horizon task in an MDP with a large action set and a single goal reward. Further, we know the Cartesian representation for each of the actions, and can thereby use it to visualize the learned representation, as shown in Figure 18.
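A minimal sketch of the maze's combinatorial action set is shown below, assuming n equally spaced actuators with a fixed step size; each of the 2^n actions is a subset of actuators, and its net displacement is the vector sum of the selected actuators' directions. The step size and representation are illustrative.

```python
import itertools
import numpy as np

def maze_actions(n_actuators, step=0.05):
    """Map each of the 2^n actuator combinations to its net 2-D displacement."""
    angles = 2 * np.pi * np.arange(n_actuators) / n_actuators
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    actions = {}
    for bits in itertools.product([0, 1], repeat=n_actuators):
        selected = [d for b, d in zip(bits, directions) if b]
        actions[bits] = step * np.sum(selected, axis=0) if selected else np.zeros(2)
    return actions

print(len(maze_actions(4)))   # 4 actuators give 2^4 = 16 actions
```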
Real-world recommender systems: We consider two real-world applications of recommender systems that require decision making over multiple time steps.

The first is a web-based video-tutorial platform, which has a recommendation engine that suggests a series of tutorial videos on various software. The aim is to meaningfully engage the users in learning how to use these software products and to convert novice users into experts in their respective areas of interest. The tutorial suggestion at each time step is made from a large pool of available tutorial videos on several software products.

The second application is a professional multi-media editing software. Modern multimedia editing software often contains many tools that can be used to manipulate the media, and this wealth of options can be overwhelming for users. In this domain, an agent suggests which of the available tools the user may want to use next. The objective is to increase user productivity and assist users in achieving their end goal.

For both of these applications, an existing log of users' click stream data was used to create an n-gram based MDP model for user behavior [55]. In the tutorial recommendation task, user activity over a three month period was observed, and sequences of user visits were aggregated to obtain over a million clicks. Similarly, over a month-long duration, sequential usage patterns of the tools in the multi-media editing software were collected, yielding over a billion user clicks in total. Tutorials and tools with fewer than a threshold number of clicks were discarded. The remaining tutorials and tools for the web-based tutorial platform and the multi-media software, respectively, were used to create the action set for the MDP model. The MDP had a continuous state space, where each state consisted of the feature descriptors associated with each item (tutorial or tool) in the current n-gram. Rewards were chosen based on a surrogate measure for the difficulty level of tutorials and the popularity of final outcomes of user visits in the multi-media editing software, respectively. Since such data is sparse, only a small fraction of the items had rewards associated with them, and the maximum reward for any item was bounded.

Often the problem of recommendation is formulated as a contextual bandit or collaborative filtering problem, but as shown in [64] these approaches fail to capture the long-term value of the prediction. Solving this problem for a longer time horizon with a large number of actions (tutorials/tools) makes this real-life problem a useful and challenging domain for RL algorithms.
Figure 18: (a) The maze environment. The star denotes the goal state, the red dot corresponds to the agent, and the arrows around it are the actuators. Each action corresponds to a unique combination of these actuators; therefore, in total 2^n actions are possible. (b) 2-D representations of the displacements in the Cartesian coordinates caused by each action, and (c) learned action embeddings. In both (b) and (c), each action is colored based on the displacement (Δx, Δy) it produces, that is, with the color [R = Δx, G = Δy, B fixed], where Δx and Δy are normalized to [0, 1] before coloring. Cartesian actions are plotted on coordinates (Δx, Δy), and learned ones are plotted on the coordinates in the embedding space. A smoother color transition of the learned representation is better, as it corresponds to preservation of the relative underlying structure. The 'squashing' of the learned embeddings is an artifact of a non-linearity applied to bound their range.

To understand the internal workings of our proposed algorithm, we present visualizations of the learned action representations on the maze domain. A pictorial illustration of the environment is provided in Figure 18. Here, the underlying structure in the set of actions is related to the displacements in the Cartesian coordinates. This provides an intuitive base case against which we can compare our results.

In Figure 18, we provide a comparison between the action representations learned using our algorithm and the underlying Cartesian representation of the actions. It can be seen that the proposed method extracts useful structure in the action space. Actions which correspond to settings where actuators on opposite sides of the agent are selected result in relatively small displacements; these are the ones in the center of the plot. In contrast, maximum displacement in any direction is caused by selecting only the actuators facing in that particular direction; the corresponding actions are at the edge of the representation space. The smooth color transition indicates that not only the magnitude of displacement but also its direction is represented. Therefore, the learned representations efficiently preserve the relative transition information among all the actions. To make the exploration step tractable in the internal policy, π_i, we bound the representation space along each dimension to the range [−1, 1] using a tanh non-linearity. This results in the 'squashing' of these representations around the edge of this range.
Performance Improvement
The plots in Figure 19 for the Maze domain show how the performance of the standard actor-critic (AC) method deteriorates as the number of actions increases, even though the goal remains the same. However, with the addition of an action representation module, it is able to capture the underlying structure in the action space and consistently performs well across all settings. Similarly, for both the tutorial and the software MDPs, standard AC methods fail to reason over longer time horizons under such an overwhelming number of actions, choosing mostly one-step actions that have high returns. In comparison, instances of our proposed algorithm not only achieve significantly higher returns in the respective tasks, but they also do so much quicker. These results reinforce our claim that learning action representations allows implicit generalization of feedback to other actions embedded in proximity to the executed action.

Further, under the PG-RA algorithm, only a fraction of all parameters, the ones in the internal policy, are learned using the high-variance policy gradient updates. The other set of parameters, associated with action representations, is learned by a supervised learning procedure. This reduces the variance of updates significantly, thereby making the PG-RA algorithms learn a better policy faster. This is evident from the plots in Figure 19. These advantages allow the internal policy, π_i, to quickly approximate an optimal policy without succumbing to the curse of large action sets.
EPTEMBER
17, 2020Figure 19: (Top) Results on the Maze domain with , , and actions respectively. (Bottom) Results on a) TutorialMDP b) Software MDP. AC-RA and DPG-RA are the variants of PG-RA algorithm that uses actor-critic (AC) andDPG, respectively. The shaded regions correspond to one standard deviation and were obtained using trials.
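To make the decomposition used by PG-RA concrete, the following is a minimal sketch of the two components discussed above: an internal policy that proposes a point in a bounded embedding space, and a table of action representations that turns that proposal into a discrete action. The linear-Gaussian internal policy, the nearest-neighbor mapping, and all sizes below are illustrative assumptions, not the implementation from [13].

    import numpy as np

    class PGRASketch:
        """Illustrative PG-RA-style decomposition: a small internal policy acts in a
        d-dimensional embedding space, and a table of action representations maps
        its proposal to one of the discrete actions."""

        def __init__(self, state_dim, num_actions, embed_dim=2, seed=0):
            self.rng = np.random.default_rng(seed)
            # Action representations; in PG-RA these are fit by a supervised
            # procedure on observed transitions, not by policy gradients.
            self.action_embeddings = self.rng.normal(size=(num_actions, embed_dim))
            # Internal policy parameters: a linear map squashed by tanh so the
            # proposal stays in a bounded box, as described above.
            self.W = self.rng.normal(scale=0.1, size=(embed_dim, state_dim))
            self.noise_scale = 0.1

        def act(self, state):
            proposal = np.tanh(self.W @ state)              # point in [-1, 1]^d
            proposal = proposal + self.noise_scale * self.rng.standard_normal(proposal.shape)
            # Illustrative mapping choice: the nearest action representation wins.
            dists = np.linalg.norm(self.action_embeddings - proposal, axis=1)
            return int(np.argmin(dists)), proposal

Only the internal-policy parameters (here, W) would be updated with high-variance policy-gradient estimates; the action representations would be refit with a supervised loss on observed transitions, which is what keeps the overall update variance low.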
11 Dynamic Actions
Besides the large number of actions, in many real-world sequential decision making problems the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. In this section, we present an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown structure in the space of actions, and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems [14].
MDPs, the standard formalization of decision making problems, are not flexible enough to encompass lifelong learning problems wherein the action set size changes over time. In this section we extend the standard MDP framework to model this setting.

In real-world problems where the set of possible actions changes, there is often underlying structure in the set of all possible actions (those that are available, and those that may become available). For example, tutorial videos can be described by feature vectors that encode their topic, difficulty, length, and other attributes; in robot control tasks, primitive locomotion actions like left, right, up, and down could be encoded by their change to the Cartesian coordinates of the robot; etc. Critically, we will not assume that the agent knows this structure, merely that it exists. If actions are viewed from this perspective, then the set of all possible actions (those that are available at one point in time, and those that might become available at any time in the future) can be viewed as a vector space,
$\mathcal{E} \subseteq \mathbb{R}^d$.

To formalize the lifelong MDP, we first introduce the necessary variables that govern when and how new actions are added. We denote the episode number using $\tau$. Let $I_\tau \in \{0, 1\}$ be a random variable that indicates whether a new set of actions is added or not at the start of episode $\tau$, and let the frequency $\mathcal{F}: \mathbb{N} \to [0, 1]$ be the associated probability distribution over episode count, such that $\Pr(I_\tau = 1) = \mathcal{F}(\tau)$. Let $U_\tau \in 2^{\mathcal{E}}$ be the random variable corresponding to the set of actions that is added before the start of episode $\tau$. When $I_\tau = 1$, we assume that $U_\tau \neq \emptyset$, and when $I_\tau = 0$, we assume that $U_\tau = \emptyset$. Let $\mathcal{D}_\tau$ be the distribution of $U_\tau$ when $I_\tau = 1$, i.e., $U_\tau \sim \mathcal{D}_\tau$ if $I_\tau = 1$. We use $\mathcal{D}$ to denote the set $\{\mathcal{D}_\tau\}$ consisting of these distributions. Such a formulation using $I_\tau$ and $\mathcal{D}_\tau$ provides fine control over when and how new actions can be incorporated. This allows modeling a large class of problems where both the distribution over the type of incorporated actions and the intervals between successive changes might be irregular. Often we will not require the exact episode number $\tau$ but instead require $k$, which denotes the number of times the action set has changed.

Figure 20: Illustration of a lifelong MDP where $\mathcal{M}_0$ is the base MDP. For every change $k$, $\mathcal{M}_k$ builds upon $\mathcal{M}_{k-1}$ by including the newly available set of actions $\mathcal{A}_k$. The internal structure in the space of actions is hidden and only a set of discrete actions is observed.

Since we do not assume that the agent knows the structure associated with the actions, we instead provide actions to the agent as a set of discrete entities, $\mathcal{A}_k$. To this end, we define $\phi$ to be a map relating the underlying structure of the new actions to the observed set of discrete actions $\mathcal{A}_k$ for all $k$, i.e., if the set of actions added is $u_k$, then $\mathcal{A}_k = \{\phi(e_i) \mid e_i \in u_k\}$. Naturally, for most problems of interest, neither the underlying structure $\mathcal{E}$, nor the set of distributions $\mathcal{D}$, nor the frequency of updates $\mathcal{F}$, nor the relation $\phi$ is known; the agent only has access to the observed set of discrete actions.

We now define the lifelong Markov decision process (L-MDP) as $\mathcal{L} = (\mathcal{M}_0, \mathcal{E}, \mathcal{D}, \mathcal{F})$, which extends a base MDP $\mathcal{M}_0 = (\mathcal{S}, \mathcal{A}_0, P, R, \gamma, d_0)$. $\mathcal{S}$ is the set of all possible states that the agent can be in, called the state set. $\mathcal{A}_0$ is the discrete set of actions available to the agent, and for $\mathcal{M}_0$ we define this set to be empty, i.e., $\mathcal{A}_0 = \emptyset$. When the set of available actions changes and the agent observes a new set of discrete actions, $\mathcal{A}_k$, then $\mathcal{M}_{k-1}$ transitions to $\mathcal{M}_k$, such that $\mathcal{A}$ in $\mathcal{M}_k$ is the set union of $\mathcal{A}$ in $\mathcal{M}_{k-1}$ and $\mathcal{A}_k$. Apart from the available actions, other aspects of the L-MDP remain the same throughout. An illustration of the framework is provided in Figure 20. We use $S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, and $R_t \in \mathbb{R}$ as random variables denoting the state, action, and reward at time $t \in \{0, 1, \dots\}$ within each episode. The first state, $S_0$, comes from an initial distribution, $d_0$, and the reward function $R$ is defined to be dependent only on the state, such that $R(s) = \mathbb{E}[R_t \mid S_t = s]$ for all $s \in \mathcal{S}$. We assume that $R_t \in [-R_{\max}, R_{\max}]$ for some finite $R_{\max}$. The reward discounting parameter is given by $\gamma \in [0, 1)$. $P$ is the state transition function, such that for all $s, a, s', t$, the function $P(s, a, s')$ denotes the transition probability $P(s' \mid s, e)$, where $a = \phi(e)$.

In the most general case, new actions could be completely arbitrary and have no relation to the ones seen before. In such cases, there is very little hope of lifelong learning by leveraging past experience. To make the problem more feasible, we resort to a notion of smoothness between actions. Formally, we assume that transition probabilities in an L-MDP are $\rho$-Lipschitz in the structure of actions, i.e., there exists a $\rho > 0$ such that

\[ \forall s, s', e_i, e_j \qquad \| P(s' \mid s, e_i) - P(s' \mid s, e_j) \| \le \rho \| e_i - e_j \|. \tag{29} \]

For any given MDP $\mathcal{M}_k$ in $\mathcal{L}$, an agent's goal is to find a policy, $\pi_k$, that maximizes the expected sum of discounted future rewards. For any policy $\pi_k$, the corresponding state value function is $v^{\pi_k}(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R_t \mid s, \pi_k]$. Finding an optimal policy when the set of possible actions is large is difficult due to the curse of dimensionality.
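The arrival process defined by $I_\tau$, $\mathcal{F}$, $\mathcal{D}_\tau$, and $\phi$ can be written down directly as a small simulator loop. The sketch below is purely illustrative: it assumes a constant Bernoulli arrival probability for $\mathcal{F}$, a unit Gaussian for every $\mathcal{D}_\tau$, and consecutive integer ids for the discrete actions produced by $\phi$; none of these choices is prescribed by the framework.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 2                                   # dimension of the hidden structure E in R^d

    def arrival_probability(tau, p=0.05):
        """Illustrative F: Pr(I_tau = 1) is a constant p for every episode tau."""
        return p

    def sample_new_actions(tau, batch_size=5):
        """Illustrative D_tau: latent vectors of the newly added actions."""
        return rng.normal(size=(batch_size, d))

    latent_actions = np.empty((0, d))       # hidden structure behind the available actions
    observed_action_ids = []                # what the agent actually sees (images under phi)

    for tau in range(1, 1001):
        if rng.random() < arrival_probability(tau):        # I_tau = 1
            new_latents = sample_new_actions(tau)           # U_tau ~ D_tau
            latent_actions = np.vstack([latent_actions, new_latents])
            # phi assigns an opaque discrete id to every new latent action.
            observed_action_ids = list(range(len(latent_actions)))
        # ... run episode tau of the current MDP using only observed_action_ids ...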
In the L-MDP setting, the difficulty of finding an optimal policy might appear to be exacerbated, as an agent must additionally adapt to the changing levels of possible performance as new actions become available. This raises the natural question: as new actions become available, how much does the performance of an optimal policy change?
If it fluctuates significantly, can a lifelong learning agent succeed by continuously adapting its policy, or is it better to learn from scratch with every change to the action set?

To answer this question, consider an optimal policy, $\pi^*_k$, for MDP $\mathcal{M}_k$, i.e., an optimal policy when considering only policies that use actions that are available during the $k$th episode. We now quantify how sub-optimal $\pi^*_k$ is relative to the performance of a hypothetical policy, $\mu^*$, that acts optimally given access to all possible actions. (For notational ease, (a) we overload the symbol $P$ to represent both probability mass and density, and (b) we assume that the state set is finite; our primary results, however, extend to MDPs with continuous states.)
Theorem 11.1.
In an L-MDP, let $\epsilon_k$ denote the maximum distance in the underlying structure of the closest pair of available actions, i.e., $\epsilon_k := \sup_{a_i \in \mathcal{A}} \inf_{a_j \in \mathcal{A}} \| e_i - e_j \|$; then

\[ v^{\mu^*}(s) - v^{\pi^*_k}(s) \le \frac{\gamma \rho \epsilon_k}{(1 - \gamma)^2} R_{\max}. \tag{30} \]

The proof is available in [14]. With a bound on the maximum possible sub-optimality, Theorem 11.1 presents an important connection between achievable performances, the nature of the underlying structure in the action space, and a property of the available actions in any given $\mathcal{M}_k$.
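To get a feel for the scale of the bound in (30), the quick calculation below plugs in purely illustrative values for $\gamma$, $\rho$, $\epsilon_k$, and $R_{\max}$; none of these numbers come from the experiments in this paper.

    # Illustrative evaluation of the Theorem 11.1 bound; all values are assumed.
    gamma, rho, r_max = 0.9, 0.5, 1.0
    for eps_k in (0.2, 0.02):
        bound = gamma * rho * eps_k * r_max / (1.0 - gamma) ** 2
        print(eps_k, bound)     # 0.2 -> 9.0, 0.02 -> 0.9
    # The bound shrinks linearly in eps_k: as newly added actions cover the
    # underlying structure more densely, the worst-case sub-optimality decreases.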
Using this, we can make the following conclusion.

Corollary 11.2.
Let $\mathcal{Y} \subseteq \mathcal{E}$ be the smallest closed set such that $P(U_k \subseteq \mathcal{Y}) = 1$. We refer to $\mathcal{Y}$ as the element-wise support of $U_k$. If, for all $k$, the element-wise support of $U_k$ in an L-MDP is $\mathcal{E}$, then as $k \to \infty$ the sub-optimality vanishes. That is,

\[ \lim_{k \to \infty} v^{\mu^*}(s) - v^{\pi^*_k}(s) \to 0. \]

Through Corollary 11.2, we can now establish that the change in optimal performance will eventually converge to zero as new actions are repeatedly added. An intuitive way to observe this result is to notice that every new action that becomes available indirectly provides more information about the underlying, unknown structure of $\mathcal{E}$. However, in the limit, as the size of the available action set increases, the information provided by each new action vanishes and thus performance saturates.

Certainly, in practice, we can never have $k \to \infty$, but this result is still advantageous. Even when the underlying structure $\mathcal{E}$, the set of distributions $\mathcal{D}$, the change frequency $\mathcal{F}$, and the mapping relation $\phi$ are all unknown, it establishes the fact that the difference between the best performances after successive changes will remain bounded and will not fluctuate arbitrarily. This opens up new possibilities for developing algorithms that do not need to start from scratch after new actions are added, but rather can build upon their past experiences using updates to their existing policies that efficiently leverage estimates of the structure of $\mathcal{E}$ to adapt to new actions.

Theorem 11.1 characterizes what can be achieved in principle; however, it does not specify how to achieve it, i.e., how to find $\pi^*_k$ efficiently. Using any parameterized policy, $\pi$, which acts directly in the space of observed actions suffers from one key practical drawback in the L-MDP setting: the parameterization is deeply coupled with the number of actions that are available. Not only is the meaning of each parameter coupled with the number of actions, but often the number of parameters the policy has depends on the number of possible actions. This makes it unclear how the policy should be adapted when additional actions become available. A trivial solution would be to ignore the newly available actions and continue using only the previously available actions. However, this is clearly myopic, and will prevent the agent from achieving the better long-term returns that might be possible using the new actions.

To address this parameterization problem, instead of having the policy, $\pi$, act directly in the observed action space, $\mathcal{A}$, we propose an approach wherein the agent reasons about the underlying structure of the problem in a way that makes its policy parameterization invariant to the number of actions that are available. To do so, we split the policy parameterization into two components. The first component corresponds to the state-conditional policy responsible for making the decisions, $\beta: \mathcal{S} \times \hat{\mathcal{E}} \to [0, 1]$, where $\hat{\mathcal{E}} \subseteq \mathbb{R}^d$. The second component corresponds to $\hat{\phi}: \hat{\mathcal{E}} \times \mathcal{A} \to [0, 1]$, an estimator of the relation $\phi$, which is used to map the output of $\beta$ to an action in the set of available actions. That is, an $E_t \in \hat{\mathcal{E}}$ is sampled from $\beta(S_t, \cdot)$ and then $\hat{\phi}(E_t)$ is used to obtain the action $A_t$. Together, $\beta$ and $\hat{\phi}$ form a complete policy, and $\hat{\mathcal{E}}$ corresponds to the inferred structure in action space.

One of the prime benefits of estimating $\phi$ with $\hat{\phi}$ is that it makes the parameterization of $\beta$ invariant to the cardinality of the action set: changing the number of available actions does not require changing the number of parameters of $\beta$.
Instead, only the parameterization of $\hat{\phi}$, the estimator of the underlying structure in action space, must be modified when new actions become available. We show next that the update to the parameters of $\hat{\phi}$ can be performed using supervised learning methods that are independent of the reward signal and thus typically more efficient than RL methods.

While our proposed parameterization of the policy using both $\beta$ and $\hat{\phi}$ has the advantages described above, the performance of $\beta$ is now constrained by the quality of $\hat{\phi}$, as in the end $\hat{\phi}$ is responsible for selecting an action from $\mathcal{A}$. Ideally we want $\hat{\phi}$ to be such that it lets $\beta$ be both (a) invariant to the cardinality of the action set, for practical reasons, and (b) as expressive as a policy, $\pi$, explicitly parameterized for the currently available actions. Similar trade-offs have been considered in the context of learning optimal state embeddings for representing sub-goals in hierarchical RL [43]. For our lifelong learning setting, we build upon their method to efficiently estimate $\hat{\phi}$ in a way that provides bounded sub-optimality. Specifically, we make use of an additional inverse dynamics function, $\varphi$, that takes as input two states, $s$ and $s'$, and produces as output a prediction of which $e \in \mathcal{E}$ caused the transition from $s$ to $s'$. Since the agent does not know $\phi$, when it observes a transition from $s$ to $s'$ via action $a$, it does not know which $e$ caused this transition. So, we cannot train $\varphi$ to make good predictions using the actual action, $e$, that caused the transition. Instead, we use $\hat{\phi}$ to transform the prediction of $\varphi$ from $e \in \mathcal{E}$ to $a \in \mathcal{A}$, and train both $\varphi$ and $\hat{\phi}$ so that this process accurately predicts which action, $a$, caused the transition from $s$ to $s'$. Moreover, rather than viewing $\varphi$ as a deterministic function mapping states $s$ and $s'$ to predictions $e$, we define $\varphi$ to be a distribution over $\mathcal{E}$ given two states, $s$ and $s'$.

For any given $\mathcal{M}_k$ in an L-MDP $\mathcal{L}$, let $\beta_k$ and $\hat{\phi}_k$ denote the two components of the overall policy and let $\pi^{**}_k$ denote the best overall policy that can be represented using some fixed $\hat{\phi}_k$. The following theorem bounds the sub-optimality of $\pi^{**}_k$.

Theorem 11.3.
For an L-MDP $\mathcal{M}_k$, if there exist a $\varphi: \mathcal{S} \times \mathcal{S} \times \hat{\mathcal{E}} \to [0, 1]$ and a $\hat{\phi}_k: \hat{\mathcal{E}} \times \mathcal{A} \to [0, 1]$ such that

\[ \sup_{s \in \mathcal{S}, a \in \mathcal{A}} \mathrm{KL}\big( P(S_{t+1} \mid S_t = s, A_t = a) \,\|\, P(S_{t+1} \mid S_t = s, A_t = \hat{A}) \big) \le \delta_k^2 / 2, \tag{31} \]

where $\hat{A} \sim \hat{\phi}_k(\cdot \mid \hat{E})$ and $\hat{E} \sim \varphi(\cdot \mid S_t, S_{t+1})$, then

\[ v^{\mu^*}(s) - v^{\pi^{**}_k}(s) \le \frac{\gamma (\rho \epsilon_k + \delta_k)}{(1 - \gamma)^2} R_{\max}. \]

See [14] for the proof. By quantifying the impact $\hat{\phi}$ has on the sub-optimality of achievable performance, Theorem 11.3 provides the necessary constraints for estimating $\hat{\phi}$. At a high level, Equation (31) ensures that $\hat{\phi}$ is such that it can be used to generate an action corresponding to any $s$ to $s'$ transition. This allows $\beta$ to leverage $\hat{\phi}$ and choose the required action that induces the state transition needed for maximizing performance. Thereby, following (31), sub-optimality would be minimized if $\hat{\phi}$ and $\varphi$ are optimized to reduce the supremum of the KL divergence over all $s$ and $a$. In practice, however, the agent does not have access to all possible states; rather, it has access to a limited set of samples collected from interactions with the environment. Therefore, instead of the supremum, we propose minimizing the average over all $s$ and $a$ from a set of observed transitions,

\[ \mathcal{L}(\hat{\phi}, \varphi) := \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_k} P(s, a)\, \mathrm{KL}\big( P(s' \mid s, a) \,\|\, P(s' \mid s, \hat{a}) \big). \tag{32} \]

Equation (32) suggests that $\mathcal{L}(\hat{\phi}, \varphi)$ would be minimized when $\hat{a}$ equals $a$, but using (32) directly in its current form is inefficient, as it requires computing the KL over all probable $s' \in \mathcal{S}$ for a given $s$ and $a$. To make it practical, we make use of the following property.
Property 11.4.
For some constant $C$, $-\mathcal{L}(\hat{\phi}, \varphi)$ is lower bounded by

\[ \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_k} \sum_{s' \in \mathcal{S}} P(s, a, s') \left( \mathbb{E}\big[ \log \hat{\phi}(\hat{a} \mid \hat{e}) \,\big|\, \hat{e} \sim \varphi(\cdot \mid s, s') \big] - \mathrm{KL}\big( \varphi(\hat{e} \mid s, s') \,\|\, P(\hat{e} \mid s, s') \big) \right) + C. \tag{33} \]

As minimizing $\mathcal{L}(\hat{\phi}, \varphi)$ is equivalent to maximizing $-\mathcal{L}(\hat{\phi}, \varphi)$, we consider maximizing the lower bound obtained from Property 11.4. In this form, it is now practical to optimize (33) using only the observed $(s, a, s')$ samples. As this form is similar to the objective of a variational auto-encoder, the inner expectation can be efficiently optimized using the reparameterization trick [33]. $P(\hat{e} \mid s, s')$ is the prior on $\hat{e}$, and we treat it as a hyper-parameter that allows the KL to be computed in closed form.

Importantly, note that this optimization procedure only requires individual transitions, $(s, a, s')$, and is independent of the reward signal. Hence, at its core, it is a supervised learning procedure. This means that learning good parameters for $\hat{\phi}$ tends to require far fewer samples than optimizing $\beta$ (which is an RL problem). This is beneficial for our approach because $\hat{\phi}$, the component of the policy where new parameters need to be added when new actions become available, can be updated efficiently. As both $\beta$ and $\varphi$ are invariant to action cardinality, they do not require new parameters when new actions become available. Additional implementation-level details are available in Appendix F.
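The objective in (33) can be optimized with standard variational auto-encoder machinery. The following is a minimal PyTorch-style sketch, assuming a Gaussian $\varphi(\hat{e} \mid s, s')$ estimated with the reparameterization trick and a categorical $\hat{\phi}(a \mid \hat{e})$ over the currently available actions; the layer sizes, the unit-Gaussian prior, and the single-sample estimate are illustrative choices rather than the exact architecture used in [14].

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InverseDynamics(nn.Module):
        """phi(e_hat | s, s'): a diagonal Gaussian over the inferred representation."""
        def __init__(self, state_dim, embed_dim, hidden=64):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, embed_dim)
            self.log_var = nn.Linear(hidden, embed_dim)

        def forward(self, s, s_next):
            h = self.trunk(torch.cat([s, s_next], dim=-1))
            return self.mu(h), self.log_var(h)

    class ActionDecoder(nn.Module):
        """phi_hat(a | e_hat): categorical over the available actions. Only this
        head needs to grow when new actions arrive."""
        def __init__(self, embed_dim, num_actions):
            super().__init__()
            self.head = nn.Linear(embed_dim, num_actions)

        def forward(self, e_hat):
            return self.head(e_hat)                     # logits over actions

    def adaptation_loss(phi, phi_hat, s, a, s_next):
        """Negative of the lower bound in (33) for a batch of (s, a, s') samples;
        `a` is a tensor of integer action indices."""
        mu, log_var = phi(s, s_next)
        e_hat = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization
        recon = F.cross_entropy(phi_hat(e_hat), a)                     # -E[log phi_hat(a | e_hat)]
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
        return recon + kl   # a Lagrangian weight on the KL term can be added, as discussed later

Only the final linear head of the action decoder has to grow when new actions become available, which is exactly the property exploited during the adaptation phase described below.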
Figure 21: An illustration of a typical performance curve for a lifelong learning agent. The point (a) corresponds to the performance of the current policy in $\mathcal{M}_k$. The point (b) corresponds to the performance drop resulting as a consequence of adding new actions. We call the phase between (a) and (b) the adaptation phase, which aims at minimizing this drop when adapting to the new set of actions. The point (c) corresponds to the improved performance in $\mathcal{M}_{k+1}$ obtained by optimizing the policy to leverage the new set of available actions. $\mu^*$ represents the best performance of the hypothetical policy which has access to the entire structure in the action space.

When a new set of actions, $\mathcal{A}_{k+1}$, becomes available, the agent should leverage the existing knowledge and quickly adapt to the new action set. Therefore, at every change in $\mathcal{M}_k$, the ongoing best components of the policy, $\beta^*_{k-1}$ and $\hat{\phi}^*_{k-1}$, in $\mathcal{M}_{k-1}$ are carried over, i.e., $\beta_k := \beta^*_{k-1}$ and $\hat{\phi}_k := \hat{\phi}^*_{k-1}$. For lifelong learning, the following property illustrates a way to organize the learning procedure so as to minimize the sub-optimality in each $\mathcal{M}_k$, for all $k$.

Property 11.5. (Lifelong Adaptation and Improvement)
In an L-MDP, let $\Delta$ denote the difference in performance between $v^{\mu^*}$ and the best achievable using our policy parameterization; then the overall sub-optimality can be expressed as

\[ v^{\mu^*}(s) - v^{\beta_0 \hat{\phi}_0}_{\mathcal{M}_0}(s) = \underbrace{\sum_{k=1}^{\infty} \Big( v^{\beta_k \hat{\phi}^*_k}_{\mathcal{M}_k}(s) - v^{\beta_k \hat{\phi}_k}_{\mathcal{M}_k}(s) \Big)}_{\text{Adaptation}} + \underbrace{\sum_{k=1}^{\infty} \Big( v^{\beta^*_k \hat{\phi}^*_k}_{\mathcal{M}_k}(s) - v^{\beta_k \hat{\phi}^*_k}_{\mathcal{M}_k}(s) \Big)}_{\text{Policy Improvement}} + \Delta, \tag{34} \]

where $\mathcal{M}_k$ is used in the subscript to emphasize the respective MDP in $\mathcal{L}$.

Property 11.5 illustrates a way to understand the impact of $\beta$ and $\hat{\phi}$ by splitting the learning process into an adaptation phase and a policy improvement phase. These two iterative phases are the crux of our algorithm for solving an L-MDP $\mathcal{L}$. Based on this principle, we call our algorithm LAICA: lifelong adaptation and improvement for changing actions. Whenever new actions become available, adaptation is prone to cause a performance drop, as the agent has no information about when to use the new actions, and so its initial uses of the new actions may be at inappropriate times. Following Property 11.4, we update $\hat{\phi}$ so as to efficiently infer the underlying structure and minimize this drop. That is, for every $\mathcal{M}_k$, $\hat{\phi}_k$ is first adapted to $\hat{\phi}^*_k$ in the adaptation phase by adding more parameters for the new set of actions and then optimizing (33). After that, $\hat{\phi}^*_k$ is fixed and $\beta_k$ is improved towards $\beta^*_k$ in the policy improvement phase, by updating the parameters of $\beta_k$ using the policy gradient theorem [61]. These two procedures are performed sequentially whenever $\mathcal{M}_{k-1}$ transitions to $\mathcal{M}_k$, for all $k$, in an L-MDP $\mathcal{L}$. An illustration of the procedure is presented in Figure 21.

A step-by-step pseudo-code for the LAICA algorithm is available in Algorithm 2. The crux of the algorithm is the iterative adapt-and-improve procedure obtained from Property 11.5. We begin by initializing the parameters for $\beta^*_0$, $\hat{\phi}^*_0$ and $\varphi^*_0$. At every change in the set of available actions, instead of re-initializing from scratch, the previous best estimates for $\beta$, $\hat{\phi}$ and $\varphi$ are carried forward to build upon existing knowledge. As $\beta$ and $\varphi$ are invariant to the cardinality of the available set of actions, no new parameters are required for them; new parameters are added only to $\hat{\phi}$ to deal with the new set of available actions. To minimize the adaptation drop, we make use of Property 11.4. Let $\mathcal{L}_{lb}$ denote the lower bound for $\mathcal{L}$, such that

\[ \mathcal{L}_{lb}(\hat{\phi}, \varphi) := \mathbb{E}\Big[ \log \hat{\phi}(\hat{A}_t \mid \hat{E}_t) \,\Big|\, \varphi(\hat{E}_t \mid S_t, S_{t+1}) \Big] - \lambda\, \mathrm{KL}\Big( \varphi(\hat{E}_t \mid S_t, S_{t+1}) \,\Big\|\, P(\hat{E}_t \mid S_t, S_{t+1}) \Big). \]

Note that, following the literature on variational auto-encoders, we have generalized (33) to use a Lagrangian $\lambda$ to weight the importance of the KL-divergence penalty [26]. When $\lambda = 1$, it degenerates to (33). We set the prior $P(\hat{e} \mid s, s')$ to be an isotropic normal distribution, which also allows the KL to be computed in closed form [33]. In the adaptation phase of Algorithm 2, random actions from the available set of actions are executed and their corresponding transitions are collected in a buffer. Samples from this buffer are then used to maximize the lower-bound objective $\mathcal{L}_{lb}$ and adapt the parameters of $\hat{\phi}$ and $\varphi$. The optimized $\hat{\phi}^*$ is then kept fixed during policy improvement. The policy improvement phase corresponds to the standard policy gradient approach for improving the performance of a policy. In our case, the policy $\beta$ first outputs a vector $\hat{e}$ which gets mapped by $\hat{\phi}^*$ to an action. The observed transition is then used to compute the policy gradient [61] for updating the parameters of $\beta$ towards $\beta^*$. If a critic is used for computing the policy gradients, then it is also subsequently updated by minimizing the TD error [60]. This iterative process of adaptation and policy improvement continues for every change in the action set size.

Algorithm 2:
Lifelong Adaptation and Improvement for Changing Actions (LAICA)

    Initialize β*_0, φ̂*_0, ϕ*_0
    for change k = 1, 2, ... do
        β_k ← β*_{k-1};  ϕ_k ← ϕ*_{k-1};  φ̂_k ← φ̂*_{k-1}          ▷ Reuse past knowledge
        Add parameters in φ̂_k for the new actions
        Buffer B = {}
        for episode = 0, 1, ... do                                  ▷ Adapt φ̂_k to φ̂*_k
            for t = 0, 1, ... do
                Execute a random a_t and observe s_{t+1}
                Add the transition to B
        for iteration = 0, 1, ... do
            Sample a batch b ~ B
            Update φ̂_k and ϕ_k by maximizing L_lb(φ̂_k, ϕ_k) on b
        for episode = 0, 1, ... do                                  ▷ Improve β_k to β*_k
            for t = 0, 1, ... do
                Sample ê_t ~ β_k(·|s_t)
                Map ê_t to an action a_t using φ̂*_k
                Execute a_t and observe s_{t+1}, r_t
                Update β_k using any policy gradient algorithm
                Update the critic by minimizing the TD error

In this section, we aim to empirically compare the following methods:

• Baseline(1): The policy is re-initialized and the agent learns from scratch after every change.

• Baseline(2): New parameters corresponding to new actions are added/stacked to the existing policy (and previously learned parameters are carried forward as-is).

• LAICA(1): The proposed approach that leverages the structure in the action space. To act in the continuous space of inferred structure, we use DPG [56] to optimize β.
• LAICA(2): A variant of LAICA which uses an actor-critic [60] to optimize β.

Figure 22: Lifelong learning experiments with a changing set of actions in the recommender system domains. The learning curves correspond to the running mean of the best performing setting for each of the algorithms. The shaded regions correspond to standard error obtained over multiple trials. Vertical dotted bars indicate when the set of actions was changed.

To demonstrate the effectiveness of our proposed method(s) on lifelong learning problems, we consider a maze environment and two domains corresponding to real-world applications, all with a large set of changing actions. For each of these domains, the total number of actions was randomly split into five equal sets. Initially, the agent only had the actions available in the first set, and after every change the next set of actions was made available additionally. In the following paragraphs we briefly outline the domains.

Case Study: Real-World Recommender Systems.
We consider the following two real-world applications of large-scale recommender systems that require decision making over multiple time steps and where the number of possible decisions varies over the lifetime of the system.

• A web-based video-tutorial platform that has a recommendation engine to suggest a series of tutorial videos. The aim is to meaningfully engage the users in a learning activity. A large catalog of tutorials was considered for recommendation.

• A professional multi-media editing software, where sequences of tools inside the software need to be recommended. The aim is to increase user productivity and assist users in quickly achieving their end goal. A large set of tools was considered for recommendation.

For both of these applications, an existing log of users' click-stream data was used to create an n-gram based MDP model for user behavior [55]. Sequences of user interaction were aggregated to obtain millions of clicks for the tutorial recommendation task and on the order of a billion user clicks for the tool recommendation task. The MDP had a continuous state space, where each state consisted of the feature descriptors associated with each item (tutorial or tool) in the current n-gram.
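For concreteness, the snippet below shows one simple way to realize such an n-gram state: concatenate the feature descriptors of the last n items the user interacted with. The window length, the toy feature vectors, and the zero-padding at the start of a session are illustrative assumptions, not details of the production systems described above.

    import numpy as np
    from collections import deque

    def make_state(recent_items, item_features, n=3):
        """State = concatenated feature descriptors of the last n clicked items,
        left-padded with zeros at the start of a session (illustrative)."""
        dim = next(iter(item_features.values())).shape[0]
        window = list(recent_items)[-n:]
        padded = [np.zeros(dim)] * (n - len(window)) + [item_features[i] for i in window]
        return np.concatenate(padded)

    # Usage: stream a session's click log through a rolling window.
    item_features = {"t1": np.array([0.1, 0.9]),
                     "t2": np.array([0.7, 0.2]),
                     "t3": np.array([0.4, 0.4])}
    session = deque(maxlen=3)
    for click in ["t1", "t2", "t3"]:
        session.append(click)
        state = make_state(session, item_features)      # feeds the recommendation policy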
The plots in Figure 22 present the evaluations on the domains considered. The advantage of LAICA over Baseline(1) can be attributed to its policy parameterization. The decision making component of the policy, β, being invariant to the action cardinality, can be readily leveraged after every change without having to be re-initialized. This demonstrates that efficiently re-using past knowledge can improve data efficiency over the approach that learns from scratch every time.

Compared to Baseline(2), which also does not start from scratch and reuses the existing policy, we notice that the variants of the LAICA algorithm still perform favorably. As evident from the plots in Figure 22, while Baseline(2) does a good job of preserving the existing policy, it fails to efficiently capture the benefit of new actions. While the policy parameters in both LAICA and Baseline(2) are improved using policy gradients, the superior performance of LAICA can be attributed to the adaptation procedure incorporated in LAICA, which aims at efficiently inferring the underlying structure in the space of actions. Overall, LAICA(2) performs almost twice as well as both baselines on all of the tasks considered.

Note that even before the first addition of the new set of actions, the proposed method performs better than the baselines. This can be attributed to the fact that the proposed method efficiently leverages the underlying structure in the action set and thus learns faster. Similar observations have been made previously [17, 25, 4, 62].
12 Cognitive Bias Aware
While SRs are beginning to find their way into academia and industry, other recommendation technologies such as collaborative filtering [6] and contextual bandits [39] have been the mainstream methods. There are hundreds of papers every year in top-tier machine learning conferences advancing the state of the art in collaborative filtering and bandit technologies. Nonetheless, none of these systems truly understands people; rather, they naively optimize expected-utility metrics such as click-through rate. People do not perceive expected values in a rational fashion [74]. In fact, people have evolved with various cognitive biases to simplify information processing. Cognitive biases are systematic patterns of deviation from norm or rationality in judgment. They are rules of thumb that help people make sense of the world and reach decisions with relative speed. Some of these biases include the perception of risk, collective effects, and long-term decision making. In this final section we argue that the next generation of recommendation systems needs to incorporate human cognitive biases. We would like to build recommendation and personalization systems that are aware of these biases, and we would like to do it in a fashion that is win-win for both the marketer and the consumer. Such technology is not studied in academia and does not exist in the industry [63].

It has become a well-established result that humans do not reason by maximizing their expected utility, and yet most work in artificial intelligence and machine learning continues to be based on idealized mathematical models of decisions [40, 19, 44]. There are many studies showing that, given two choices that are logically equivalent, people will prefer one to the other based on how the information is presented to them, even if the choices made violate the principle of maximizing expected utility [3]. The Allais paradox is a classic example that demonstrates this type of inconsistency in the choices people make when presented with two different gambling strategies. It was found that people often preferred a gambling strategy with a lower expected utility but more certain positive gains over a strategy whose expected utility is higher at the cost of more uncertainty. Furthermore, a number of biases have been explored in the context of marketing and eCommerce that influence the manner in which consumers make purchasing decisions. An example of such a bias is the decoy effect, which refers to the phenomenon whereby consumers flip their preference for one item over another when presented with a third item with certain characteristics. Consider also how fatigue and novelty bias can be incorporated into a movie or book recommendation system. While a recommendation system may have identified a user's preference for a particular genre of movies or books, novelty bias, which is a well-studied phenomenon in behavioral psychology, suggests that novelty in options might actually yield uplift in system results if modeled appropriately.
Next we list different types of biases that have a direct impact on decision making and that we envision will be explicitly modelled in future personalization systems.
Biases related to loss, risk, and ambiguity are some of the most well studied from a modeling perspective and are relevant to any eCommerce recommendation. These biases could be incorporated by modeling the degree of certainty in experiencing satisfaction from the purchase of a product, or alternatively by modeling the risk of regret associated with the purchase. In [74], specific curves describing the degree of loss aversion are derived for gambling examples. From a behavioral science perspective these biases can be described as follows:

• Loss (Regret) Aversion: Motivated by the tendency to avoid the possibility of experiencing regret after making a choice with a negative outcome, loss aversion refers to the asymmetry between the affinity and the aversion to the same amount of gain and loss, respectively. In other words, this bias refers to the phenomenon whereby a person has a higher preference towards not losing something than towards winning that same thing; for example, losing a given amount of money results in a greater loss of satisfaction than the gain in satisfaction caused by winning that same amount.

• Risk Aversion: This refers to the tendency to prefer a certain positive outcome over a risky one, despite the chance of the latter being more profitable (in expectation) than the certain one.

• Ambiguity Aversion: This phenomenon in decision making refers to a general preference for known risks rather than unknown ones. The difference between risk and ambiguity aversion is subtle: ambiguity aversion refers to the aversion to not having information about the degree of risk involved. [5, 19]
Another set of biases we believe are important to model relates to how the value of recommended items is likely to be perceived in the context of other items or recommendations. These comparative differences are important when recommendations are considered in sequence or when collectively surfaced, such as multiple ads on the same web page. These include:

• Contrast Effect: An object is perceived differently when shown in isolation than when shown in contrast to other objects. For example, an object at a given price might seem inexpensive next to a much more expensive object, but expensive next to a much cheaper one. [51]

• Decoy Effect: This effect is common in consumer and marketing scenarios, whereby a consumer's preference between two items reverses when they are presented with a third option that is clearly inferior to one of the two items and only partially inferior to the second. The decoy effect can be viewed as a special case of the contrast effect described above.

• Distinction Bias: This refers to the situation where two items are viewed as more distinct (from each other) when viewed together than when viewed separately.

• Framing Effect: This refers to the effect on decision making of the manner in which something is presented to a user; that is, the same information presented in different ways can lead to different decision outcomes.
Finally, we consider biases that might arise in different parts of the decision making process over time. These biases will all play a role in optimizing future recommendation systems for long-term user value. Some examples include:

• Choice Supportive Bias: Also referred to as post-purchase rationalization, this bias refers to the tendency to remember past choices as justifiable and, in some cases, to retroactively ascribe a positive sentiment to them. [42]

• Anchoring (Confirmation) Bias: Paying more attention to information that supports one's opinions while ignoring or downplaying other information. This type of bias includes resorting to preconceived opinions when encountering something new.

• Hyperbolic Discounting Effect: This bias refers to the tendency to prefer immediate payoffs over those that occur later in time. As an example related to our application of recommendation systems, a consumer's preference to receive an object sooner rather than at a later time (for instance due to shipping time) can have a direct impact on item preference. (A short comparison of hyperbolic and exponential discounting is sketched after this list.)

• Decision Fatigue: Although not an explicit cognitive bias, decision fatigue is a phenomenon worth exploring, as it refers to the manner in which decision making deviates from the expected norm as a result of fatigue induced by long periods of decision making.

• Selective Perception: This refers to the tendency for expectation to bias one's perception of certain phenomena. [15]
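To make the contrast with standard RL objectives concrete, the snippet below compares the exponential discounting used when maximizing expected return with a simple one-parameter hyperbolic discount curve; the parameter values are illustrative only.

    # Exponential (standard RL) vs. hyperbolic discounting of a reward delayed by t steps.
    def exponential_discount(t, gamma=0.95):
        return gamma ** t

    def hyperbolic_discount(t, k=0.1):
        # Common one-parameter hyperbolic form: value falls off as 1 / (1 + k * t).
        return 1.0 / (1.0 + k * t)

    for t in (0, 5, 20, 100):
        print(t, round(exponential_discount(t), 3), round(hyperbolic_discount(t), 3))
    # Hyperbolic discounting decays faster at short delays but keeps far-future
    # rewards comparatively valuable, matching the bias described above.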
In this section, we have argued that the next generation of personalization systems should be designed by explicitly modeling people's cognitive biases. Such a redesign will require developing algorithms that have a more realistic model of how humans perceive the world and will be able to exploit human deviations from perfect rationality.
Figure 23: A piece-wise non-linear model of human perception of gains and losses. Algorithms that optimize expectation view a gain of $100 and a loss of $100 as equal, but humans do not.

Recent academic papers have shown that it is possible to model human biases by incorporating behavioral science frameworks, such as prospect theory [32, 74], into reinforcement learning algorithms [52]. Traditional work in reinforcement learning is based on maximizing long-term expected utility. Prospect theory will require redesigning reinforcement learning models to reflect human nonlinear processing of probability estimates. These models incorporate the cognitive bias of loss aversion using a theory [32, 74] that models perceived loss as asymmetric to gain, as illustrated in Figure 23. This captures a human perspective where a potential gain of $100 is less preferable than avoiding a potential loss of $100. Algorithms that simply maximize expectation would treat these two outcomes equally.

We envision that future personalization algorithms will incorporate such models for a wide variety of human cognitive biases, adjusting the steepness of the curves and the value of the inflection points accordingly for each person based on their overall character, their immediate context, and the history of their interactions with the system. Unlike contextual bandits, which cannot differentiate between a 'visit' and a 'user', the next generation of personalization systems will consider a sequence of recent events and interactions with the system when making a recommendation, and thus be able to incorporate surprise and novelty into the recommendation sequence according to the user's modelled profile. We also argue that the cognitive bias model will be useful in deciding how best to present the recommendation to the user. For example, a person with a high familiarity bias might only want to invest in a stock that she knows, and be less likely to keep a more diverse portfolio. A recommendation system that had an explicit model of this bias could present the diversified portfolio in a way that makes it appear more familiar, for example by mentioning the user's friends who had similar diversified portfolios.

We expect that these next-generation personalization algorithms may initially require a high density of data, but that this dependence on data may be ameliorated as we move beyond modelling solely based on click streams and exploit other data sources available from sensor-rich environments. In particular, we envision curated experiences such as visits to theme parks, cruise ships, or hotel chains with loyalty programs. In such environments rich data is available: activity preferences, shop purchases, facility utilization, and user queries could all contribute to training a sufficiently effective model.

Designing an SR system that understands how people actually reason has a huge potential to retain users and far more effectively market products than a system that mathematically optimizes some convenient but non-human-like optimization function. Understanding how people actually make decisions will help us match them with products that will truly make them happy and keep them engaged in using the system for the long term.
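The asymmetry illustrated in Figure 23 can be captured with a prospect-theory-style value transform. The sketch below uses the functional form from [32, 74] with commonly cited coefficient values; the specific numbers and the idea of applying the transform to recommendation outcomes are illustrative assumptions, not a component of an existing system.

    def prospect_value(outcome, alpha=0.88, beta=0.88, lam=2.25):
        """Prospect-theory-style value function: concave for gains, convex for
        losses, and steeper for losses (loss aversion, since lam > 1). The
        coefficients are illustrative defaults, not fitted to any user."""
        if outcome >= 0:
            return outcome ** alpha
        return -lam * ((-outcome) ** beta)

    # A $100 gain and a $100 loss are no longer treated symmetrically:
    print(prospect_value(100.0))     # ~57.5
    print(prospect_value(-100.0))    # ~-129.5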
13 Summary and Conclusions
In this paper we demonstrated, through various real-world use-cases, the advantages of strategic recommendation systems and their implementation using RL. To make RL practical, we solved many fundamental challenges, such as evaluating policies offline with high confidence, safe deployment, non-stationarity, building systems from passive data that do not contain past recommendations, resource constraint optimization in multi-user systems, and scaling to large and dynamic action spaces. Finally, we presented ideas for what we believe will be the next generation of SRs, which would truly understand people by incorporating human cognitive biases.
References

[1] Abbasi-Yadkori, Y., Szepesvári, C.: Bayesian optimal control of smoothly parameterized systems. In: UAI, pp. 1–11 (2015)
[2] Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control (6), 716–723 (1974). DOI 10.1109/tac.1974.1100705. URL http://dx.doi.org/10.1109/tac.1974.1100705
[3] Allais, M.: Le comportement de l'homme rationnel devant le risque: Critique des postulats et axiomes de l'école américaine. Econometrica (4), 503–546 (1953)
[4] Bajpai, A.N., Garg, S., et al.: Transfer of deep reactive policies for MDP planning. In: Advances in Neural Information Processing Systems, pp. 10965–10975 (2018)
[5] Baron, J.: Normative, descriptive and prescriptive responses. Behavioral and Brain Sciences (1), 32–42 (1994). DOI 10.1017/S0140525X0003329X
[6] Bell, R., Koren, Y., Volinsky, C.: Matrix factorization techniques for recommender systems. IEEE Computer, 30–37 (2009)
[7] Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society (1), 289–300 (1995)
[8] Bhatnagar, S., Sutton, R.S., Ghavamzadeh, M., Lee, M.: Natural actor–critic algorithms. Automatica (11), 2471–2482 (2009)
[9] Borkar, V.S.: Stochastic approximation: a dynamical systems viewpoint, vol. 48. Springer (2009)
[10] Borkar, V.S., Konda, V.R.: The actor-critic algorithm as multi-time-scale stochastic approximation. Sadhana (4), 525–543 (1997)
[11] Breiman, L.: Random forests. Mach. Learn. (1), 5–32 (2001). DOI 10.1023/A:1010933404324. URL http://dx.doi.org/10.1023/A:1010933404324
[12] Chadès, I., Carwardine, J., Martin, T.G., Nicol, S., Sabbadin, R., Buffet, O.: MOMDPs: A solution for modelling adaptive management problems. In: Proceedings of the 26th AAAI Conference on Artificial Intelligence, pp. 267–273 (2012)
[13] Chandak, Y., Theocharous, G., Kostas, J., Jordan, S.M., Thomas, P.S.: Learning action representations for reinforcement learning. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 941–950 (2019). URL http://proceedings.mlr.press/v97/chandak19a.html
[14] Chandak, Y., Theocharous, G., Nota, C., Thomas, P.S.: Lifelong learning with a changing action set. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), abs/1906.01770 (2020). URL http://arxiv.org/abs/1906.01770
[15] Chandler, D., Munday, R.: A Dictionary of Media and Communication, 1 edn. Oxford University Press (2011)
[16] Degris, T., White, M., Sutton, R.S.: Off-policy actor-critic. arXiv preprint arXiv:1205.4839 (2012)
[17] Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., Coppin, B.: Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015)
[18] Efron, B.: Better bootstrap confidence intervals. Journal of the American Statistical Association (397), 171–185 (1987)
[19] Ellsberg, D.: Risk, ambiguity, and the Savage axioms. The Quarterly Journal of Economics, pp. 643–669 (1961)
[20] Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 503–556 (2005)
[21] Gabadinho, A., Ritschard, G.: Analyzing state sequences with probabilistic suffix trees: The PST R package. Journal of Statistical Software, 1–39 (2016). DOI 10.18637/jss.v072.i03
[22] Ghahramani, Z.: An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence (01), 9–42 (2001)
[23] Gilmore, P.C., Gomory, R.E.: A linear programming approach to the cutting-stock problem. Operations Research (6), 849–859 (1961)
[24] Hansen, N.: The CMA evolution strategy: a comparing review. In: J. Lozano, P. Larranaga, I. Inza, E. Bengoetxea (eds.) Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms, pp. 75–102. Springer (2006)
[25] He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., Ostendorf, M.: Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636 (2015)
[26] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations, vol. 3 (2017)
[27] Howard, R.A.: Information value theory. IEEE Transactions on Systems Science and Cybernetics (1), 22–26 (1966). DOI 10.1109/TSSC.1966.300074
[28] Hyndman, R.J., Khandakar, Y.: Automatic time series forecasting: the forecast package for R. Journal of Statistical Software (3), 1–22 (2008). URL http://ideas.repec.org/a/jss/jstsof/27i03.html
[29] Ie, E., Jain, V., Wang, J., Narvekar, S., Agarwal, R., Wu, R., Cheng, H., Lustman, M., Gatto, V., Covington, P., McFadden, J., Chandra, T., Boutilier, C.: Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. CoRR abs/1905.12767 (2019). URL http://arxiv.org/abs/1905.12767
[30] Jonker, J., Piersma, N., den Poel, D.V.: Joint optimization of customer segmentation and marketing policy to maximize long-term profitability. Expert Systems with Applications (2), 159–168 (2004)
[31] Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial Intelligence (1-2), 99–134 (1998). DOI 10.1016/S0004-3702(98)00023-X
[32] Kahneman, D., Tversky, A.: Prospect theory: An analysis of decision under risk. Econometrica (2), 263–291 (1979)
[33] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
[34] Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)
[35] Konidaris, G.D., Osentoski, S., Thomas, P.S.: Value function approximation in reinforcement learning using the Fourier basis. In: Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, pp. 380–395 (2011)
[36] Kurniawati, H., Hsu, D., Lee, W.S.: SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: Robotics: Science and Systems (2008). DOI 10.15607/RSS.2008.IV.009
[37] Lagoudakis, M., Parr, R.: Model-free least-squares policy iteration. In: Neural Information Processing Systems: Natural and Synthetic, pp. 1547–1554 (2001)
[38] Li, L., Chu, W., Langford, J., Schapire, R.: A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web, pp. 661–670 (2010). DOI 10.1145/1772690.1772758. URL http://doi.acm.org/10.1145/1772690.1772758
[39] Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web (WWW '10), pp. 661–670. ACM, Raleigh, North Carolina, USA (2010)
[40] Machina, M.J.: Decision-making in the presence of risk. Science (4801), 537–543 (1987)
[41] Martin, P., Becker, K.H., Bartlett, P., Chadès, I.: Fast-tracking stationary MOMDPs for adaptive management problems. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 4531–4537 (2017)
[42] Mather, M., Johnson, M.K.: Choice-supportive source monitoring: Do our decisions seem better to us as we age? Psychology and Aging (4), 596–606 (2000)
[43] Nachum, O., Gu, S., Lee, H., Levine, S.: Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257 (2018)
[44] von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior. Princeton University Press (1944)
[45] de Nijs, F., Theocharous, G., Vlassis, N., de Weerdt, M.M., Spaan, M.T.J.: Capacity-aware sequential recommendations. In: Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018, pp. 416–424 (2018). URL http://dl.acm.org/citation.cfm?id=3237448
[46] de Nijs, F., Walraven, E., de Weerdt, M.M., Spaan, M.T.J.: Bounding the probability of resource constraint violations in multi-agent MDPs. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 3562–3568. AAAI Press (2017)
[47] Osband, I., Russo, D., Van Roy, B.: (More) efficient reinforcement learning via posterior sampling. In: C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 26, pp. 3003–3011. Curran Associates, Inc. (2013). URL http://papers.nips.cc/paper/5185-more-efficient-reinforcement-learning-via-posterior-sampling.pdf
[48] Papadimitriou, C.H., Tsitsiklis, J.N.: The complexity of Markov decision processes. Mathematics of Operations Research (3), 441–450 (1987). DOI 10.1287/moor.12.3.441
[49] Pednault, E., Abe, N., Zadrozny, B.: Sequential cost-sensitive decision making with reinforcement learning. In: Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, pp. 259–268 (2002). DOI 10.1145/775047.775086. URL http://doi.acm.org/10.1145/775047.775086
[50] Pfeifer, P.E., Carraway, R.L.: Modeling customer relationships as Markov chains. Journal of Interactive Marketing, pp. 43–55 (2000)
[51] Plous, S.: The Psychology of Judgment and Decision Making. McGraw-Hill Book Company (1993)
[52] Prashanth, L.A., Jie, C., Fu, M., Marcus, S., Szepesvári, C.: Cumulative prospect theory meets reinforcement learning: Prediction and control. In: International Conference on Machine Learning, pp. 1406–1415 (2016)
[53] Precup, D., Sutton, R.S., Singh, S.: Eligibility traces for off-policy policy evaluation. In: Proceedings of the 17th International Conference on Machine Learning, pp. 759–766 (2000)
[54] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
[55] Shani, G., Heckerman, D., Brafman, R.I.: An MDP-based recommender system. Journal of Machine Learning Research (2005)
[56] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: ICML (2014)
[57] Silver, D., Newnham, L., Barker, D., Weller, S., McFall, J.: Concurrent reinforcement learning from customer interactions. In: The Thirtieth International Conference on Machine Learning (2013)
[58] Strehl, A., Langford, J., Li, L., Kakade, S.: Learning from logged implicit exploration data. In: Proceedings of Neural Information Processing Systems 24, pp. 2217–2225 (2010)
[59] Strens, M.J.A.: A Bayesian framework for reinforcement learning. In: Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pp. 943–950. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2000). URL http://dl.acm.org/citation.cfm?id=645529.658114
[60] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, second edn. The MIT Press (2018). URL http://incompleteideas.net/book/the-book-2nd.html
[61] Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)
[62] Tennenholtz, G., Mannor, S.: The natural language of actions. International Conference on Machine Learning (2019)
[63] Theocharous, G., Healey, J., Mahadevan, S., Saad, M.A.: Personalizing with human cognitive biases. In: Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, UMAP 2019, Larnaca, Cyprus, June 09-12, 2019, pp. 13–17 (2019). DOI 10.1145/3314183.3323453. URL https://doi.org/10.1145/3314183.3323453
[64] Theocharous, G., Thomas, P.S., Ghavamzadeh, M.: Personalized ad recommendation systems for life-time value optimization with guarantees. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 1806–1812 (2015). URL http://ijcai.org/Abstract/15/257
[65] Theocharous, G., Vlassis, N., Wen, Z.: An interactive points of interest guidance system. In: Companion Publication of the 22nd International Conference on Intelligent User Interfaces, IUI 2017, Limassol, Cyprus, March 13-16, 2017, pp. 49–52 (2017). DOI 10.1145/3030024.3040983. URL https://doi.org/10.1145/3030024.3040983
[66] Theocharous, G., Wen, Z., Abbasi, Y., Vlassis, N.: Scalar posterior sampling with applications. In: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 7696–7704 (2018). URL http://papers.nips.cc/paper/7995-scalar-posterior-sampling-with-applications
[67] Thomas, P.: Bias in natural actor-critic algorithms. In: International Conference on Machine Learning, pp. 441–448 (2014)
[68] Thomas, P.S., Theocharous, G., Ghavamzadeh, M.: High-confidence off-policy evaluation. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pp. 3000–3006 (2015)
[69] Thomas, P.S., Theocharous, G., Ghavamzadeh, M.: High confidence policy improvement. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 2380–2388 (2015). URL http://proceedings.mlr.press/v37/thomas15.html
[70] Thomas, P.S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., Brunskill, E.: Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 4740–4745 (2017). URL http://aaai.org/ocs/index.php/IAAI/IAAI17/paper/view/14550
[71] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: YFCC100M: The new data in multimedia research. Commun. ACM (2), 64–73 (2016). DOI 10.1145/2812802. URL http://doi.acm.org/10.1145/2812802
[72] Tiejun, C., Yanli, W., H., B.S.: FSelector. Bioinformatics (21), 2851–2852 (2012). DOI 10.1093/bioinformatics/bts528. URL http://dx.doi.org/10.1093/bioinformatics/bts528
[73] Tirenni, G., Labbi, A., Berrospi, C., Elisseeff, A., Bhose, T., Pauro, K., Poyhonen, S.: The 2005 ISMS Practice Prize Winner: Customer-Equity and Lifetime Management (CELM) Finnair case study. Marketing Science, 553–565 (2007)
[74] Tversky, A., Kahneman, D.: Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 297–323 (1992). DOI 10.1007/BF00122574
[75] Walraven, E., Spaan, M.T.J.: Column generation algorithms for constrained POMDPs. Journal of Artificial Intelligence Research (2018)
[76] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning (3-4), 229–256 (1992)
[77] Yost, K.A., Washburn, A.R.: The LP/POMDP marriage: optimization with imperfect information. Naval Research Logistics