Deep Reinforcement Learning-Based Product Recommender for Online Advertising
Milad Vaali Esfahaani, Yanbo Xue, and Peyman Setoodeh
Abstract—In online advertising, recommender systems try to propose items from a list of products to potential customers according to their interests. Such systems have been increasingly deployed in e-commerce due to the rapid growth of information technology and the availability of large datasets. The ever-increasing progress in the field of artificial intelligence has provided powerful tools for dealing with such real-life problems. Deep reinforcement learning (RL), which deploys deep neural networks as universal function approximators, can be viewed as a valid approach for the design and implementation of recommender systems. This paper provides a comparative study between value-based and policy-based deep RL algorithms for designing recommender systems for online advertising. The RecoGym environment is adopted for training these RL-based recommender systems, where long short-term memory (LSTM) networks are deployed to build the value and policy networks in these two approaches, respectively. The LSTM is used to take account of the key role that order plays in the sequence of item observations by users. The designed recommender systems aim at maximising the click-through rate (CTR) for the recommended items. Finally, guidelines are provided for choosing proper RL algorithms for different scenarios that the recommender system is expected to handle.
Index Terms—Recommender system, deep reinforcement learning, online advertising, policy gradient method, DQN, LSTM.

M. Vaali Esfahaani and P. Setoodeh are with the School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran (e-mail: [email protected]; [email protected]). Y. Xue is with the Career Science Lab, Beijing, China, and also with the Department of Control Engineering, Northeastern University Qinhuangdao, China (e-mail: [email protected]).
I. INTRODUCTION
Recent advances in information technology and computer networks, as part and parcel of the fourth industrial revolution, have led to the generation and distribution of large amounts of data on a daily basis. Hence, handling big data has become a daily-life challenge for different communities. This challenge stems from the fact that users must be able to effectively use all available resources: faced with massive amounts of data, consciously choosing what to be aware of or what to attend to calls for a comprehensive understanding of the content of all options [1]. Such a challenge can be best addressed by deploying software-defined recommender systems. For instance, in e-commerce, recommender systems provide selected useful and valuable information to customers on one hand and increase sales by targeted advertising on the other hand.

Different techniques and models can be used to implement a recommender system. Recently, machine learning algorithms, a branch of artificial intelligence, have been used for this purpose [2].
These algorithms are divided into three general categories: supervised, unsupervised, and reinforcement learning [3]. In supervised learning, the training data includes both input and target vectors, so that the desired output can be used in the training process. On the contrary, in unsupervised learning, the dataset does not include target vectors; therefore, such algorithms usually try to identify similarities between samples. In RL algorithms, interaction with the environment allows for finding a suitable action that increases the reward signal [4].

There are three major categories of RL algorithms: actor-only, critic-only, and actor-critic [5]. Actor-only algorithms such as the policy gradient (PG) method learn a policy function, which determines the action to be taken in a certain state; critic-only algorithms such as Q-learning learn a value function that provides an estimate of the expected collected reward over the control horizon; and actor-critic algorithms learn both a policy function and a value function that evaluates the policy. In the context of deep RL, deep neural networks, which are universal function approximators, are trained to approximate the optimal policy function, the optimal value function, or both. In this framework, the corresponding neural networks are called policy and value networks.

Deploying RL for the design and implementation of recommender systems has two main advantages compared to alternative approaches. First, for finding an optimal strategy, RL takes account of the interactive nature of the recommendation process as well as probable changes in customers' priorities over time. These systems consider the dynamics of changing tastes and personal preferences; hence, they are able to behave more intelligently in recommending goods. Second, RL-based systems obtain an optimal strategy by maximizing the expected accumulated reward over time. Thus, the system may identify an item to offer that yields only a small immediate reward but contributes immensely to the rewards of future recommendations [6]. Substantial work has been done on using RL in recommender systems aimed at different applications. Most of the work reported in the literature has used variations of either the deep Q-network (DQN) or the PG algorithm.

Collaborative filtering was used in [7], where the problem was reformulated as a k-armed bandit problem. This work assumed interdependence between items, which had previously been ignored. In the proposed model, interdependent items were considered as clusters of arms, where the arms in a cluster reflected the hidden similarities of the items. Using DQN, a system for news recommendation was proposed in [8]. In that recommender system, the user's return pattern provides additional information, which is complementary to the click/no-click feedback from the user. A video recommender was proposed in [9], which uses a hybrid model combined with DQN; recommendations are based on the content of the video and the feeling it induces. A movie recommender was presented in [10] that uses double DQN to address the overestimation issue. In [11], a robust version of the DQN algorithm was suggested for tip recommendation.
In [12], DQN was used for optimal advertising and for finding the optimal location at which to interpolate an ad into a recommendation list. A social attentive version of DQN was used in [13], which benefits from the preferences of both users and their social neighbours for estimating action-values. In [14], it was proposed to develop a generative adversarial network (GAN) as a model of user behaviour, which can play the role of the environment for a DQN-based recommender system.

A conversational recommender system was proposed in [15], which integrates a dialog system and a recommender system. Using a policy network, the presented system provides personalized recommendations based on the past collective information and the current conversation. The PG algorithm was deployed in [16] for contextual recommendations by relaxing some limiting assumptions of the contextual bandit method, such as a simple reward function and a static environment. A scaled-up version of the REINFORCE algorithm was used in [17] to deal with a large action space and recommend the top k best items instead of just the best one. Deploying the GAN architecture, a generative model was learned in [18] for the sequence of user preferences over time; the proposed algorithm can be viewed as a special case of imitation learning, and as future work the authors suggested using the trust region policy optimization method. The latent distribution of user preferences was learned by an adversarial model in [19], and then the PG method was used to update the recommender using a set of points of interest sampled according to the learned distribution. The notion of a search-session Markov decision process was introduced in [20], which was then used for multi-step ranking; the authors used a version of the PG algorithm to find the optimal policy. A recommender system was proposed in [21] that uses the actor-critic algorithm with emphasis on dynamic adaptation.

Here, we provide a comparative study between two categories of deep RL-based recommender systems for online advertising that respectively use value-based and policy-based algorithms. As the value-based method, we use different variations of the DQN algorithm, and as the policy-based method, we use the PG algorithm. For different scenarios and various user/item spaces, the best setting for each method is provided with regard to appropriate metrics such as stability of recommendation, training speed, and computational burden. Aiming at designing stable and efficient recommender systems, the performance of the DQN and PG algorithms is thoroughly studied and compared. The contributions of this paper are threefold:
• Improving the convergence rate of the recommender system such that the required number of episodes to reach a stable, acceptable performance is reduced.
• Providing a guaranteed level of performance by improving the performance stability and reducing the variance of the CTR.
• Providing a computational tool based on performance area contours to recommend a proper RL algorithm for the scenarios that the recommender system is expected to handle.
The mentioned improvements are achieved step by step. Two modifications are made to the original DQN architecture. First, instead of using a convolutional neural network (CNN), an LSTM-based value network is used in the DQN algorithm. Second, the value network is trained using the Huber loss function adopted from robust statistics. The superiority of the modified DQN over the original algorithm is shown via simulations for different scenarios with different state/action space dimensions.
For the PG algorithm, the policy network is built using the LSTM as well. Both the modified DQN and the PG algorithms owe their improved performance to the gating mechanisms in the LSTM cell, which control the flow of information, and to the ability of the LSTM to capture the dependence of clicks on recommended items on the user's search history. While DQN-based recommender systems properly handle low-dimensional scenarios, PG-based systems show superiority in handling high-dimensional cases.

The rest of the paper is organized as follows. Section II reviews the basic concepts and theorems of RL adapted to online advertising. Section III presents details of the software testbed used for implementing recommender systems in the RL framework, as well as computer experiments and their results; moreover, guidelines are provided for choosing the best algorithm and setting in different circumstances. Finally, the paper concludes in Section IV.
II. REINFORCEMENT LEARNING
As previously mentioned, RL refers to learning methods in which an agent interacts with its environment and learns via trial and error. The underlying mathematical model of RL is the Markov decision process (MDP), which is explained in the following.
A. Markov decision process
Here, the online advertising problem is formulated as an MDP, where the agent is responsible for delivering online advertisements to potential customers. Generally, every MDP consists of a tuple ⟨S, A, P, R, γ⟩ with the following components:
• State space, S, refers to the set of all users' profiles, viewed as the environment with which the agent (i.e., the recommender system) interacts. This set may be only partially used by the learning algorithm. The agent can implicitly learn a profile by observing the corresponding user's behaviour over a time interval.
• Action space, A, refers to all available items that can be advertised. If a user watches some items on the e-commerce website for several time intervals, then, as an action a_t, the agent recommends an item on the publisher website.
• Reward, R, is defined based on the user's feedback. If the user clicks on the item that the agent recommended, r_t will be 1; otherwise it will be 0.
• Transition probability, P, is the probability that the state changes from s_t to s_{t+1}. This conditional probability is assumed to satisfy the Markov property. Simply put, this property states that the future can be predicted from the present while ignoring the past: p(s_{t+1} | s_t, a_t, . . . , s_0, a_0) = p(s_{t+1} | s_t, a_t).
• Discount factor, γ, takes a value between 0 and 1. This value is a measure of the relative importance of future rewards compared to the instant reward. If γ = 0, the agent pays attention only to the immediate reward; on the contrary, if γ = 1, the algorithm considers all future rewards when taking the current action.

Fig. 1: The recommender system as a reinforcement-learning agent.

Fig. 1 demonstrates the RL setting. Using the above definitions, the online advertising problem can be formulated as an MDP, which can then be solved by RL algorithms. Given a history of the corresponding MDP, the goal is to find an optimal policy for advertising, π : S → A, which maximizes the cumulative reward for a specific user. In the context of online advertising, this means that the optimal policy maximizes the click-through rate.

The goal of an agent is to find an optimal policy, i.e., a policy that maximizes the cumulative reward at each state, defined as follows:

V^*(s) = \max_\pi \mathbb{E}_\pi \Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\big|\, s_t = s \Big\},   (1)

where π represents the policy that the agent will follow from state s, E_π represents the expectation under this policy, t is the current time step, and t + k refers to future time steps. Furthermore, r_{t+k} is the immediate reward at time step (t + k). The problem will be solved if the recommender agent finds the optimal state-action value function, which is defined for all states and actions as:

Q^*(s, a) = \max_\pi \mathbb{E}_\pi \Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\big|\, s_t = s, a_t = a \Big\}.   (2)

Considering a finite control horizon, T, the objective function can be rewritten as follows:

\mathbb{E}_\pi \Big\{ \sum_{k=0}^{T} \gamma^k r_{t+k} \,\big|\, s_t = s \Big\}.   (3)

An optimal policy can be found from either the optimal state value function, V^*(s), or the optimal state-action value function, Q^*(s, a). For problems with small finite state and action spaces, methods such as dynamic programming and temporal-difference (TD) learning can be implemented using a table to store and update the value function. However, function approximation methods are widely used when it comes to high-dimensional problems in order to cope with the curse of dimensionality. These methods can benefit from parametric models such as deep neural networks to represent the value and policy functions. Then, the algorithm optimizes the model parameters based on the reward signal. As two successful deep RL algorithms that rely on deep learning for function approximation, we can refer to DQN and PG, both of which have been used for designing recommender systems.
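For illustration, the following minimal sketch shows how the quantities in Eqs. (1)–(3) translate into code: a finite-horizon discounted return computed from a sequence of click rewards, and a greedy policy read off a tabular state-action value function. The toy states, items, and Q-values are hypothetical and serve only to make the notation concrete; they are not part of the experimental setup.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Finite-horizon return sum_k gamma^k * r_{t+k} as in Eq. (3)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Toy tabular Q-function: Q[state][item] = estimated value of recommending
# `item` (the action) to a user whose profile is summarised by `state`.
Q = {
    "viewed_electronics": {"phone_case": 0.31, "garden_hose": 0.02},
    "viewed_garden":      {"phone_case": 0.04, "garden_hose": 0.27},
}

def greedy_policy(state):
    """pi*(s) = argmax_a Q*(s, a): recommend the highest-valued item."""
    return max(Q[state], key=Q[state].get)

print(discounted_return([0, 1, 0, 1], gamma=0.9))   # 0.9 + 0.729 = 1.629
print(greedy_policy("viewed_garden"))               # 'garden_hose'
```

In the advertising MDP, each reward in the sequence is the 1/0 click signal, so the return directly aggregates (discounted) clicks, which is why maximizing it maximizes the CTR.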
B. Deep Q-Network

DQN was originally used for playing Atari games [22]. In DQN, convolutional neural networks are used to approximate the state-action value function, and stochastic gradient descent is used to optimize the objective function and update the parameter values. The experience replay mechanism is also adopted to address the issues of correlated data and non-stationary distributions. The optimal state-action value is the expected value of r + γ Q^*(s', a'). The Q-network is trained by minimizing the mean squared error (MSE) loss function:

L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)} \big[ (y_i - Q(s, a; \theta_i))^2 \big].   (4)

Algorithm 1 presents the pseudo-code of the DQN algorithm with experience replay [22].
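As a hedged illustration of how the loss in Eq. (4) is minimized in practice, the sketch below performs one Q-network update on a minibatch drawn from an experience replay buffer. The network shape, optimizer, and the use of PyTorch are assumptions made for this sketch rather than the configuration reported in this paper; a separate target network is omitted, consistent with Algorithm 1.

```python
import random
import torch
import torch.nn as nn

STATE_DIM, N_ITEMS, GAMMA = 10, 100, 0.9   # assumed dimensions for illustration

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ITEMS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = []  # experience replay buffer of (state, action, reward, next_state, done)

def dqn_update(batch_size=32):
    batch = random.sample(replay, batch_size)
    s      = torch.stack([b[0] for b in batch])                       # (B, STATE_DIM)
    a      = torch.tensor([b[1] for b in batch])                      # (B,) item indices
    r      = torch.tensor([b[2] for b in batch], dtype=torch.float32)  # click = 1, no click = 0
    s_next = torch.stack([b[3] for b in batch])
    done   = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    # Target y = r + gamma * max_a' Q(s', a'), with no bootstrap at terminal states.
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * q_net(s_next).max(dim=1).values

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    loss = nn.functional.mse_loss(q_sa, y)                 # the MSE loss of Eq. (4)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the advertising setting, a stored transition would hold a summary of the user's observation history as the state, the recommended item index as the action, and the click/no-click signal as the reward.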
C. Policy Gradient

For the PG method, the decision-making process is carried out by the agent for all s ∈ S, a ∈ A at any time step according to a parameterized policy:

\pi(s, a, \theta) = \Pr\{ a_t = a \mid s_t = s, \theta \},   (5)

where θ ∈ R^l for l ≪ |S| is an array of parameters used to approximate the optimal policy function by a deep neural network. It is assumed that π is differentiable with respect to its parameters, so that it can be optimized using gradient-based methods. Two formulations of this problem were presented in [23], the average-reward and the start-state formulations, which are reviewed in what follows.

The average-reward formulation prioritizes policies based on their long-term expected collected reward per step:

\rho(\pi) = \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\{ r_1 + r_2 + \cdots + r_n \mid \pi \} = \sum_s d^\pi(s) \sum_a \pi(s, a) \mathcal{R}^a_s,
Q^\pi(s, a) = \sum_{t=1}^{\infty} \mathbb{E}\{ r_t - \rho(\pi) \mid s_0 = s, a_0 = a, \pi \}, \quad \forall s \in S, a \in A,   (6)

where

d^\pi(s) = \lim_{t \to \infty} \Pr\{ s_t = s \mid s_0, \pi \}   (7)

denotes a stationary distribution over states under the policy, which is assumed to exist for all policies regardless of the initial state, s_0.
Algorithm 1 Deep Q-learning with Experience Replay [22]
  Initialize the replay memory D to capacity N
  Initialize the state-action value function Q with random weights
  for episode = 1 to M do
    Initialize the sequence s_1 = {x_1} and the preprocessed sequence φ_1 = φ(s_1)
    for t = 1 to T do
      With probability ε select a random action a_t; otherwise select a_t = argmax_a Q^*(φ(s_t), a; θ)
      Execute the action a_t in the emulator and observe the reward r_t and image x_{t+1}
      Set s_{t+1} = (s_t, a_t, x_{t+1}) and preprocess φ_{t+1} = φ(s_{t+1})
      Store the transition (φ_t, a_t, r_t, φ_{t+1}) in D
      Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
      Set y_j = r_j for terminal φ_{j+1}, and y_j = r_j + γ max_{a'} Q(φ_{j+1}, a'; θ) for non-terminal φ_{j+1}
      Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2
    end for
  end for

The start-state formulation focuses on the long-term reward collected from a specific initial state, s_0, and is formulated as:

\rho(\pi) = \mathbb{E}\Big\{ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\big|\, s_0, \pi \Big\},
Q^\pi(s, a) = \mathbb{E}\Big\{ \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \,\big|\, s_t = s, a_t = a, \pi \Big\},   (8)

where γ ∈ [0, 1] is the discount factor and d^\pi(s) is considered as a discounted weighting of the states reachable from s_0 by following the policy π:

d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr\{ s_t = s \mid s_0, \pi \}.   (9)

To optimize the policy-network parameters, the gradient of the performance metric with respect to the parameters is needed, which is calculated for both formulations as [23]:

\frac{\partial \rho}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} Q^\pi(s, a).   (10)

After training, the policy network can be fed with a description of the state to produce a distribution over actions at the output [24].
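In practice, the gradient in Eq. (10) is commonly estimated from sampled episodes via the REINFORCE (Monte-Carlo policy-gradient) estimator, which weights the log-probability of each taken action by the return that followed it. The sketch below is one such illustrative implementation; the softmax policy head, layer sizes, and the use of PyTorch are assumptions for this example and do not describe the exact training code behind the reported results.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ITEMS, GAMMA = 10, 100, 0.9   # assumed dimensions for illustration

policy_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                           nn.Linear(64, N_ITEMS))   # logits over candidate items
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def select_item(state):
    """Sample a recommendation from the parameterized policy of Eq. (5)."""
    probs = torch.softmax(policy_net(state), dim=-1)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

def reinforce_update(log_probs, rewards):
    """One gradient step from a finished episode (lists of equal length)."""
    returns, g = [], 0.0
    for r in reversed(rewards):                 # G_t = r_t + gamma * G_{t+1}
        g = r + GAMMA * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Minimizing -sum_t log pi(a_t | s_t) * G_t ascends the gradient of Eq. (10).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Here, a state would again summarise the items the user has observed so far, and the sampled action is the index of the item to recommend; the click signals collected during the episode form the reward list.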
III. EXPERIMENTS

Since RL algorithms work in an online manner, it is necessary to use an environment that is capable of handling the random appearance of users, ads, and suggested items for testing and evaluating such algorithms. For this reason, we adopted the RecoGym environment [25], which provides the settings required for an RL problem. It allows for interaction between an agent and the environment, which leads to receiving a reward from the environment when a user clicks on a recommended item.

The RecoGym environment is mainly designed for online advertising, and the underlying process has two parts:
• The organic session occurs on the e-commerce website, during which a user sees various items.
• The bandit session occurs on the publisher website, where the agent has the opportunity to recommend some items to users and observe their reactions.

The Markov chain of this environment is depicted in Fig. 2, which shows how these two parts are related [25]. First, a user enters the organic environment and sees different items at different time steps. Thus, within a variable time interval T, a user observes one or more items. Then, the bandit session starts, and based on the user's search history at previous time steps, the recommender agent suggests several items to him/her. The bandit session ends after a time interval, which is chosen randomly for different people. The user may click on one of the recommended products during this session. If this happens, the recommender agent receives a reward and the user moves back to the organic session; otherwise, the transfer to the organic session occurs when the bandit session ends. The switching between these two sessions occurs a random number of times for each user.

To train the recommender agent, we first deploy the DQN algorithm [22] using a one-dimensional CNN to approximate the state-action value function. Since the number of observations and the length of the time intervals change, variable padding is used for the observations before feeding them to the convolutional network. The exploration rate decreases from 0.9 to 0.1 during each episode. We assume that each user's profile contains ten features. Two different loss functions are considered, the MSE and the Huber function, in order to demonstrate the effect of the loss on the performance. The Huber function, which has been used in robust regression, has very low sensitivity to outliers compared to the mean squared error. The Huber function is defined as follows:

L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}   (11)

Figure 3a illustrates the Huber loss function for different values of δ and compares it with x^2 and |x|. In all experiments, we set δ = 2. The recommender agent was built around the DQN algorithm using both the Huber and MSE loss functions.
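The Huber loss of Eq. (11) is straightforward to compute directly; the short NumPy sketch below evaluates it with δ = 2 (the value used in all experiments) and contrasts it with the squared error, whose penalty on large residuals grows much faster. The sample residual values are arbitrary illustrations.

```python
import numpy as np

def huber(residual, delta=2.0):
    """Huber loss of Eq. (11): quadratic near zero, linear for large residuals."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

residuals = np.array([0.5, 1.0, 3.0, 10.0])
print(huber(residuals))          # [ 0.125  0.5    4.    18.  ]
print(0.5 * residuals ** 2)      # [ 0.125  0.5    4.5   50.  ] -> squared error grows much faster
```

The bounded growth on large residuals is what makes the resulting Q-network training less sensitive to occasional outlying targets than training with the MSE.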
Fig. 2: Markov chain of the organic and bandit sessions in the RecoGym environment.

Results reported here were obtained by averaging over 50 runs. The average performance ratio of the DQN algorithm for these two loss functions is depicted in Fig. 3b for 10, 100, and 1000 items. Compared to the MSE, using the Huber function shows an increase of 12 percent in the CTR as the number of items increases from 10 to 1000. Figures 4a, 5a, and 6a show the CTR achieved by the DQN algorithm using the Huber loss function versus the number of episodes for different state-action spaces (i.e., 100, 1000, and 10000 users and items).
[Figure: value networks used in the DQN algorithm — (a) input layer (array of items) → 1-D CNN (kernel size 5) → dense hidden layer (sigmoid) → linear output of state-action values; (b) input layer (array of items) → LSTM units → dense hidden layer (sigmoid) → linear output of state-action values.]
Fig. 3: (a) Huber loss function for different values of δ compared to x^2 and |x|. (b) Average performance ratio of the DQN algorithm using the Huber and MSE loss functions for different numbers of items.

A user's search for different items on an e-commerce website can be viewed as a sequential process in which order and history are of crucial importance. The item observed by a user in each time step depends on previously observed items during past time steps. Hence, we can expect that deploying recurrent neural networks would provide a better approximation of the state-action value function. We used a modified version of the DQN algorithm, replacing the convolutional neural network with an LSTM, a recurrent neural network that benefits from gating mechanisms for controlling the flow of information. The LSTM is used to approximate the state-action value function in the DQN algorithm. We repeated the previously mentioned experiments with this new architecture of the DQN algorithm, which uses both the LSTM and the Huber loss function. Again, the number of each user's features is assumed to be ten, and the exploration rate decreases from 0.9 to 0.1 in each episode. Results are shown in Figures 4b, 5b, and 6b for 100, 1000, and 10000 users and items, respectively. In these figures, the superiority of the new setting that deploys the LSTM instead of the CNN is obvious in both convergence and performance. Moreover, for larger action spaces (i.e., more items), the LSTM-based DQN shows improved stability and lower variance compared to the CNN-based DQN.

Next, we trained a recommender system based on the policy gradient method using the RecoGym environment. While the DQN learns an approximation of the optimal state-action value function and from that finds the optimal policy, the PG algorithm directly finds the optimal policy by searching the policy space. The PG algorithm was implemented using a policy network that learns a distribution over the actions. The hidden layers of the policy network included LSTM units and dense layers, and the output layer used the softmax function to form a probability distribution. The categorical cross-entropy loss function was used for training the policy network. Results achieved by the PG algorithm are compared with those of the DQN in Figures 7, 8, and 9 for 100, 1000, and 10000 users and items, respectively. From these figures, we see that the PG method converges faster than the DQN, which is more significant in scenarios with larger state/action spaces. Moreover, the PG algorithm achieves a better CTR with fewer fluctuations and lower variance, especially for larger state/action spaces, and therefore presents a more stable behaviour.

Fig. 4: Click-through rate versus the number of episodes for 100 users and 100 items achieved by: (a) DQN with CNN and (b) DQN with LSTM.
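The LSTM-based value and policy networks described above can be sketched as follows. Keras is used here purely for illustration, and the sequence length, layer widths, and optimizer are assumed values rather than the exact configuration behind the reported results; the sketch only conveys the overall structure: a recurrent encoder over the sequence of observed items, followed by a dense head that outputs either one Q-value per item (value network, trained with the Huber loss) or a softmax distribution over items (policy network, trained with categorical cross-entropy).

```python
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, N_FEATURES, N_ITEMS = 20, 10, 1000   # assumed dimensions for illustration

# Value network for the modified DQN: maps a sequence of observed items
# (each encoded by N_FEATURES features) to a Q-value for every candidate item.
value_net = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, N_FEATURES)),
    layers.LSTM(64),
    layers.Dense(64, activation="sigmoid"),
    layers.Dense(N_ITEMS, activation="linear"),   # one Q-value per item
])
value_net.compile(optimizer="adam", loss=keras.losses.Huber(delta=2.0))

# Policy network for the PG recommender: same recurrent encoder, but a softmax
# head producing a probability distribution over items.
policy_net = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, N_FEATURES)),
    layers.LSTM(64),
    layers.Dense(64, activation="relu"),
    layers.Dense(N_ITEMS, activation="softmax"),
])
policy_net.compile(optimizer="adam", loss="categorical_crossentropy")
```

The shared recurrent encoder is what lets both networks exploit the order of the user's item observations, which is the property motivating the replacement of the 1-D CNN.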
Fig. 5: Click-through rate versus the number of episodes for 1000 users and 1000 items achieved by: (a) DQN with CNN and (b) DQN with LSTM.

In order to obtain a measure that guides us in choosing a proper deep RL algorithm for designing a recommender system for a specific scenario, we need to examine the user-item space. Figures 10a and 10b illustrate the performance of the DQN and the PG algorithms for state/action spaces with different dimensions (i.e., different numbers of users and items). These figures were plotted using a logarithmic scale for both axes. According to these figures, the DQN and the PG algorithms can be recommended for lower- and higher-dimensional state/action spaces, respectively.
Fig. 6: Click-through rate versus the number of episodes for 10000 users and 10000 items achieved by: (a) DQN with CNN and (b) DQN with LSTM.
Fig. 7: Click-through rate versus the number of episodes for 100 users and 100 items achieved by: (a) DQN and (b) PG.
Fig. 8: Click-through rate versus the number of episodes for 1000 users and 1000 items achieved by: (a) DQN and (b) PG.
Fig. 9: Click-through rate versus the number of episodes for 10000 users and 10000 items achieved by: (a) DQN and (b) PG.
IV. CONCLUSION
We deployed two deep reinforcement learning algorithms to design recommender systems for online advertising: the deep Q-network and the policy gradient method. While the former is a critic-only algorithm, the latter is an actor-only one. The RecoGym was used as the environment with which the RL agent interacts.
Fig. 10: Performance area for different RL algorithms: (a) DQN and (b) PG. Both axes are in logarithmic scale.

The RL agent aims at maximising the click-through rate for the recommended items. The original architecture of the DQN was modified using the LSTM as the value network to take account of the key role that order plays in the search history and in the observation of different items by users. Moreover, the Huber function was adopted from robust statistics as the loss function. These two modifications improved the convergence characteristics of the DQN algorithm and led to fewer fluctuations and lower variance in the click-through rate. Given the importance of order in the sequence of observations of different items by users, the LSTM was used to implement the policy network for the PG algorithm as well. The PG algorithm showed even better convergence behaviour and lower variance in the click-through rate compared to the LSTM-based DQN, especially for larger state/action spaces. These results confirm the fact that actor-only algorithms are more resilient against fast-changing and nonstationary environments than critic-only algorithms. Finally, performance area contours were used to provide guidelines for choosing a proper deep reinforcement learning algorithm for building recommender systems based on the dimensions of the state/action spaces (i.e., the number of users and items that the recommender system is expected to handle).
REFERENCES
[1] Z. Batmaz, A. Yurekli, A. Bilge, and C. Kaleli, "A review on deep learning for recommender systems: challenges and remedies," Artificial Intelligence Review, vol. 52, no. 1, pp. 1–37, 2019.
[2] A. S. Lampropoulos and G. A. Tsihrintzis, "Machine learning paradigms," Applications in Recommender Systems. Switzerland: Springer International Publishing, 2015.
[3] I. Portugal, P. Alencar, and D. Cowan, "The use of machine learning algorithms in recommender systems: A systematic review," Expert Systems with Applications, vol. 97, pp. 205–227, 2018.
[4] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[5] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[6] X. Zhao, L. Zhang, Z. Ding, D. Yin, Y. Zhao, and J. Tang, "Deep reinforcement learning for list-wise recommendations," arXiv preprint arXiv:1801.00209, 2017.
[7] Q. Wang, C. Zeng, W. Zhou, T. Li, S. S. Iyengar, L. Shwartz, and G. Grabarnik, "Online interactive collaborative filtering using multi-armed bandit with dependent arms," IEEE Transactions on Knowledge and Data Engineering, 2018.
[8] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li, "DRN: A deep reinforcement learning framework for news recommendation," in Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 2018, pp. 167–176.
[9] A. Tripathi, D. Manasa, K. Rakshitha, T. Ashwin, and G. R. M. Reddy, "Role of intensity of emotions for effective personalized video recommendation: A reinforcement learning approach," in Recent Findings in Intelligent Computing Techniques. Springer, 2018, pp. 507–517.
[10] Z. Zhao and X. Chen, "Deep reinforcement learning based recommend system using stratified sampling," in IOP Conference Series: Materials Science and Engineering, vol. 466, no. 1. IOP Publishing, 2018, p. 012110.
[11] S.-Y. Chen, Y. Yu, Q. Da, J. Tan, H.-K. Huang, and H.-H. Tang, "Stabilizing reinforcement learning in dynamic environment with application to online recommendation," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1187–1196.
[12] X. Zhao, C. Gu, H. Zhang, X. Liu, X. Yang, and J. Tang, "Deep reinforcement learning for online advertising in recommender systems," 2019.
[13] Y. Lei, Z. Wang, W. Li, and H. Pei, "Social attentive deep Q-network for recommendation," in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2019, pp. 1189–1192.
[14] X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song, "Generative adversarial user model for reinforcement learning based recommendation system," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 1052–1061.
[15] Y. Sun and Y. Zhang, "Conversational recommender system," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018, pp. 235–244.
[16] F. Pan, Q. Cai, P. Tang, F. Zhuang, and Q. He, "Policy gradients for contextual recommendations," in The World Wide Web Conference. ACM, 2019, pp. 1421–1431.
[17] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi, "Top-k off-policy correction for a REINFORCE recommender system," in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 2019, pp. 456–464.
[18] J. Yoo, H. Ha, J. Yi, J. Ryu, C. Kim, J.-W. Ha, Y.-H. Kim, and S. Yoon, "Energy-based sequence GANs for recommendation and their connection to imitation learning," arXiv preprint arXiv:1706.09200, 2017.
[19] F. Zhou, R. Yin, K. Zhang, G. Trajcevski, T. Zhong, and J. Wu, "Adversarial point-of-interest recommendation," in The World Wide Web Conference. ACM, 2019, pp. 3462–3468.
[20] Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu, "Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 368–377.
[21] F. Liu, R. Tang, X. Li, W. Zhang, Y. Ye, H. Chen, H. Guo, and Y. Zhang, "Deep reinforcement learning based recommendation with explicit user-item interactions modeling," arXiv preprint arXiv:1810.12027, 2018.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[23] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[24] S. S. Mousavi, M. Schukat, and E. Howley, "Traffic light control using deep policy-gradient and value-function-based reinforcement learning," IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
[25] D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and A. Karatzoglou, "RecoGym: A reinforcement learning environment for the problem of product recommendation in online advertising," arXiv preprint arXiv:1808.00720, 2018.