Learning to Infer User Hidden States for Online Sequential Advertising
Zhaoqing Peng, Junqi Jin
Alibaba Group
Lan Luo
University of Southern California
Yaodong Yang, Rui Luo
University College London
Jun Wang
University College London
Weinan Zhang
Shanghai Jiao Tong University
Haiyang Xu
Alibaba Group
Miao Xu, Chuan Yu
Alibaba Group
Tiejian Luo
Univ. of Chinese Academy of Sciences
Han Li, Jian Xu, Kun Gai
Alibaba Group
ABSTRACT
To drive purchases in online advertising, it is of great interest to advertisers to optimize the sequential advertising strategy, whose performance and interpretability are both important. The lack of interpretability in existing deep reinforcement learning methods makes it difficult to understand, diagnose, and further optimize the strategy. In this paper, we propose our Deep Intents Sequential Advertising (DISA) method to address these issues. The key to interpretability is understanding a consumer's purchase intent, which is, however, unobservable (called the hidden state). We model this intention as a latent variable and formulate the problem as a Partially Observable Markov Decision Process (POMDP), where the underlying intents are inferred from the observable behaviors. Large-scale industrial offline and online experiments demonstrate our method's superior performance over several baselines. The inferred hidden states are analyzed, and the results demonstrate the rationality of our inference.
CCS CONCEPTS
• Information systems → Display advertising; • Theory of computation → Sequential decision making.
KEYWORDS
Partially Observable Markov Decision Process; Online Advertising
ACM Reference Format:
Zhaoqing Peng, Junqi Jin, Lan Luo, Yaodong Yang, Rui Luo, Jun Wang, Weinan Zhang, Haiyang Xu, Miao Xu, Chuan Yu, Tiejian Luo, Han Li, Jian Xu, and Kun Gai. 2020. Learning to Infer User Hidden States for Online Sequential Advertising. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20), October 19–23, 2020, Virtual Event, Ireland. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3340531.3412721
1 INTRODUCTION
Online advertising is an effective way for advertisers to reach their targeted audiences and drive conversions. Compared to a single ad exposure, sequential advertising [27] has a higher chance of cultivating consumers' awareness and interest and driving purchases over several steps through multiple scenarios. Fig. 1 shows an example of sequential advertising for a gaming chair in two scenarios. At time t_1, the consumer browses and becomes aware of the chair in scenario No. 1. At time t_2, he sees it again and shows interest by clicking it. After a while, the consumer visits scenario No. 2 and finally clicks and makes a purchase at times t_3 and t_4. To maximize the return on investment (ROI), advertisers have a strong desire to optimize sequential advertising strategies.

Figure 1: An advertised item on consumer trajectories across multiple scenarios.

Both the optimization and the interpretability of advertising strategies are crucial. The significance of optimization comes from its direct impact on ROI. Interpretability helps advertisers understand the strategy, provides ways to diagnose it and conduct conversion attribution, and ultimately supports further optimization.

Designing an advertising algorithm that combines performance and interpretability is very challenging. The key to interpretability is modeling the consumer's mental states under a sequence of interactions with ads. However, these mental states/intents are difficult to define, and they are not even observable. The only related information is the consumer's observed behaviors, e.g., click and purchase actions. Most interpretable algorithms tend to use shallow models such as logistic regression (non-neural-network) for more convenient analysis; however, they cannot benefit from current advances in deep learning techniques [16, 17, 24].

To overcome these difficulties, there are several related lines of work. Interpretable methods like multi-touch attribution (MTA) [12, 27] focus on assigning credit to the previously displayed ads before the conversion, but they usually do not provide future strategy optimization. Performance-oriented methods such as deep reinforcement learning (DRL) usually aggregate the consumer's historical behaviors as the input of a black-box neural network and obtain the advertising action directly from the output of the network [4, 5, 7, 9, 13]. This kind of straightforward aggregation of behaviors cannot represent and interpret the consumer's mental states well, which makes understanding, diagnosing, and optimizing the strategy difficult. Some algorithms consider both interpretability and strategy optimization [16–18, 21]. Nonetheless, the majority of these methods are limited to theoretical analysis, and the experiments are conducted mostly in toy simulated environments, which are impractical for realistic industrial applications.

Considering the above challenges and shortcomings, we propose our Deep Intents Sequential Advertising (DISA) algorithm to address these issues in advertising applications. We formulate the multi-step advertising problem as a Markov Decision Process (MDP). In this MDP, the consumer intents (states) are not directly observable, so we use a POMDP to model the state as a hidden variable inferred from observed behaviors. However, as a probabilistic framework, the POMDP's parameters are not off-the-shelf. To tackle this issue, we derive an expectation-maximization (EM) algorithm to estimate the parameters by learning from large-scale real-world data. The learned POMDP model can infer the probability distribution of user hidden states, defined as beliefs. Unlike noisy behavior data, beliefs are more abstract, and we can interpret how probable it is that a user visits each hidden state and to which state it may transit. Finally, we optimize the sequential advertising strategy based on the beliefs. Since learning the exact optimal policy of a POMDP is intractable [21], we approximate the belief value function using a variant of the Smooth Partially Observable Value Approximation (SPOVA) method [23], which is a more suitable deep architecture for POMDPs than a pure black-box Deep Q-Network (DQN).

Offline experiments show that our method outperforms several baselines. Online results demonstrate our sequential advertising's superior performance over the existing system. In terms of interpretability, we analyze the inferred hidden states and provide examples of state transitions under different advertising strategies.

Our main contributions include: 1) To the best of our knowledge, DISA is the first attempt focusing on the interpretability of realistic advertising strategies with a POMDP. 2) To optimize strategy performance, we propose a variant of the SPOVA method, a more suitable deep neural network solution for POMDPs than the pure black-box deep networks commonly used in DRL. 3) We develop the POMDP's application in large-scale industrial settings, and the inferred hidden states are analyzed to show the efficacy of our method.

The rest of this paper is organized as follows. Section 2 introduces recent work related to POMDPs, followed by an analysis of the sequential advertising problem in Section 3. Section 4 formulates the problem. Section 5 presents our approach to the problem and gives detailed implementations. In Section 6, we discuss the experimental results and interpret the hidden states as well as the learned advertising strategies. Section 7 concludes the paper.
2 RELATED WORK
Generally, a POMDP model can be considered as a belief-state MDP [21]. The Hidden Markov Model (HMM) is usually used to represent the hidden states of a POMDP [16–18]. The Baum-Welch algorithm [15] has been extended to adjust the probabilities of the Markov model, while Bayesian-based methods [24, 25] can improve the model through interaction with the environment. For policy learning, structured representations are usually used for value approximation [3, 26]. Neural networks were first introduced to yield good value approximations in SPOVA [23], and the recurrent neural network (RNN) is adopted in QMDP-net [14] for POMDP planning. However, these methods are usually evaluated on simple tasks and are impractical for realistic applications. Although MTA methods [12, 27] know how each exposure contributes to the conversion, they do not model user latent states and cannot support online inference; thus, they cannot directly solve our problem.

In applications of MDPs and POMDPs, bandit-based models with Thompson sampling are widely used in simple recommendation problems [19]. Yuan and Wang [31] propose to utilize the correlation of ads to improve the efficiency of exploration. These applications have not taken advantage of current deep learning merits for better performance. There are some DRL-based solutions [5, 7, 9, 10, 32] to ranking problems. Hu et al. [9] propose a policy gradient algorithm to learn an optimal ranking policy by modeling the reward function. Ie et al. [10] optimize slate-based recommendations based on estimated long-term value. These works mainly use end-to-end deep learning methods and are weak in terms of interpretability. DeepIntent [32] models intents using attention weights on top of an RNN, but its black-box learning cannot explicitly model intents' transitions; thus, the sample complexity could be higher without the prior knowledge that user behaviors are generated based on hidden state transitions.
3 PROBLEM ANALYSIS
In large mobile E-commerce platforms, e.g., Amazon, eBay, and Taobao, millions of users visit different scenarios every day. The repeated visits of these users allow the platform to help advertisers earn more revenue with appropriate multi-step advertising strategies. In this paper, we follow the MDP framework, and we care about interpreting the effect of ad displays on a user's purchase intention. This interpretability benefits us in 1) attribution: easily interpreting the insights behind user conversions, and 2) optimization: guiding the future advertising policy in other similar applications. However, the user intention is not directly observable, so we model it as a hidden state. To do this, we formalize the problem as a POMDP where the agent (the advertising engine) learns to maximize advertisers' revenue by inferring the consumers' hidden states.
4 PROBLEM FORMULATION

Figure 2: (a) Stream flow unrolled in temporal sequence; (b) the interactions between the agent and the environment.

Generally, at each time-step t, an advertising campaign starts with a user request U_t, which contains the user name, age, and historical behaviors. The handling of the request can be formalized in three stages. (1) Matching stage: by comparing the relevance of different items w.r.t. the user, a candidate ad set D_t = {I_1, I_2, ..., I_{|D_t|}} is recalled using matching methods such as TDM [33], where I_i is the i-th campaign launched by an advertiser X_i. (2) Sorting stage: the advertising engine applies a ranking function f_t to the set D_t, and the top K items L_K^t(D_t, f_t) = (I_(1), I_(2), ..., I_(K)) are selected and delivered back to the consumer; here, K is determined by the type of scenario. (3) Feedback stage: for each displayed item I_i, the advertising engine collects the feedback of the user purchase behavior y_t and click behavior x_t. The advertiser X_i pays bid_i to the advertising engine if the user clicks (x_t =
1) and obtains revenue price_i when the user purchases (y_t = 1). Usually, the final items L_K^t can be affected by recalling different ads D_t in the matching stage or by adjusting the ranking function f_t in the sorting stage; in this paper, we only consider how to use f_t to control the final displayed items.

Given a sequence of requests Q = (U^1 ∼ U^T) from a consumer, our problem is defined as a sequential decision process that determines the appropriate ad items (L_K^1 ∼ L_K^T) to maximize the advertisers' profits. The ranking function here is designed as a set of score actions f_t = {a_t^1 ∼ a_t^i} on each candidate item I_i in D_t, which are the output of the agent. To interpret each advertising action a_t^i, we need to know how a_t^i will affect or transit a user's latent intent on the item, which can be explicitly modeled by a POMDP. (A user may have multiple intents on different items, and we can feed the model with different items to obtain different intents.)

Specifically, a POMDP model is a 7-tuple (S, A, O, T, O, r, γ), where S is a set of discrete hidden states describing the intents of a user, A is a set of score actions on an item, and O is the set of the agent's observations of the user's behavior toward the item. The transition function T(s, s', a) = P(s'|s, a) describes the probability of transiting from state s to s' after executing action a, while the observation function O(s', a, o) = P(o|s', a) specifies the probability that observation o will be received after the agent performs action a and lands in state s'. The reward r captures the expected feedback from the environment, and γ ∈ [0, 1] is the discount factor.

At each time-step t, an advertising action a_t is decided given an observation o_t, which involves two steps: 1) the agent infers a belief b_t (defined as a probability distribution over all hidden states) with a state estimator, and 2) the action a_t is chosen based on b_t with a policy learner. Fig. 2 depicts these two steps.

State Estimator (SE). According to the parameters of T and O, the state estimator produces the current belief b_t from the observation o_t, the previous belief b_{t-1}, and the previous action a_{t-1} using the Bayes rule:

$$b_t(s') = \rho\, O(s', a_{t-1}, o_t) \sum_{s \in \mathcal{S}} T(s, s', a_{t-1})\, b_{t-1}(s), \quad (1)$$

where ρ is a normalization factor, and b(s) represents the probability that the consumer's hidden state is s.

Policy Learner. After estimating b_t, the agent has to learn the mapping from beliefs to actions, denoted by a policy a_t = π(b_t). One can think of a POMDP as an MDP defined over belief states; then the well-known Bellman equation still holds for the POMDP [21]. In particular,

$$V^*(b_t) = \max_{a_t}\Big[r_t + \gamma \sum_{o \in O} P(o_t \mid a_t, b_t)\, V^*(b_{t+1})\Big] \quad (2)$$

where V*(b_t) is the belief value function under an optimal policy π*. Unlike the budget-constrained settings in [13, 30], our agent's goal is to maximize the advertiser's profits within a certain time window T_w. The window is usually set according to how soon most of the conversions are reached after ad exposures. The profits are defined as the advertiser's revenue minus the budget cost; the reward is therefore given by r_{t,i} = price_i · y_t − bid_i · x_t. The objective of learning is to find an optimal policy that maximizes the expected return of each item I_i:
$$\pi_i^* = \arg\max_{\pi_i} \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{i \in \mathcal{L}_K(\mathcal{D}_t, f_t)} \gamma^t\, r_{t,i} \,\Big|\, \pi_i\Big] \quad (3)$$
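To make the reward and the objective in Eq. (3) concrete, here is a minimal Python sketch; the function names and the example numbers are illustrative assumptions rather than the production implementation.

```python
def item_reward(price, bid, clicked, purchased):
    """Advertiser profit for one displayed item: revenue minus click cost."""
    return price * purchased - bid * clicked

def discounted_return(rewards, gamma=0.9):
    """Discounted sum of per-step profits within the attribution window T_w."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three exposures of one item; a click at step 1 and a purchase at step 2.
rewards = [item_reward(158.0, 1.2, c, p) for c, p in [(0, 0), (1, 0), (1, 1)]]
print(discounted_return(rewards, gamma=0.9))
```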
5 METHODOLOGY
In this section, we introduce our proposed DISA in three parts. We first present an EM-based method to estimate the parameters of the state estimator. We then adopt an approximate method for the policy learner to optimize the value function over beliefs. Finally, we give the specific implementation of DISA in the real advertising engine.

5.1 Parameter Estimation for the State Estimator
To perform belief updates with Eq. (1), we first need to know the transition function T and the observation function O. However, these two fundamental functions are not available a priori in our case, and we have to estimate them in advance. Essentially, a POMDP can be regarded as an extended HMM conditioned on a sequence of actions. As such, we can learn the parameters of the POMDP by building a conditional HMM and solving it with EM-based algorithms.

Based on this analysis, we describe parameter estimation as a learning problem for a conditional HMM parameterized by θ = (b_0, T, O), where b_0 is the initial distribution of hidden states. Given a trajectory J_T = (a_0, o_1, a_1, o_2, a_2, ..., o_T) on an ad, we try to find the parameters that best fit the trajectory with user latent variables S. Specifically, let O = {o_1 ∼ o_T} and A = {a_0 ∼ a_{T-1}} denote the sequence of observations and the corresponding actions in J_T (each observation is given equal weight); we study the maximization of the log-likelihood of O conditioned on A:

$$l(\theta) = \log P(\mathcal{O} \mid \mathcal{A}; \theta) \ge \sum_{s \in \mathcal{S}} q(s) \log P(\mathcal{O}, s \mid \mathcal{A}; \theta) - \sum_{s \in \mathcal{S}} q(s) \log q(s) = L(q, \theta; \mathcal{O}, \mathcal{A}) \quad (4)$$

where q(s) is a density function satisfying Σ_s q(s) = 1. In the lower bound L(q, θ; O, A), the inequality follows from Jensen's inequality, and equality is reached only at q(s) = P(s|O, A; θ). Following the EM algorithm, at each iteration t the E-step estimates

$$q_t = \arg\max_q L(q, \theta_{t-1}; \mathcal{O}, \mathcal{A}) = P(s \mid \mathcal{O}, \mathcal{A}; \theta_{t-1}) \quad (5)$$

and the M-step adjusts θ by maximizing the Q-function with q_t:

$$\theta_t = \arg\max_\theta \mathbb{E}_{q_t(s)} \log P(s, \mathcal{O} \mid \mathcal{A}; \theta) \quad (6)$$

We derive a variant of the Baum-Welch algorithm to implement the above iterative procedure; the details can be found in Supplementary A.1. Once we obtain the estimated T and O, the current belief b_t can be updated by Eq. (1). The next step is to learn the action policy a_t = π(b_t) given the belief b_t.

5.2 Policy Learning over Beliefs
A critical question for policy learning is how to represent the value function over beliefs. Sondik [24] showed that the value function V(b) can be represented as the max over a finite set of vectors. However, exact methods for solving this are impractical [21], and function approximation is a more attractive alternative. In this paper, we implement the approximation with deep neural networks to improve the learning of our strategies. As such, we approximate V(b) using a set of parameterized Q-functions:

$$V(b) = \max_a Q_a(b; \eta_a) \quad (7)$$

where Q_a(b; η_a) is the expected return for taking action a in belief b, and each Q-function is approximated by the soft-max function of SPOVA [24]:

$$Q_a(b; \eta_a) = \sqrt[z]{\sum_{i=1}^{n} (b \cdot \eta_a^i)^z} \quad (8)$$

where each η_a^i is an output vector of the deep neural network w.r.t. action a, the value of n determines how many vectors are used to split the belief space into linear representations, and z is an indicator interpreted as a measure of how "rigid" the approximation is [23]. Given the Q-function, our policy π selects the action with the largest Q-value: a_t = arg max_a Q_a(b_t; η_a).

Assuming b' is the updated belief after performing the best action a in b, the value function is optimized by minimizing the square of the Bellman residual E(b) [20], where E(b) = γV(b') + r − Q_a(b; η_a). Since Eq. (8) is differentiable, a typical gradient descent method can be used to update each vector η_a^i. The update for the j-th component of the i-th η vector, η_a^{ij}, turns out to be

$$\triangle \eta_a^{ij} = \alpha\, E(b)\, b_j\, \frac{(b \cdot \eta_a^i)^{z-1}}{V(b)^{z-1}} \quad (9)$$

where α is the step size (learning rate). Note that we should keep each η_a^i positive so that the second derivative of Eq. (8) is always positive in each dimension and the function remains convex. This can be done by replacing (b · η_a) with (b · η_a + υ), where υ is a constant offset [23]. However, we found that a large constant υ introduces an updating bias when γ > 0, which leads to an unstable learning process. To address this, we compensate for the bias in the Bellman residual: E(b) = γV(b') + r − Q_a(b; η_a) + (1 − γ)υ. An alternative is to use reward shaping to keep the rewards r always positive, which prevents the learning direction of η_a^i from going toward negative values.
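As a concrete illustration of Eqs. (7)-(9), the following PyTorch sketch builds one SPOVA-style Q-head per action from a set of positive η vectors and takes one step on the squared Bellman residual via autograd. It is a minimal, hypothetical stand-in (direct η parameters rather than the MLP-embedded η vectors described above), and all names and hyper-parameter values are assumptions.

```python
import torch

n_states, n_actions, n_vec, z = 3, 3, 5, 4  # |S|, |A|, n vectors per action, rigidity z

# One set of eta vectors per action; softplus keeps them positive as Section 5.2 requires.
eta_raw = torch.randn(n_actions, n_vec, n_states, requires_grad=True)

def q_values(belief):
    """SPOVA smooth max: Q_a(b) = ( sum_i (b . eta_a^i)^z )^(1/z) for every action."""
    eta = torch.nn.functional.softplus(eta_raw)      # keep eta positive
    dots = torch.einsum('s,avs->av', belief, eta)    # (b . eta_a^i) for each action/vector
    return dots.pow(z).sum(dim=1).pow(1.0 / z)       # one Q value per action

def bellman_update(b, a, r, b_next, gamma=0.9, lr=1e-2):
    """One gradient step on the squared Bellman residual E(b)^2 (Eq. (9) via autograd)."""
    with torch.no_grad():
        target = r + gamma * q_values(b_next).max()
    loss = (target - q_values(b)[a]).pow(2)
    loss.backward()
    with torch.no_grad():
        eta_raw -= lr * eta_raw.grad
        eta_raw.grad.zero_()
    return float(loss)

b = torch.tensor([0.7, 0.2, 0.1]); b_next = torch.tensor([0.3, 0.5, 0.2])
print(bellman_update(b, a=0, r=1.0, b_next=b_next))
```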
Figure 3: Implementation of DISA. (a) Belief updating; (b) policy network.
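To make the belief update of Eq. (1), depicted in Fig. 3(a), concrete, here is a minimal NumPy sketch under the discrete-state assumption; the array layouts and toy numbers are illustrative assumptions, not the released system.

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """One Bayes-rule belief update: b'(s') ∝ O(s',a,o) * sum_s T(s,s',a) * b(s).

    b: (|S|,) current belief; a: action index; o: observation index
    T: (|S|, |A|, |S|) transition probabilities P(s'|s,a)
    O: (|S|, |A|, |O|) observation probabilities P(o|s',a)
    """
    predicted = b @ T[:, a, :]                      # sum_s T(s,s',a) b(s)
    unnormalized = O[:, a, o] * predicted           # multiply by observation likelihood
    return unnormalized / unnormalized.sum()        # rho normalization

# Toy example with 3 hidden states, 2 actions, 2 observations.
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(3), size=(3, 2))          # each (s, a) row sums to 1 over s'
O = rng.dirichlet(np.ones(2), size=(3, 2))          # each (s', a) row sums to 1 over o
b = np.array([0.6, 0.3, 0.1])
print(update_belief(b, a=1, o=0, T=T, O=O))
```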
5.3 Implementation
Here we present our detailed solution for real advertising optimization, including some key concepts for applying DISA as well as the implementation of the state estimator and the policy learner.
Item modeling level. From the online data, we found that samples of repeated exposures for a specific item are sparse, which brings difficulties in training. In this paper, we relax the POMDP modeling level from items to categories, and different consumers share the parameters of DISA during learning and execution. This setting largely increases the quantity and diversity of learning samples and improves the model's generalization. Note that we use the most fine-grained categories maintained by the advertising system. According to our data, although the category features lose some individual information, our categories are detailed enough that the individual differences within a category are small. We will study better aggregation methods in the future.
Action. The ranking function f for an advertising platform is usually designed with the eCPM sorting mechanism [13], which aims to maximize the revenue of the platform and is given by rank_score = pCTR × bid. We follow this setting and perform actions on the rank_scores at the categorical level. In particular, we use a ratio δ_j to adjust the rank score of an item I_i that belongs to the j-th category, i.e., rank_score'_i = rank_score_i × δ_j. This δ is the output of the agent's action; it affects the ranking of items and thus decides the final displayed item. For a discrete action setting, we define three actions: a boosting action with δ_j > 1, a restraining action with δ_j < 1, and a keeping action with δ_j = 1; the value of δ for each action should be tuned under this setting.

Reward. As mentioned in Section 4, our reward is defined as the advertisers' revenue minus their budget cost. This reward setting may suffer from a locally optimal policy: the agent learns to increase rewards by reducing advertisers' budget cost. To tackle this problem, we propose a bid punishment mechanism that forces the agent to focus on improving revenue rather than reducing budget cost. We increase the bid price for the boosting action, bid' = bid × β with β ⩾ 1, and define

r_{t,i} = λ price_i y_t − β_j bid_i x_t, if a_j is the boosting action;
r_{t,i} = λ price_i y_t − bid_i x_t, otherwise,

where λ is set to balance the data magnitude between revenue and cost so that the agent can optimize revenue and cost equally (purchases y_t are sparser than clicks x_t). More details are discussed in our experiments.

State estimator. Following Eq. (1), we illustrate the belief updating process in Fig. 3(a). Considering the case where actions and observations are discrete, the transition function T: |S| × |A| × |S| can be parameterized by a 3-dimensional cube. Given a |S|-dim belief vector b_t and a performed action a_t, the first step is a dot product, b_t · T^{a_t}_{|S|×|S|}, where T^{a_t}_{|S|×|S|} is the transition matrix sliced from T along action a_t. Suppose there are G-dim observations and each dimension is independent of the others, so that O(o_t|s_i, a_t) = Π_{j=1}^G O(o_t[j]|s_i, a_t), where O(o_t[j]|s_i, a_t) is the probability of observing the j-th dimension of o_t given state s_i and action a_t. The second step is an element-wise multiplication of the result with the vector O^t_{|S|} = [O(o_t|s_1, a_t), O(o_t|s_2, a_t), ...]. The final step is the normalization ρ, b_{t+1}[i] = b_t[i] / Σ_j b_t[j], which produces the next belief b_{t+1}.

Policy learner. The policy learner is implemented with a deep neural network such as a multi-layer perceptron (MLP), as Fig. 3(b) depicts. The input of the policy network is the belief vector, and the output is split into |A| groups. The output of each group is passed through the smooth-max function of Eq. (8) to obtain the Q-value for each action. In this case, the η vectors for each action are embedded into the parameters of the hidden layers, which are trained end-to-end through the whole policy network by the gradient descent method in Eq. (9).

Simulator. For offline experiments, we build a simulator that imitates consumers' feedback by applying supervised learning techniques to real consumer behavior. Similar simulator settings can be found in [5, 9, 28]. Since a user's preferences can be time-dependent and also depend on the history of past ad impressions, we choose a recurrent model to make multi-task predictions of the real click x_t and purchase y_t. In particular, at each time-step t, we adopt an RNN model to output a vector p̂_t = (x̂_t, ŷ_t), where x̂_t and ŷ_t are the predicted probabilities of the click and purchase actions on I_t. The recurrent model is implemented by a one-stack-layer LSTM [29] with a hidden size of 256, and we unroll the LSTM cell to a maximum sequence length of 25. We optimize the simulator network using the sum of the cross-entropy losses between the ground truth p_t and p̂_t across all time-steps.
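The simulator described above can be sketched as follows. This is a simplified, hypothetical PyTorch version of a one-layer LSTM that jointly predicts click and purchase probabilities at every step; the class name, batch shapes, and toy data are assumptions.

```python
import torch
import torch.nn as nn

class FeedbackSimulator(nn.Module):
    """One-layer LSTM that predicts (click, purchase) probabilities per time-step."""
    def __init__(self, feat_dim=31, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, 2)           # logits for click and purchase

    def forward(self, x):                          # x: (batch, seq_len <= 25, feat_dim)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.head(h))         # (batch, seq_len, 2)

model = FeedbackSimulator()
x = torch.randn(4, 25, 31)                         # a toy batch of user trajectories
y = torch.randint(0, 2, (4, 25, 2)).float()        # click/purchase labels per step
loss = nn.functional.binary_cross_entropy(model(x), y)  # cross-entropy over both tasks
loss.backward()
```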
6 EXPERIMENTS
We showcase the effectiveness of our approach in a series of simulated experiments and live experiments in the real-world Taobao ad system.

We consider two scenarios on the homepage of the Taobao App: 1) Good Items targets consumers with high expense, so the ad items are usually of high quality; 2) Guess What You Like aims to perform personalized advertising strategies, and thus the items are chosen based on users' preferences, interests, and recent behaviors.
The dataset includes 58,648 request sessions from 4,988 sampled users in the two scenarios within three days. Each request contains a candidate ad set D (|D| ≥ 50). The dataset is available at https://github.com/465935564/sequential_advertising_data.
Figure 4: Training results of the simulator and the EM model. (a) Simulator learning (loss); (b) simulator learning (AUC); (c) simulator predictions; (d) learning curves of the EM model.
Each category has 5 ads on average. As each scenario has only one ad position, we have K = 1. All the request sessions of a consumer are sorted by session time to form a consumer trajectory. We use 90% of the trajectories as the training set and hold out the remaining 10% for test evaluation. To conduct offline experiments, we train an environment simulator to imitate user click and purchase actions on an ad item. When the agent decides on an item for a user, the simulator generates the click and conversion rates for this ad-user pair, from which we sample the final click/purchase actions. The consistency between the simulated data and the real-world data is important, so we evaluate the simulator in three ways.

We first show the learning loss on the training and test sets in Fig. 4(a), which illustrates that the loss converges well. The learning accuracy (AUC score) of the two predictions is given in Fig. 4(b): we achieve 0.732 AUC for click and 0.771 AUC for purchase at the 50th epoch, which demonstrates the prediction ability of the learned simulator. Apart from the accuracy curves, we compare the simulated predictions with the ground-truth data in Fig. 4(c); the figure shows that the simulator correctly predicts the trends of the real data. Beyond that, we also observe an interesting phenomenon: the conversion rate (the blue line in Fig. 4(c)) increases if we impress a user with repeated displays, which indicates the potential benefit of sequentially repeated advertising exposures.
In this part, the agent optimizes its advertising strategies based on the user latent states, using the feedback provided by the simulator. To infer user states, we train and evaluate the EM model by the log-probability curves of the observed sequences in the test set, shown in Fig. 4(d), where the parameters converge well. The value of |S| controls how finely we split users' latent states S = {s_1, s_2, ...}. In our case, we use |S| = 3, since we observed no clear improvement with |S| > 3.

| Parameter | Method | Revenue | Cost | ROI | Reward |
|---|---|---|---|---|---|
| - | Manual bid | 100% | 100% | 100% | 100% |
| - | Bandit | 107.6% | 99.6% | 108.1% | 112.8% |
| γ=0.1 | DQN | 99.3% | 99.1% | 100.2% | 101.4% |
| | EM-DQN | 92.2% | 91.1% | 101.2% | 103.3% |
| | ADRQN | 95.5% | 97.1% | 98.3% | 107.2% |
| | DISA | | | | |
| γ=0.3 | DQN | 104.5% | 100.9% | 103.5% | 112.9% |
| | EM-DQN | 105.1% | 101.2% | 103.8% | 107.8% |
| | ADRQN | 110.7% | 101.3% | 109.2% | 114.5% |
| | DISA | | | | |
| γ=0.5 | DQN | 110.3% | 101.2% | 109.0% | 114.6% |
| | EM-DQN | 111.2% | 101.7% | 109.2% | 114.5% |
| | ADRQN | 112.9% | 102.7% | 110.0% | 116.7% |
| | DISA | | | | |
| γ=0.7 | DQN | 109.9% | 101.5% | 108.3% | 113.3% |
| | EM-DQN | 109.9% | 101.0% | 108.8% | 112.8% |
| | ADRQN | 116.8% | 103.4% | 112.8% | 120.0% |
| | DISA | | | | |
| γ=0.9 | DQN | 112.3% | 101.9% | 110.1% | 115.7% |
| | EM-DQN | 113.4% | 102.1% | 111.0% | 116.2% |
| | ADRQN | 117.0% | 103.0% | 113.6% | 121.0% |
| | DISA | | | | |

Table 1: Performance under different parameter settings.

Figure 5: Distributions of the learned parameter O(o|s_i) (Left), and the diagram of state transitions (Right).

The following algorithms are compared with our method under the same observation, action, and reward settings.

Manual bid. The bid strategy based on humans' experience [13].
Bandit. Contextual bandit [2] is an online algorithm that maximizes the total payoff of the chosen actions given the context.

DQN. DQN [20] is a model-free RL algorithm. It directly takes in the observations and outputs the ranking policy by selecting the action with the largest Q-value.

ADRQN. A recurrent variant of DQN where the current observation and the last time-step action are fed to an LSTM network [34]. This is a model-free POMDP where the latent state is implicitly captured and modeled by the LSTM.

DISA. Our proposed model-based POMDP algorithm. DISA explicitly estimates the beliefs (distribution of hidden states) and learns to optimize its policy through belief value approximation.
EM-DQN. A variant of DQN whose input is the beliefs inferred by DISA rather than raw observations. This method attempts to learn the mapping from beliefs to actions with model-free RL.

For fair comparisons, several experiments are conducted to show the performance of the different methods under the same γ parameter in Table 3. (Due to our settings of discrete actions and memory replays, we do not consider continuous-action or asynchronous RL techniques such as DDPG or A3C.) Each method is evaluated by the ROI indicator (revenue/cost) and the average reward (advertisers' profits). A higher ROI shows a stronger ability to earn more income with the
same budget cost. A higher reward is also important, as it indicates that a method can help advertisers obtain more profits.

From Table 3, almost all the RL-based methods achieve higher ROI than the Bandit method with γ ∈ {0.5, 0.7, 0.9}, since their decision-making is based on long-term rewards. Under the same setting of γ, DISA outperforms all the others in ROI while incurring almost the same cost as the other baselines. These results indicate the superiority of DISA: it not only helps advertisers earn more income per unit of budget cost but also improves profits. Compared with DQN, for all γ, the higher ROI of EM-DQN shows the benefit of inferring beliefs over black-box behavior-action mappings in a model-free fashion. Furthermore, in terms of ROI, DISA also demonstrates the advantage of the belief value approximation in SPOVA over the general neural network (pure belief-action mappings) in EM-DQN.

Table 2: Statistics of the learned parameters in the SE: the initial distribution b_0, the observation expectations E[O(o_j|s_i)] for each observation dimension (pv_gi, clk_gi, pv_gw, clk_gw, scen), and the transitions T(s'|s) for each state. Here, each o_j refers to one dimension of the o vector, i.e., O(o_j|s_i) = O(o[j]|s_i).

Essentially, the EM learns a mapping from high-dimensional historical observations/actions to a compressed belief state, and this mapping is reflected in the learned parameters T(s'|s_i, a), O(o|s_i, a), and b_0(s_i). By analyzing these parameters, we can see how each state connects with different observations, and we can then interpret the property of each state s_i. One direct way to do this is to compare the distribution of an observation O(o|s) across states; e.g., Fig. 5 (Left) illustrates that a large value of pv_gi is much more likely to be observed under one particular state than under the other two, so we can distinguish that state by the large value of the expectation E[O(o|s)]. According to such different expectations of each observation, we can explain the characteristics of each state. (For better explanation, we slightly abuse the notation in this section: we marginalize out O(o|s, a) and T(s'|s, a) over all a to obtain O(o|s) and T(s'|s), and we define E[O(o|s, a)] = Σ_i O(o_i|s, a) o_i, where o_i is the observed value.)

In our ad system, the observations reflecting a user's intent mainly include the numbers of exposures, clicks, and purchases of the ad for the user, as well as how the user behaves in different scenarios. More concretely, pv_gi and clk_gi represent how many previous exposures and clicks of an ad have been made in Good Items (similarly for pv_gw and clk_gw in Guess What You Like), and scen describes how frequently a user switches to other scenarios. Here, we neglect purchase observations, as that data is too sparse. Table 2 lists the learned parameters from Section 6.1.2 w.r.t. these observations, so we can now interpret each state as follows.

One hidden state is an awareness state, since users under it are observed to have little advertising exposure and few clicks, particularly in Guess What You Like. Another is an interest state, because we observe a large number of user browsing and click behaviors in this state, especially in Good Items. Compared with the interest state, the remaining state is more active because its users are more likely to switch
to Guess What You Like while maintaining a relatively high level of browsing behavior; this indicates that users in this state start to actively search for their interested items across different scenarios, and thus we label it a search state. Note that our analysis is compatible with the definition of the customer funnel revealed in [1, 8, 11, 22]; the differences are that 1) our results are data-driven and learned from a validated EM model, and 2) we treat the final conversion state as an observable state instead of a latent state that requires inference.

Furthermore, we can also verify the above interpretations through b_0 and T(s'|s), depicted in Fig. 5 (Right). b_0 tells us that almost 73% of users start from the awareness state, while 24% of users begin with the search state. T(s'|s) describes how each state transits: i) awareness → search → interest, and ii) awareness → interest. These transition routes indicate that a user's status always transits from awareness to interest/search rather than going in reverse, which is consistent with common sense.

Based on the interpretable states, we can compare the belief's evolutionary tracks under two different advertising strategies (DISA and Manual bid) on the same user trajectories. We collect all the inferred belief vectors and project them into a 2-dimensional space with PCA, as shown in Fig. 6(b). For better visualization, we use K-means to cluster those nodes into 3 clusters with different colors so that each cluster is dominated by one type of hidden state; e.g., more than 90% of the belief nodes in cluster 1 belong to a single state, as depicted in Fig. 6(a). We can therefore label each cluster with the property of its dominant state: cluster 1, cluster 2, and cluster 3 are regarded as an awareness stage, a search stage, and an interest stage, respectively. Furthermore, we compare the average reward collected at each stage in Fig. 6(c), which shows that the search/interest stages earn much higher rewards than the awareness stage; this in turn supports the rationality of our analysis of each state.

Figure 6: Belief clustering and the evolutionary trajectories with different strategies. (a) Clustering statistics; (b) evolutionary trajectories of beliefs in 2-d projection; (c) advertising actions and rewards.

Let us examine a typical trajectory where a consumer browses dress items in Good Items first with 6 requests and then in
Guess What You Like with 2 more requests. Fig. 6(b) gives the two evolutionary trajectories of the consumer's states under the DISA strategy and the manual-bid baseline. Both trajectories start from the awareness stage and end in the interest stage, but they separate after the 3rd advertising action. This separation leads to the main difference between the two trajectories: DISA successfully guides the hidden state to transit to the search stage, while the manual-bid baseline does not. We plot the performed actions and corresponding rewards in Fig. 6(c), which shows that the boosting actions in DISA dominate after the 3rd action. One reasonable explanation is that the boosting action can guarantee the display of ad items and further influence the consumer's perception of the items, especially in Good Items. Therefore, after the consumer switches to Guess What You Like, the repeated boosting on the same item helps transit the consumer's state to the search stage, which leads to a relatively higher reward as shown in Fig. 6(c).

Figure 7: Rewards under different β (Left), and rewards under different visiting frequencies (Right).
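The belief clustering used for Fig. 6(a)-(b) can be reproduced with a short scikit-learn sketch. It is illustrative only: the belief array below is random stand-in data rather than the beliefs actually collected from the state estimator.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

beliefs = np.random.dirichlet(np.ones(3), size=1000)   # stand-in for inferred belief vectors

coords = PCA(n_components=2).fit_transform(beliefs)    # 2-d projection for plotting (Fig. 6(b))
labels = KMeans(n_clusters=3, n_init=10).fit_predict(beliefs)

# Average belief mass per hidden state inside every cluster (cf. Fig. 6(a)).
for k in range(3):
    print(k, beliefs[labels == k].mean(axis=0).round(2))
```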
The value of β determines the degree of punishment for performing the boosting action (β = 1 means no punishment). With a small β, the agent finds it easier to use the boosting action to win the bidding, which increases impressions and cost and ultimately lowers rewards. A large β means fewer impression opportunities to obtain revenue and also yields low rewards. In Fig. 7 (Left), a moderate β achieves the best reward. λ is set to balance the magnitudes of revenue and cost as described in Section 5.3, and T_w is set to 3 hours since we find that 90% of conversions are reached within 3 hours. We compare the rewards for items with different visiting frequencies in Fig. 7 (Right). It is clear that the more a user interacts with an item, the more reward is gained. However, when the visiting frequency is less than 2, the reward becomes much lower, which suggests that it is hard to transfer users to the interest/search state within only two steps. This also shows that our model works better with longer sequences.
We conduct online A/B experiments in the live ad platform. The experiments ran from Oct. 26 to Nov. 2, 2019, involving 9,165,752 randomly sampled users, 664 advertisers, and 72,381 ad items from 12,401 categories. Our sequential advertising model (experimental group) is trained continuously using all user behaviors across 9 scenarios with a lag under 24 hours. The control group is a deployed production model (Cross-Entropy Method, CEM [6]) that optimizes for immediate rewards. We allocate the same budget cost to the control and experimental groups for each advertiser (we asked the advertisers for permission to adjust their budgets). We focus our discussion on the advertisers' revenue and ROI. As Fig. 8 shows, we achieved +9.02% revenue with essentially the same budget cost (-0.81%), resulting in +9.75% ROI for the experimental group. As our live results are promising, the algorithm has been officially deployed online and allows advertisers to customize their advertising strategies.

Figure 8: Increase in ROI over the control group.
7 CONCLUSION
In this paper, we proposed DISA to model the sequential advertising problem, optimizing the strategies while taking interpretability into account. We developed the POMDP framework in large-scale industrial settings to infer hidden states from the consumer's historical behaviors. To best fit our interpretable model, a variant of SPOVA based on deep neural networks was proposed to learn the value function and optimize advertising policies, and many implementation details were provided. The simulation and online A/B results validate the superiority of the proposed algorithm against several DRL baselines. In several case analyses, we interpret the learned hidden states, which are meaningful and consistent with business common sense.
REFERENCES
[1] Vibhanshu Abhishek, Peter Fader, and Kartik Hosanagar. 2012. Media exposure through the funnel: A model of multi-stage attribution. Available at SSRN 2158421 (2012).
[2] Robin Allesiardo, Raphaël Féraud, and Djallel Bouneffouf. 2014. A neural networks committee for the contextual bandit problem. In International Conference on Neural Information Processing. Springer, 374–381.
[3] Craig Boutilier and David Poole. 1996. Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of the National Conference on Artificial Intelligence. Citeseer, 1168–1175.
[4] Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 661–670.
[5] Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1187–1196.
[6] Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. 2005. A tutorial on the cross-entropy method. Annals of Operations Research (2005).
[7] In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1939–1948.
[8] Anindya Ghose and Vilma Todri. 2015. Towards a digital attribution model: Measuring the impact of display advertising on online consumer behavior. Available at SSRN 2672090 (2015).
[9] Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application. arXiv preprint arXiv:1803.00710 (2018).
[10] Eugene Ie, Vihan Jain, Jing Wang, Sanmit Navrekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, et al. 2019. Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. arXiv preprint arXiv:1905.12767 (2019).
[11] Bernard J Jansen and Simone Schuster. 2011. Bidding on the buying funnel for sponsored search and keyword advertising. Journal of Electronic Commerce Research 12, 1 (2011), 1.
[12] Wendi Ji and Xiaoling Wang. 2017. Additional Multi-Touch Attribution for Online Advertising. In AAAI. 1360–1366.
[13] Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising. arXiv preprint arXiv:1802.09756 (2018).
[14] Peter Karkus, David Hsu, and Wee Sun Lee. 2017. QMDP-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems. 4694–4704.
[15] Sven Koenig and Reid G Simmons. 1996. Unsupervised learning of probabilistic models for robot navigation. In Robotics and Automation, 1996 IEEE International Conference on, Vol. 3. IEEE, 2301–2308.
[16] M Mahmud. 2010. Constructing states for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 727–734.
[17] Andrew Kachites McCallum and Dana Ballard. 1996. Reinforcement learning with selective perception and hidden state. Ph.D. Dissertation. University of Rochester, Dept. of Computer Science.
[18] R Andrew McCallum. 1993. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning. 190–196.
[19] Rahul Meshram, Aditya Gopalan, and D Manjunath. 2016. Optimal recommendation to users that react: Online learning for a class of POMDPs. In Decision and Control (CDC), 2016 IEEE 55th Conference on. IEEE, 7210–7215.
[20] Volodymyr Mnih et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
[21] Kevin P Murphy. 2000. A survey of POMDP solution techniques.
[23] In IJCAI, Vol. 95. 1088–1094.
[24] Andres C Rodriguez, Ronald Parr, and Daphne Koller. 2000. Reinforcement learning using approximate belief states. In Advances in Neural Information Processing Systems. 1036–1042.
[25] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. 2008. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems. 1225–1232.
[26] Nicholas Roy, Geoffrey Gordon, and Sebastian Thrun. 2005. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research 23 (2005), 1–40.
[27] Xuhui Shao and Lexin Li. 2011. Data-driven multi-touch attribution models. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 258–264.
[28] Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. 2018. Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning. arXiv preprint arXiv:1805.10000 (2018).
[29] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
[30] Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. 2018. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 1443–1451.
[31] Shuai Yuan and Jun Wang. 2012. Sequential selection of correlated ads by POMDPs. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 515–524.
[32] Shuangfei Zhai, Keng-hao Chang, Ruofei Zhang, and Zhongfei Mark Zhang. 2016. DeepIntent: Learning attentions for online advertising with recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1295–1304.
[33] Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning Tree-based Deep Model for Recommender Systems. arXiv preprint arXiv:1801.02294 (2018).
[34] Pengfei Zhu, Xin Li, Pascal Poupart, and Guanghui Miao. 2018. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1804.06309 (2018).

SUPPLEMENTARY
A.1 Derivation of parameter learning for DISA
Let us consider the discrete extended HMM of Section 5.1 with length L. Let the sizes of the observation, hidden-state, and action spaces be M, N, and A, respectively. Given a sequence of observations O = {o_1 ∼ o_L} and corresponding actions A = {a_0 ∼ a_{L-1}}, the POMDP model is parameterized as an extended HMM with θ = (b_0, T, O). Specifically, b_0(i) = P(s_1 = i) is the initial state distribution, T^j_{i,k} = P(s_{t+1} = j | s_t = i, a_t = k) is the transition function, and O_{i,k}(j) = P(o_t = j | s_t = i, a_{t-1} = k) is the observation function. The Q-function is defined as the expectation term that we need to maximize:

$$Q(\theta, \theta^t) = \sum_{s \in \mathcal{S}} q(s) \log P(\mathcal{O}, s \mid \mathcal{A}; \theta) = \mathbb{E}_{q(s)} \log P(\mathcal{O}, s \mid \mathcal{A}; \theta)$$

A.1.1 Extension of Baum-Welch procedures.
We extend the Baum-Welch procedure to estimate θ* from O and A. Our method repeats the following two steps until convergence:

(1) E-step: compute $Q(\theta, \theta^t) = \sum_s \log[P(\mathcal{O}, s \mid \mathcal{A}; \theta)]\, P(s \mid \mathcal{O}, \mathcal{A}; \theta^t)$;
(2) M-step: set $\theta^{t+1} = \arg\max_\theta Q(\theta, \theta^t)$.

First, noting that P(s, O|A) = P(s|O, A) P(O|A), we can write the Q-function as $\hat{Q}(\theta, \theta^t) = \sum_s \log[P(\mathcal{O}, s \mid \mathcal{A}; \theta)]\, P(s, \mathcal{O} \mid \mathcal{A}; \theta^t)$, since P(O|A) does not affect the maximization of Q in the M-step. Now P(O, s|A; θ) is easy to write:

$$P(\mathcal{O}, \mathcal{S} \mid \mathcal{A}; \theta) = P(o_1 \sim o_L, s_1 \sim s_L \mid a_0 \sim a_{L-1}; \theta) = b_0(s_1) \prod_{t=2}^{L} T^{s_t}_{s_{t-1}, a_{t-1}} \prod_{t=1}^{L} O_{s_t, a_{t-1}}(o_t)$$

Taking the log gives us:

$$\log P(\mathcal{O}, \mathcal{S} \mid \mathcal{A}; \theta) = \log b_0(s_1) + \sum_{t=2}^{L} \log T^{s_t}_{s_{t-1}, a_{t-1}} + \sum_{t=1}^{L} \log O_{s_t, a_{t-1}}(o_t)$$

Plugging this into $\hat{Q}(\theta, \theta^t)$, we get

$$\hat{Q}(\theta, \theta^t) = \sum_s \log b_0(s_1)\, P(s, \mathcal{O} \mid \mathcal{A}; \theta^t) + \sum_s \sum_{t=2}^{L} \log T^{s_t}_{s_{t-1}, a_{t-1}}\, P(s, \mathcal{O} \mid \mathcal{A}; \theta^t) + \sum_s \sum_{t=1}^{L} \log O_{s_t, a_{t-1}}(o_t)\, P(s, \mathcal{O} \mid \mathcal{A}; \theta^t)$$

Note that the parameters are subject to the constraints

$$\sum_{s'} T^{s'}_{s,a} = \sum_{o} O^{o}_{s,a} = \sum_{s} b_0(s) = 1.$$

Let $\hat{L}(\theta, \theta^t)$ be the Lagrangian

$$\hat{L}(\theta, \theta^t) = \hat{Q}(\theta, \theta^t) - \lambda_{b_0}\Big(\sum_{i=1}^{N} b_0(i) - 1\Big) - \sum_{i,k=1}^{N,A} \lambda_{T_{i,k}}\Big(\sum_{j=1}^{N} T^{j}_{i,k} - 1\Big) - \sum_{i,k=1}^{N,A} \lambda_{O_{i,k}}\Big(\sum_{j=1}^{M} O_{i,k}(j) - 1\Big)$$

First let us focus on $b_0(i)$. Setting $\partial \hat{L}(\theta, \theta^t)/\partial b_0(i) = \partial \hat{L}(\theta, \theta^t)/\partial \lambda_{b_0} = 0$, we obtain:

$$b_0(i) = P(s_1 = i \mid \mathcal{O}, \mathcal{A}; \theta^t)$$

Following a similar process to that for $b_0$, we have:

$$T^{j}_{i,k} = \frac{\sum_{t=2}^{L} P(s_{t-1}=i, s_t=j \mid \mathcal{O}, a_{t-1}=k; \theta^t)}{\sum_{t=2}^{L} P(s_{t-1}=i \mid \mathcal{O}, a_{t-1}=k; \theta^t)}$$

The final term is $O_{i,k}(j)$, which is slightly trickier. Let $I(x)$ denote an indicator function that is 1 if $x$ is true and 0 otherwise. Similarly to $T^{j}_{i,k}$, we finally get:

$$O_{i,k}(j) = \frac{\sum_{t=1}^{L} P(s_t = i \mid \mathcal{O}, a_{t-1}=k; \theta^t)\, I(o_t = j)}{\sum_{t=1}^{L} P(s_t = i \mid \mathcal{O}, a_{t-1}=k; \theta^t)}$$

For brevity, we write $\gamma(i, j, k) = P(s_{t-1}=i, s_t=j \mid \mathcal{O}, a_{t-1}=k; \theta^t)$ and $\gamma(i, k) = \sum_{j=1}^{N} \gamma(i, j, k) = P(s_{t-1}=i \mid \mathcal{O}, a_{t-1}=k; \theta^t)$. Note that both quantities can be computed efficiently by a variant of the forward-backward algorithm for extended HMMs.

A.1.2 Inference of extended HMMs.
In order to compute $\gamma(i, j, k)$, we need the forward pass, the backward pass, and the $\gamma$ computation for extended HMMs.

Forward pass: we use $\alpha(s_t, a_{t-1})$ ($t \le L$) to represent the probability of being in hidden state $s_t$ given observations $o_1 \sim o_t$ and conditioned on $a_0 \sim a_{t-1}$:

$$\alpha(s_t, a_{t-1}) = P(o_1 \sim o_t, s_t \mid a_0 \sim a_{t-1}) = \sum_{s_{t-1}} \alpha(s_{t-1}, a_{t-2})\, T^{s_t}_{s_{t-1}, a_{t-1}}\, O_{s_t, a_{t-1}}(o_t)$$

where $\alpha(s_1, a_0) = b_0(s_1)\, O_{s_1, a_0}(o_1)$.

Backward pass: similarly, we use $\beta(s_t, a_t)$ ($t < L$) to represent the probability of observing $o_{t+1} \sim o_L$ conditioned on $s_t$ and $a_t \sim a_{L-1}$:

$$\beta(s_t, a_t) = P(o_{t+1} \sim o_L \mid s_t, a_t \sim a_{L-1}) = \sum_{s_{t+1}} \beta(s_{t+1}, a_{t+1})\, T^{s_{t+1}}_{s_t, a_t}\, O_{s_{t+1}, a_t}(o_{t+1})$$

where $\beta(s_{L-1}, a_{L-1}) = \sum_{s_L} T^{s_L}_{s_{L-1}, a_{L-1}}\, O_{s_L, a_{L-1}}(o_L)$.

$\gamma$ computation: after we recursively compute $\alpha(s_t, a_{t-1})$ and $\beta(s_t, a_t)$ for each $s_t$, we can obtain $\gamma'(s_t, s_{t+1}, a_t)$, which is used to compute $\gamma(s_t, s_{t+1}, a_t)$:

$$\gamma'(s_t, s_{t+1}, a_t) = P(s_t, s_{t+1}, o_1 \sim o_L \mid a_0 \sim a_{L-1}) = \alpha(s_t, a_{t-1})\, T^{s_{t+1}}_{s_t, a_t}\, O_{s_{t+1}, a_t}(o_{t+1})\, \beta(s_{t+1}, a_{t+1})$$

Finally, we have:

$$\gamma(i, j, k) = P(s_{t-1}=i, s_t=j \mid \mathcal{O}, a_{t-1}=k; \theta^t) = \frac{\gamma'(s_{t-1}=i, s_t=j, a_{t-1}=k)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \gamma'(s_{t-1}=i, s_t=j, a_{t-1}=k)}$$
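A compact NumPy sketch of the forward-backward recursions above is given below. It is an illustrative toy implementation, not the production code: α and β are kept unnormalized as in the derivation, and the array layouts are assumptions consistent with the earlier belief-update sketch.

```python
import numpy as np

def forward_backward(obs, acts, b0, T, O):
    """Unnormalized alpha/beta passes and per-step pairwise posteriors.

    obs:  observation indices o_1..o_L; acts: action indices a_0..a_{L-1}
    b0:   (N,) initial state distribution
    T:    (N, A, N) with T[i, k, j] = P(s_{t+1}=j | s_t=i, a_t=k)
    O:    (N, A, M) with O[i, k, j] = P(o_t=j | s_t=i, a_{t-1}=k)
    """
    L, N = len(obs), len(b0)
    alpha = np.zeros((L, N))
    beta = np.ones((L, N))
    alpha[0] = b0 * O[:, acts[0], obs[0]]
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ T[:, acts[t], :]) * O[:, acts[t], obs[t]]
    for t in range(L - 2, -1, -1):
        beta[t] = T[:, acts[t + 1], :] @ (O[:, acts[t + 1], obs[t + 1]] * beta[t + 1])
    # gamma[t][i, j] ~ P(s_t=i, s_{t+1}=j | O, A), normalized per time-step
    gamma = np.zeros((L - 1, N, N))
    for t in range(L - 1):
        gamma[t] = alpha[t][:, None] * T[:, acts[t + 1], :] * \
                   (O[:, acts[t + 1], obs[t + 1]] * beta[t + 1])[None, :]
        gamma[t] /= gamma[t].sum()
    return alpha, beta, gamma
```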
A.2 Experiment Details
A.2.1 Observation and Action Settings.
To capture the accumulated effect of a user's repeated actions in different scenarios, we also construct a few features on top of the basic features. Specifically, for each user trajectory, we compute the accumulated pv, pCVR, and click to represent how many previous impressions and clicks have been made to a user for a category in each scenario. Let the subscript gw denote the accumulated features in Guess What You Like and gi denote those in Good Items; we thus have 6 more observation features: pv_gw, pCVR_gw, clk_gw, pv_gi, pCVR_gi, clk_gi. In total, we use a 31-dim vector to describe a data record, which includes item-related features, session-related features, and accumulated features.

Since we are modeling at the categorical level, we use an aggregation method to summarize the features of the items that belong to the same category, as shown in Fig. 9. The observation for each category is then described by a vector of statistical features, e.g., the mean, max, min, and standard deviation of each item-level observation. To speed up the calculations, the agent feeds all the categorical features of a request as a learning/execution batch and outputs the corresponding actions.

The recurrent model is implemented by a one-stack-layer LSTM with a hidden size of 256, and we unroll the LSTM cell to a maximum sequence length of 25. At each time-step, the simulator outputs a 2-dim vector representing the probabilities of click and purchase, which are optimized against real user feedback. Based on the training results of the simulator, we choose several important features for inferring a user's hidden state: price, bid, pCTR, pv_gw, clk_gw, pCVR_gw, pv_gi, clk_gi, pCVR_gi, and scen. To work with a discrete conditional HMM, we use a quantile-based discretization for each observed feature; in particular, pv_gi, clk_gi, pv_gw, and clk_gw are each mapped into small bounded integer ranges.

Algorithm 1: DISA
  Init a Q-network Q_a(b; η_a) for each action a and a trajectory replay memory D;
  Init the state estimator with parameters θ = (T, O, b_0);
  for e = 1 to E do
    Sample M trajectories J = {J_1 ∼ J_M} from D;
    Construct O = {o_1 ∼ o_T}, A = {a_0 ∼ a_{T-1}} from J;
    for i = 1 to I do
      θ_i = arg max_θ E_{p(s|O,A;θ_{i-1})} log p(s, O|A; θ_{i-1}),
      until log p(s, O|A; θ_{i-1}) does not increase;
    end
    Update the state estimator: set θ_e to a moving average of θ_{e-1} and θ_I;
    Create a new user trajectory J for a category;
    for t = 1 to T do
      Get the current observation o_t and the previous b_{t-1}, a_{t-1};
      Perform belief updating b_t = SE(b_{t-1}, a_{t-1}, o_t; θ_e);
      With probability ε select a random action a_t, otherwise select a_t = arg max_a Q(b_t; η_a);
      Execute a_t, receive a reward r_t, and store the transition ⟨b_t, o_t, a_t, r_t⟩ into the trajectory J;
      Sample a minibatch of N transitions ⟨b_j, a_j, r_j, b_{j+1}⟩ from all trajectories in D;
      Update η_a by minimizing the loss with Eq. (9);
    end
    Update the trajectory replay memory D = D ∪ {J};
  end
Figure 9: Batch training and execution at the categorical level.
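The category-level aggregation in Fig. 9 can be sketched as follows. This is illustrative pandas code; the column names and the example values are placeholders rather than the actual feature schema.

```python
import pandas as pd

# One row per candidate item in a request; columns are item-level observation features.
items = pd.DataFrame({
    "category": ["chair", "chair", "mouse"],
    "pctr":     [0.031, 0.045, 0.020],
    "bid":      [1.2, 0.8, 2.0],
})

# Summarize item-level features into per-category statistics (mean/max/min/std),
# producing one observation vector per category for the batched agent input.
cat_obs = items.groupby("category").agg(["mean", "max", "min", "std"]).fillna(0.0)
cat_obs.columns = ["_".join(c) for c in cat_obs.columns]
print(cat_obs)
```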
| Method | Parameter | Revenue | Cost | ROI | Reward |
|---|---|---|---|---|---|
| Manual bid | - | 100% | 100% | 100% | 100% |
| DISA | γ=0.1, n=5 | 100.3% | 98.2% | 102.2% | 107.9% |
| | γ=0.3, n=5 | 111.7% | 100.7% | 110.9% | 115.9% |
| | γ=0.5, n=5 | 113.8% | 101.1% | 112.5% | 117.5% |
| | γ=0.7, n=5 | 119.0% | 101.9% | 116.7% | 122.1% |
| | γ=0.9, n=5 | 120.9% | 100.7% | 120.0% | 125.2% |
| | γ=0.9, n=1 | 110.6% | 100.2% | 110.3% | 113.3% |
| | γ=0.9, n=2 | 117.1% | 102.0% | 114.8% | 122.3% |
| | γ=0.9, n=3 | 117.6% | 101.6% | 115.8% | 122.7% |
| | γ=0.9, n=4 | 119.5% | 103.3% | 115.6% | 124.6% |
| | γ=0.9, n=5 | 120.9% | 100.7% | 120.0% | 125.2% |

Table 3: Hyper-parameter tuning in DISA.
Figure 10: The learning curves of cost and ROI.

The feature scen equals 1 if the current scenario is
Guess What You Like and 0 otherwise. Note that the discretization does not affect the features' monotonicity; e.g., a larger discretized pv_gw still indicates more previous impressions in Guess What You Like, and scen = 1 indicates more recent activity in Guess What You Like than scen = 0. The boosting, keeping, and restraining actions use δ = 10, δ = 1, and δ = 0.1, respectively. For an ad item, the boosting action with δ = 10 can almost guarantee winning the bidding, while the restraining action with δ = 0.1 can almost prevent it from winning the bidding. Since the distribution of user hidden states is stationary and does not drift over time in our experiments, the learned parameters are fixed while optimizing the agent's policies.
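The two mechanical pieces of this subsection, quantile-based feature discretization and the δ-scaled rank score, can be sketched as below. This is illustrative Python; the bin count and the example values are assumptions, not the production configuration.

```python
import numpy as np

def quantile_discretize(values, n_bins=4):
    """Map a continuous feature (e.g. accumulated pv) to integer bins by quantiles."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)

def adjusted_rank_score(pctr, bid, delta):
    """eCPM-style score with the agent's categorical action applied: pCTR * bid * delta."""
    return pctr * bid * delta

ACTION_DELTA = {"boost": 10.0, "keep": 1.0, "restrain": 0.1}

pv_gw = np.array([0, 1, 1, 3, 8, 15])
print(quantile_discretize(pv_gw))
print(adjusted_rank_score(pctr=0.03, bid=1.5, delta=ACTION_DELTA["boost"]))
```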
A.2.2 Policy Learning with Trajectory Replays.
Off-policy RL is well suited to our problem because the agent passively responds to user requests, and the next request might come from a different user. Thus, for every user-category pair, we rely on a trajectory replay pool to store the corresponding experience tuples used for constructing transition samples. The updating of the state estimator is performed along with the policy learning to cover the patterns of newly arrived user trajectories. For each Q-network, the target-network freezing technique is also adopted to stabilize the learning process. The training of DISA is formalized in Algorithm 1.
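A minimal sketch of the per-(user, category) trajectory replay pool described above; the data layout and names are assumptions.

```python
import random
from collections import defaultdict

class TrajectoryReplay:
    """Stores <belief, observation, action, reward> tuples per (user, category) trajectory."""
    def __init__(self):
        self.pool = defaultdict(list)

    def append(self, user, category, transition):
        self.pool[(user, category)].append(transition)

    def sample_transitions(self, batch_size):
        """Build (b_t, a_t, r_t, b_{t+1}) samples across all stored trajectories."""
        samples = []
        for traj in self.pool.values():
            for t in range(len(traj) - 1):
                b, _, a, r = traj[t]
                samples.append((b, a, r, traj[t + 1][0]))
        return random.sample(samples, min(batch_size, len(samples)))
```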
A.2.3 Hyper-parameter Tuning.
The discount factor γ determines the importance of future rewards. In Table 3, we find that almost all the settings perform better as γ increases from 0 to 0.5. This result shows the existence of delayed future rewards and confirms the multi-step decision-making property of our problem. The value of n in Eq. (9) decides into how many regions the belief space is split. When n is 1, the belief value function for each action is represented by a single linear hyperplane; as n increases, the value function is represented by more hyperplanes, leading to a finer and more accurate belief partition. In our experiments, by tuning γ and n, we find that γ = 0.9 and n = 5 give the best results. Fig. 10 illustrates the learning process of the different methods under the best parameter setting, from which we can see that our method DISA achieves higher ROI and also converges faster than the others.