Surveys without Questions: A Reinforcement Learning Approach
Atanu R Sinha, Deepali Jain, Nikhil Sheoran, Sopan Khosla, Reshmi Sasidharan
Adobe Research, India; Adobe
[email protected], [email protected], {sheoran, skhosla, rsasidha}@adobe.com

Abstract
The ‘old world’ instrument, the survey, remains a tool of choice for firms to obtain ratings of the satisfaction and experience that customers realize while interacting online with firms. While avenues for surveys have evolved from emails and links to pop-ups while browsing, the deficiencies persist. These include reliance on the ratings of very few respondents to infer about all customers' online interactions; failure to capture a customer's interactions over time, since the rating is a one-time snapshot; and inability to tie customers' ratings back to specific interactions, because the ratings provided relate to all interactions. To overcome these deficiencies we extract proxy ratings from clickstream data, typically collected for every customer's online interactions, by developing an approach based on Reinforcement Learning (RL). We introduce a new way to interpret the values generated by the value function of RL as proxy ratings. Our approach does not need any survey data for training. Yet, on validation against actual survey data, the proxy ratings yield reasonable performance results. Additionally, we offer a new way to draw insights from the values of the value function, which allows associating specific interactions with their proxy ratings. We introduce two new metrics to represent ratings: one at the customer level and the other at the aggregate level for click actions across customers. Both are defined around the proportion of all pairwise, successive actions that show an increase in proxy ratings. The intuitive customer-level metric enables gauging the dynamics of ratings over time and is a better predictor of purchase than customer ratings from surveys. The aggregate-level metric allows pinpointing actions that help or hurt experience. In sum, proxy ratings computed unobtrusively from clickstream data, for every action, for each customer, and for every session, can offer an interpretable and more insightful alternative to surveys.
With all the changes on the frontiers of the online world, one thing remains the same. An old-world instrument, the survey, is still a tool of choice for firms obtaining ratings from customers about their degree of satisfaction and level of experience with online interactions on firms' websites. To obtain customer ratings, the relentless use of surveys may be surprising in the new online world, where so many rules have been rewritten. To be sure, avenues for surveys have evolved from emails and links to pop-ups while browsing. That said, regardless of how surveys are conducted, firms rely upon asking questions of customers to obtain ratings of satisfaction and experience with online interactions. Unfortunately, very few customers respond. With increasingly common pop-up surveys reporting responses from less than one percent of customers in our data, the sampling bias is apparent. Our proposed approach, applied on clickstream data collected by sites for every visitor, provides proxy ratings of online interaction experience for one hundred percent of customers, without asking questions. Henceforth, we use customer rating for numerical feedback obtained through surveys, and proxy rating for derived feedback computed from clickstream data. We address three major deficiencies of surveys: (i) Customer ratings obtained from very few respondents comprise the basis on which all customers' ratings for online interactions are inferred. The difference in ratings between the small percentage who rate and the large majority of customers who do not is difficult to account for and often ignored. (ii) Surveys constitute a blunt instrument, since a customer's rating cannot be tied back to her specific online interactions; instead, it relates to all past interactions.

(∗ Co-lead authors. † Corresponding author. ‡ Performed this work at Adobe Research, now at Google AI.)
E.g., if a customer performs a sequence of 10 click actions, at which time the survey appears and she responds, that rating cannot be tied back to specific click action(s); rather, it relates to the whole sequence of 10 actions. (iii) Even for the few customers whose ratings are known, a survey is a one-time, snapshot rating, since the same customer may not respond more often even if surveyed. Hence, survey ratings do not capture the dynamics of online interaction experience over time. We posit that all three deficiencies can be overcome by considering customers' choice of browsing actions in a decision-theoretic manner, using Reinforcement Learning (RL) (Sutton 1988) and interpreting the value function in a new way. We show that RL can facilitate drawing rich insights about individual customers and specific site interactions, which extends RL research (Sutton and Barto 2017) in a new direction.

Consider an e-commerce platform. A customer types search words, applies filters, performs clicks to get specific product information, and dives into details on a few products. She may add to cart and may end up purchasing. Our premise is that since customers' ratings are best mapped to customers' behaviors, and those behaviors manifest in click actions, clickstream data can provide useful signals of ratings. The marketing science literature informs us that customers are decision oriented in their browsing behaviors (Moe 2006). We posit they are forward looking, learning from past and current click actions to choose future click actions, keeping in mind their eventual goals, such as search or making a purchase. This long view may include successive sessions, where information learned in one session helps a customer decide whether or where to start browsing in the next session. We model this decision orientation using RL. Given a goal and a reward function, the value function of our RL model generates the value of being in a state, for every state, and for every customer.
States map to the history of past states and click actions. Thus, we have values corresponding to every click action, for each customer, given her sequence of click actions. We interpret these values as signals of her rating at each click action, or proxy ratings. First, to test this interpretation, we validate customers' proxy ratings against actual survey ratings from the same customers. These survey ratings fit well with the goal of proxy ratings, since the survey question in our data specifically seeks a response on the customer's website experience and not on satisfaction with a product. Importantly, no survey customer ratings are part of the model's training process. Moreover, the customer survey ratings are obtained in a natural manner, as part of pop-up surveys that the website routinely conducts (that is, not collected as an experiment). Second, making use of proxy rating values at each click action, we identify click actions that increase or decrease ratings, that is, click actions that enhance or hinder good experience. On validation against actual survey data, the proxy ratings show reasonably good performance. On the task of action identification, we obtain insightful results about specific pages on the website that under-perform. Additionally, we test the usefulness of proxy ratings on an auxiliary task of purchase prediction, which shows good performance.

We make the following contributions to the literature. One, we extend RL to a new domain of customer ratings, with a focus on interpretability and on drawing rich insights from the value function. Two, our approach unobtrusively computes proxy ratings for one hundred percent of customers. Three, proxy ratings are computed for each click action of each customer, enabling identification of specific interactions that help or hurt customer goals. Four, proxy ratings can be obtained for each session of each customer, allowing observation of customer dynamics over time.
Five, our approach does not need any survey training data. Contributions two through four address the three deficiencies of surveys described at the outset. Validation of our approach against customer ratings from actual pop-up surveys is an important feature of our work.

Research on implicit measurement of satisfaction in the search domain focuses on metrics such as dwell time and search-result clicks to improve search outcomes. One finding is that implicit measurement correlates with explicit, question-based measures (Kim et al. 2014; Wang et al. 2014), lending support to our thesis. Other work on search satisfaction includes a structural learning model incorporating action-level dependencies using structured features (Wang et al. 2014), and studying difficulties faced while searching for relevant information (Odijk et al. 2015). Deviating from work in search, our problem is about decision making during interactions on an online platform (Moe 2006) to obtain ratings of interaction experience, and doing so at the granularity of every click action. A recent work in recommendation systems (Zhao et al. 2018) utilizes both explicit and implicit feedback from click actions to learn an optimal policy, through trial and error, for recommending items. Instead, our goal is to measure proxy ratings for the latent construct of experience from sequences of actions. Other research that mines clickstream data for measuring experience includes visualizations of common paths of site visitors (Liu et al. 2017) and inferring personas of users (Zhang, Brown, and Shankar 2016). But computation of ratings from clickstream data is not addressed. RL has an established literature with extensions in many avenues (Sutton and Barto 2017). Although the interpretability of machine learning models has been studied (Lipton 2016), interpretability of the value function in RL for insights about user interactions is not explored.
To our knowledge, the RL literature does not examine computation of customer proxy ratings from mere clickstream data. In using clickstream data in RL, one exception is (Jain et al. 2018), which measures user experience to predict purchase through a supervised approach that uses purchase data for training. In a departure from this work, we uncover proxy ratings at each action, for each customer, and perform direct evaluation against actual survey customer ratings, but without using survey data for training. Recurrent Neural Networks (RNNs) are used for prediction tasks from click actions (Lang and Rettenmeier 2017). Problems studied include predicting sequential clicks for sponsored search (Zhang et al. 2014) and recognizing sequences of tweets for purchase prediction (Korpusik et al. 2016). We use an RNN to obtain representations of states, capturing the sequential nature of actions, which we use as input for the RL algorithm.
In modeling browsing behaviors, Moe (2006) posits that "In addition to the decision of whether to continue searching, the consumer must also decide which item, if any, to view. At each decision point, the consumer must decide what the next item to view should be (p. 683)". We model a customer's online browsing in a decision-theoretic framework because at each click action she decides whether to stop browsing, or to continue browsing and, if so, which click action to select. The decisions are conditional on the rewards her past actions yield and her expectation of future rewards. For example, in online shopping the rewards are products she discovers, which can be good or poor, resulting in high or low reward. We recognize that a customer learns as she traverses a longer sequence of actions, and that her future sequence of actions may change due to learning and is a function of her goals. This process naturally fits with RL. The value generated by the value function at each action, which guides the customer's future actions, serves as a signal of her reward, which yields the proxy rating. The formal model is presented next.
In line with work in RL (Sutton and Barto 2017), we use a Markov process to model customers' browsing behaviors on an e-commerce site. However, drawing from Section 2.2 of (Yu, Mahmood, and Sutton 2017), the Markov process is defined using the history of past states, not only the most recent state. We augment each action with a vector containing information from history, as described next.

Consider a state space $S = \{s_1, s_2, s_3, \ldots\}$ and a reward function $r : S \rightarrow \mathbb{R}$. At time $t$, a user in state $S_t \in S$ receives a reward $r(S_t)$. The transition probability function is $P(s_i, s_j) = Pr(S_{t+1} = s_j \mid S_t = s_i)$. Let the sequence of click actions observed in a user's browsing journey till time $t$ be $[A_1, A_2, \ldots, A_t]$, where $A_i \in \mathcal{A} = \{a_1, a_2, \ldots, a_{|\mathcal{A}|}\}$. Let a vector $\vec{h}_{t-1}$ of $d$ dimensions encode all the historical information from the sequence $[A_1, A_2, \ldots, A_{t-1}]$. Then, the state at $t$ is represented as a tuple, $S_t = (\vec{h}_{t-1}, A_t)$. Define an encoding function $g : S \rightarrow \mathbb{R}^d$ such that $\vec{h}_0 = \vec{0}$ and $\vec{h}_t = g(\vec{h}_{t-1}, A_t)$. Here, $S_t$ comprises $\vec{h}_{t-1}$, a fixed-dimension continuous vector encoding the history of actions, and $A_t$, the most recent action. The fixed dimension is selected to be 150, based on limited hyper-parameter tuning (50, 100, 150, 200). Note that $\vec{h}_t$ does not grow with $t$, being calculated using the encoding function $g$ defined above. The encoder is the hidden state of an RNN trained to predict the next action.

The choice of the reward function $r$ is important in RL. We assume a website can define $r$ based on its objectives and domain knowledge.
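The recursion $\vec{h}_t = g(\vec{h}_{t-1}, A_t)$ and the state tuple $S_t = (\vec{h}_{t-1}, A_t)$ can be illustrated with a toy recurrent encoder. The paper's encoder is the hidden state of an RNN (an LSTM in the implementation) trained on next-action prediction; the sketch below uses a plain Elman RNN with untrained random weights purely to show the shapes involved, so the weights, the toy journey, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS, EMBED, HIDDEN = 46, 150, 150   # 46 actions and dim 150 per the paper

# Parameters of a minimal Elman RNN (the paper uses an LSTM; a plain
# RNN keeps this sketch short). Untrained random weights for illustration.
E  = rng.normal(0, 0.1, (N_ACTIONS, EMBED))    # action embeddings
Wx = rng.normal(0, 0.1, (EMBED, HIDDEN))
Wh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
Wo = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS))   # next-action prediction head

def encode(actions):
    """Return history vectors h_1..h_T for one journey.

    Implements h_t = g(h_{t-1}, A_t): each h_t is a fixed-dimension
    encoding of the whole prefix [A_1..A_t], starting from h_0 = 0."""
    h = np.zeros(HIDDEN)
    history = []
    for a in actions:
        h = np.tanh(E[a] @ Wx + h @ Wh)
        history.append(h)
    return np.stack(history)

def next_action_probs(h):
    """Softmax head used to train the encoder on next-action prediction."""
    z = h @ Wo
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

journey = [3, 7, 7, 12, 44]          # hypothetical action ids
H = encode(journey)                  # shape (5, 150); H[t] encodes A_1..A_{t+1}
state = (H[-2], journey[-1])         # S_t = (h_{t-1}, A_t)
```

Note that, as in the paper, the history vector has fixed dimension regardless of journey length, so states for journeys of 10 or 200 actions have identical shape.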
A site does not know a customer's goal. However, the site assigns rewards for its own objective that align with the customer's goal, e.g., making a purchase. Alternative actions of customers may also be of interest to a site. A website may want to monitor the click action cart addition for re-targeting purposes, and this action can be assigned a high reward by the site. Or, if the site is interested in purchase, this action is assigned a high reward. We assume a simple reward function for this implementation:

$$r(S_t) = \begin{cases} 1, & \text{if } A_t = \text{Purchase} \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$

Since purchase is of interest to a site and is also a well-defined customer goal, we assign the purchase action a value of 1 and, for simplicity, in the absence of domain-specific knowledge, assign zero to every other action. With the benefit of domain knowledge, an alternative formulation could assign different rewards across actions and goals.

We define the value of any state, $V(S_t)$, as the total expected discounted reward after $t$, under the state-transition probability distribution $P(\cdot)$:

$$V(S_t) = E(r(S_{t+1}) + \gamma r(S_{t+2}) + \gamma^2 r(S_{t+3}) + \ldots) \quad (2)$$

where $\gamma \in [0, 1]$ is the discounting factor. The above expression can be written in the form of the Bellman equation as follows:

$$V(S_t) = E(r(S_{t+1}) + \gamma V(S_{t+1})) \quad (3)$$

The encoder is an LSTM model trained to predict the next action by using the sequence of click actions as input. The categorical actions are embedded into a latent space of dimension 150 and fed as input to an LSTM layer that minimizes categorical cross-entropy loss on next-action prediction. The hidden representation obtained from the output of the LSTM layer (at each timestep of the sequence), along with the action, constitutes the state representation.

To solve Equation 3, we use an approach based on TD-learning (Sutton 1988), in which the values of $V(\cdot)$ are estimated in a model-free fashion directly from the stream of events.

Figure 1: Illustration of the model-free approach.
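A minimal sketch of this model-free TD estimation with the paper's purchase reward, assuming a linear parameterization of the value function; the discount $\gamma$, learning rate $\alpha$, feature dimension, purchase-action id, and the toy event stream are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

GAMMA, ALPHA = 0.9, 0.01    # discount and learning rate (assumed values)
PURCHASE = 0                # id of the Purchase action (assumed)
D = 8                       # state-feature dimension (toy size)

theta = np.zeros(D)         # parameters of a linear f_theta(s) = theta . s

def reward(action):
    """Reward function of Eq. (1): 1 for Purchase, 0 otherwise."""
    return 1.0 if action == PURCHASE else 0.0

def value(s):
    return float(theta @ s)

def td_update(s_t, s_next, a_next):
    """Semi-gradient TD(0) step on the TD error (cf. the update rule
    used to fit f_theta): move V(s_t) toward r + gamma * V(s_next)."""
    global theta
    target = reward(a_next) + GAMMA * value(s_next)
    td_err = target - value(s_t)
    theta += ALPHA * td_err * s_t   # gradient of theta . s_t w.r.t. theta is s_t
    return td_err

# Toy stream: random feature states; the journey ends in a purchase.
states  = [rng.normal(size=D) for _ in range(6)]
actions = [5, 9, 2, 9, PURCHASE]
for t in range(5):
    td_update(states[t], states[t + 1], actions[t])

# The fitted values V(S_t), read off per state, are the proxy ratings.
proxy_ratings = [value(s) for s in states]
```

In the paper the features fed to $f_\theta$ are the LSTM state representations rather than random vectors, and training iterates over many journeys until the TD error converges.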
In this scenario, the transition function $P(\cdot)$ is not known to the model. Now consider a user who transitioned from $S_t$ to $S_{t+1}$, where only the current reward $r(S_{t+1})$ is observed. We make the following update to the current estimate of the value function based on this observation:

$$V'(S_t) = r(S_{t+1}) + \gamma V(S_{t+1})$$
$$TD_t = V'(S_t) - V(S_t)$$
$$V(S_t) \leftarrow V(S_t) + \alpha \, TD_t \quad (4)$$

where $\alpha$ is the learning rate and $V'$ is the estimate of the value based on the new observation. The difference between this new estimate and the current value is called the temporal-difference (TD) error. We update the current $V$ in the direction of the new estimate. After a sufficiently large number of observations the estimates converge to their fixed values. Figure 1 shows the schema.

For efficiency and generalization, we use a parameterized estimation of the value function. We define an estimation function $f_\theta$ with a set of parameters $\theta$ such that

$$f_\theta(S_t) = \hat{V}(S_t) \approx V(S_t) \quad (5)$$

Values of $\theta$ are randomly initialized to $\theta_0$. The optimal values are estimated through gradient descent on the TD error computed in Equation 4, until convergence is attained.

New Interpretation of V(·) and Metrics for Proxy Ratings

Let us define the k-th customer's observed journey as $J^{(k)} = [A_1, A_2, \ldots, A_m]$ and her proxy rating for action $A_t$ as $y^{(k)}_{A_t}$. The latter is equal to the computed value $V(S_t)$ at time $t$. Note that $m$ varies across customers. Consistent with the premise in the marketing literature that satisfaction and experience ratings are interpreted as change from expectations (Parasuraman, Zeithaml, and Berry 1985), the change in proxy ratings going from one action to the next is used as an indicator of actual ratings. We define a binary classifier for proxy ratings as follows. Given actions $A_{t-q}$ and $A_t$, we consider lag(q) as a change in proxy ratings from $A_{t-q}$ to $A_t$.
An increase in proxy ratings is taken as positive and assigned the value 1, and a decrease is taken as negative and assigned 0. For the k-th customer, define lag(q) as:

$$z^{(k)}_{A_{t-q}, A_t} = \begin{cases} 1, & \text{if } y^{(k)}_{A_t} - y^{(k)}_{A_{t-q}} > 0 \\ 0, & \text{otherwise.} \end{cases} \quad (6)$$

We introduce a new metric for ratings, labeled Proportion of Good Ratings and defined as the proportion of all pairwise, successive actions (that is, $q = 1$) that show an increase in proxy rating values. This simple metric intuitively captures the notion of how often actions lead to better ratings. The metric is defined in two ways, $Z(k)$ and $Z(a_u, a_w)$, each with its own purpose. Defined for each customer over her journey, $Z(k)$ renders the proportion of her pairwise successive actions that show an increase in proxy ratings. For the k-th customer,

$$Z(k) = \frac{1}{|J^{(k)}| - 1} \sum_{t=2}^{|J^{(k)}|} z^{(k)}_{A_{t-1}, A_t} \quad (7)$$

For a customer performing a sequence of 20 click actions, there are 19 pairwise, successive actions. Say 11 pairs show an increase in proxy ratings; the proportion $Z(k)$ is 11/19 in this example.

The second proportion, $Z(a_u, a_w)$, is defined for every pair of successive actions $(a_u, a_w)$ and represents the proportion of all instances of that pair of successive actions (again, $q = 1$) that show an increase in proxy ratings:

$$Z(a_u, a_w) = \frac{1}{N(a_u, a_w)} \sum_{k=1}^{K} \sum_{t=2}^{|J^{(k)}|} z^{(k)}_{A_{t-1}, A_t} \quad (8)$$

for those $t$ where $A_{t-1} = a_u$ and $A_t = a_w$, where $N(a_u, a_w)$ denotes the number of instances of the successive action pair $(a_u, a_w)$ in the data. When a pair of successive actions occurs in 1000 instances, with 350 of them showing an increase in proxy ratings, the proportion $Z(a_u, a_w)$ is 350/1000. A customer can traverse the $(a_u, a_w)$ pair multiple times in a session, where each traversal is a single instance. Such a customer contributes multiple instances to the computation of $Z(a_u, a_w)$. To wit, let $(a_u, a_w) = (\text{ProductCategory}, \text{ProductDetail})$.
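The two proportions defined above can be computed directly from per-journey proxy ratings. A minimal sketch; the journey data layout and action names are illustrative.

```python
from collections import defaultdict

def good_rating_proportion(proxy):
    """Z(k): share of the |J|-1 successive action pairs (q = 1)
    whose proxy rating increases (Eq. 7)."""
    ups = sum(1 for a, b in zip(proxy, proxy[1:]) if b - a > 0)
    return ups / (len(proxy) - 1)

def pairwise_proportions(journeys):
    """Z(a_u, a_w): per action pair, the share of its instances across
    all customers that show a proxy-rating increase (Eq. 8).
    `journeys` is a list of (actions, proxy_ratings) per journey."""
    ups, n = defaultdict(int), defaultdict(int)
    for actions, proxy in journeys:
        for t in range(1, len(actions)):
            pair = (actions[t - 1], actions[t])
            n[pair] += 1
            ups[pair] += int(proxy[t] - proxy[t - 1] > 0)
    return {pair: ups[pair] / n[pair] for pair in n}

# The worked example from the text: a 20-action journey in which
# 11 of the 19 successive pairs show an increase gives Z(k) = 11/19.
proxy = list(range(12)) + list(range(10, 2, -1))   # 11 ups, then 8 downs
zk = good_rating_proportion(proxy)                 # = 11/19

# A customer traversing the same pair twice contributes two instances.
journeys = [
    (["Home", "ProductCategory", "ProductDetail", "ProductCategory"],
     [0.1, 0.3, 0.2, 0.4]),
    (["Home", "ProductCategory", "ProductDetail"],
     [0.2, 0.1, 0.3]),
]
zpairs = pairwise_proportions(journeys)
# ("Home", "ProductCategory") occurs twice, once with an increase -> 0.5
```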
It is natural for a customer to go back and forth between these two pages at different points across the length of a session. We preserve this natural phenomenon while computing $Z(a_u, a_w)$, instead of using a single average value for this customer across all instances. Use of an average value per customer loses information on variability across instances within a customer.

Figure 2: Sequence of actions with survey for the k-th customer.

Clickstream data from the website of a consumer electronics company are used. The e-commerce site offers products in many consumer electronics product categories. For confidentiality reasons, we cannot disclose the name of the site. After filtering the data, click actions corresponding to the
Laptop category are retained. All click actions for each customer are stitched together chronologically into a sequence of click actions. Altogether, relevant click actions such as product details, search filter, add to cart, etc. are identified from the data. The set of unique actions is denoted $\mathcal{A} = \{a_1, a_2, \ldots, a_{46}\}$. The final data are sets of sequences of actions.

Survey Data
Without access to actual customer survey ratings we could not validate our approach, and this validation is crucial to establish our thesis that the values of the value function work as proxy ratings. The survey appears as a pop-up, without warning, during some customers' browsing sessions. We do not observe in the data any systematic pattern in who gets the survey or in which part of a customer's browsing session the pop-up is shown. The survey responses in the whole data come from more than 8,500 unique customers, constituting 0.7% of all customer journeys.

The data curation pipeline is described below. The click action of
Purchase is identified in the clickstream data whenever a purchase occurs. A customer's journey may or may not include a purchase; a purchase action indicates the end of a journey. Most journeys do not include a purchase. Also, journeys that include a purchase may qualitatively differ from those without any purchase. Based on these notions, a customer's journey can fall into one of the following categories, each of which we want represented:

1. One purchase to another purchase.
2. Starting in the observation period and ending in a purchase.
3. Beginning after a purchase in the observation period and ending without a purchase in the observation period.
4. No purchase throughout the observation period.

To remove outlier journeys based on length, we restrict to journeys with lengths in the range of 10 to 210 click actions, or hits. Journeys of fewer than 10 click actions provide little signal to the sequence model. The distribution of hits shows that journeys of fewer than 210 hits cover upwards of 96% of all journeys.

Table 1: Sample Actions and Frequencies
Action           Number      Action          Number
AddToCart        37,233      Prod.Detail     262,947
Customize        30,090      Promotion       111,648
Home             119,236     Search          34,153
Prod.Category    180,299     ViewCart        156,491
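The curation steps described above (stitching hits into chronological per-customer journeys, ending a journey at a purchase action, and dropping outlier lengths) can be sketched as follows; the hit-tuple layout and action names are illustrative, not the paper's actual schema.

```python
from collections import defaultdict

def curate_journeys(hits, min_len=10, max_len=210):
    """Stitch per-customer hits into chronological action journeys and
    drop outlier lengths. `hits` is a list of
    (customer_id, timestamp, action) tuples (an assumed layout)."""
    by_customer = defaultdict(list)
    for cid, ts, action in sorted(hits, key=lambda h: (h[0], h[1])):
        by_customer[cid].append(action)

    journeys = []
    for cid, actions in by_customer.items():
        journey = []
        for a in actions:
            journey.append(a)
            if a == "Purchase":          # a purchase marks the end of a journey
                journeys.append((cid, journey))
                journey = []
        if journey:                       # trailing journey without a purchase
            journeys.append((cid, journey))

    # Keep journeys of 10 to 210 hits, as in the paper's length filter.
    return [(cid, j) for cid, j in journeys if min_len <= len(j) <= max_len]

# Toy usage: one 12-action journey ending in a purchase is kept;
# a 5-action journey is filtered out as too short.
hits = ([("c1", t, "Browse") for t in range(11)]
        + [("c1", 11, "Purchase")]
        + [("c2", t, "Browse") for t in range(5)])
kept = curate_journeys(hits)
```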
Data Sampling
In the final dataset, we keep the ratio of purchase to no-purchase journeys at 1:2, to guard against class imbalance. The number of journeys kept in each category is as follows:

1. All journeys, roughly 7,500 in number.
2. Around 10,000 journeys, including all journeys with a survey score.
3. Around 15,000 journeys, including all journeys with a survey score.
4. Around 20,000 journeys, including all journeys with a survey score.

Each hit in the clickstream data is mapped to an action. The final data are organized as a set of journey sequences. Each timestep in a journey contains information about the action performed, the time spent, the hit timestamp, the customer ID, and the survey score (if the action corresponds to a survey response). The dataset contains around 53,000 journeys from about 46,500 customers, with some customers having multiple journeys. Since we over-sample journeys that have a response to the pop-up survey, to obtain a sufficient number of customer ratings for validation, the sampled data have a higher percentage of ratings than the 0.7% in the whole data. The dataset is split randomly into training and testing groups. The frequency distribution of a few commonly occurring actions in the training data is shown in Table 1. The training data contain altogether 1,896,697 actions.

Z(k)

Figure 3 shows the distribution of $Z(k)$ by different journey lengths. Journeys are binned by length (up to [25, …]). We assume that proportions in a middle range around 0.5 do not discriminate between poor and good ratings. Most of the proportions fall in this middle range across journeys. Focusing on the areas to the left and right of this range, we find that these areas are not very different from each other for each journey length (e.g., 0.08 and 0.19 for journey length 50). We find no empirical evidence that longer journeys are associated with poorer ratings.
This is not surprising: customers do not judge satisfaction merely by reaching an end state quickly; finding the right product is important and satisfying even if the journey is longer. The finding that most journeys fall in the middle range is consistent with evidence that most customers are in the middle when it comes to satisfaction and experience and may not respond to surveys, thereby biasing survey ratings toward the extremes (Krosnick and Presser 2009).

Validation Strategy
Crucial to our validation is the pop-up survey. Consider the k-th customer's sequence of click actions and pop-up survey response, as shown in Figure 2. The survey pops up unbeknownst to the customer during her browsing session, provides an instantaneous measure, is likely to prompt a top-of-mind response during browsing, and is useful for online evaluation. It asks "Overall, how would you rate your experience on the [company name] website today?" on a 0-10 scale, where 10 is excellent. This question relates well to the proxy ratings we compute, since these ratings are based purely on clickstream data, which are manifestations of browsing behaviors. The validation we perform asks:
Do proxy ratings work as leading indicators of survey responses? The three parts of our experimental evaluation and validation are as follows:

1. Next-action prediction, to obtain representations of states capturing the history of action sequences.
2. Value iteration, to obtain proxy ratings for each click action of each customer.
3. Validation of the proxy ratings from Part 2 against customer ratings from the pop-up survey.

The pop-up survey customer ratings are not used for model training in Parts 1 and 2, and in Part 3 they are used only for final validation. Implementation of Parts 1 and 2 is explained in Section 3. We now explain the implementation of Part 3. In Figure 2, the k-th customer performs a few click actions, a pop-up appears at $A_{t(k)+1}$, and more click actions occur afterward. Since we want the approach to be applicable across different situations, we do not want the model to know when the survey appears. This is consistent with many online pop-up surveys, including the one in these data, where a customer does not know while browsing whether and when a pop-up survey may appear. Hence, the information about the survey is not treated differently from other click actions for the purpose of model training. Our premise is that a customer knows her goal and the set of click actions available to her, and decides which click actions to choose to reach her goal in a decision-theoretic manner. The model thus computes proxy rating values for all click actions of this customer. Also note that we do not want the model to know which customers receive the pop-up survey and which do not. Thus, we compute proxy ratings for all customers in the data based on Parts 1 and 2. Then, solely for validation in Part 3, we extract the data of customers who gave actual survey ratings, since validation can only be done for those customers. Each customer's response to the survey is indicated by $A_{t(k)+1}$.
Thus, for validation, we use proxy ratings up to and including $A_{t(k)}$, but do not use ratings for later actions. We do this to test whether proxy ratings up to and including $A_{t(k)}$ can predict survey responses with a reasonable degree of accuracy. An occurrence of purchase soon after, or much later than, the pop-up survey does not impact our results. Note that the position of the survey-response action $A_{t(k)+1}$ in the sequence of a customer's click actions varies across journeys; e.g., the survey may appear after 20 click actions ($t(k) = 20$) or after 100 click actions ($t(k) = 100$).

Figure 3: Distribution of Z(k), by journey length (L).

Table 2: Validation on test data against actual survey
Variation   Accuracy   Recall   Precision   F1-score
lag(1)      …          …        …           …
lag(2)      …          …        …           …

We use proxy ratings up to and including $A_{t(k)}$, but not any rating value after $A_{t(k)}$, where $t(k)$ varies across customers. The results presented are aggregated across customers, incorporating this variation. For each customer with a survey response, we consider the lag(1) and lag(2) values from Equation 6 for validation. With lag(1) (substituting $q = 1$ in Equation 6), if $z^{(k)}_{A_{t-1}, A_t} = 1$ we expect the actual survey rating to be good, while if $z^{(k)}_{A_{t-1}, A_t} = 0$ we expect a poor rating. We characterize lag(2) ($q = 2$ in Equation 6) similarly. For classification of the survey score, we use a simple approach. The 0-10 scale of the pop-up survey has a natural midpoint of 5, per the design of the scale. We assign 0-4 as poor, and 6-10 as good.

Validation Results
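The binarization of survey scores and the confusion-matrix evaluation described above can be sketched as follows; the toy arrays of lag indicators and survey scores are hypothetical, not the paper's data.

```python
def classify_survey(score):
    """Map the 0-10 survey scale to binary labels: 0-4 poor (0),
    6-10 good (1). The midpoint 5 is left out (an assumption)."""
    if score <= 4:
        return 0
    if score >= 6:
        return 1
    return None

def binary_report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the good (= 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Hypothetical data: lag(1) indicators z vs. binarized survey ratings.
z_lag1  = [1, 1, 0, 1, 0, 1]
survey  = [8, 7, 3, 2, 4, 9]
labels  = [classify_survey(s) for s in survey]
metrics = binary_report(labels, z_lag1)
```

The same comparison with the lag(2) indicator in place of `z_lag1` yields the second row of Table 2.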
With survey ratings and changes in proxy ratings classified as good and poor, we create a confusion matrix across all respondent customers and evaluate with common metrics: precision, recall, accuracy, and F1. The results on the test data are presented in Table 2. The first row uses $z^{(k)}_{A_{t-1}, A_t}$; the second row uses $z^{(k)}_{A_{t-2}, A_t}$. We find accuracy varying between 0.63 and 0.66, recall from 0.74 to 0.80, precision from 0.68 to 0.69, and F1 from 0.71 to 0.74. These numbers are reasonable, though not high relative to domains of prediction and recommendation. That said, (i) we do not use survey data to train the model; and (ii) we posit an RL model with purchase as the only goal, while customers arrive on a site with other goals as well, e.g., seeking information. If customers could be grouped by goals and different goal-specific rewards assigned, we expect the performance metrics to improve. Identification of customer goals becomes an interesting research problem in its own right. In summary, we show that ratings uncovered from clickstream data work as a reasonable proxy for actual survey responses, with the large benefit of being obtained for every customer, and for every session or journey.

Action Identification Strategy
Having shown that useful proxy ratings can be obtained from clickstream data, we now examine whether useful insights can be drawn for individual actions that customers perform on a site. By interpreting value function outputs as proxy ratings and introducing a new metric, we offer a systematic approach to identify actions that hinder or help toward better ratings. Since proxy ratings are computed for every click action customers perform, our approach provides insights to websites by identifying click actions that may require corrective measures. For example, if a pair of successive actions $(a_c, a_d)$ results in a poor proxy score across most customers, it behooves examining this sequence for a probable corrective measure, such as making the click action $a_d$ less readily available when $a_c$ is clicked. We perform action identification by confining attention to pairs of successive actions, that is, sequences of length two. Alternatively, we could consider sequences of length greater than two, but it is then difficult to attribute a good or poor rating to a specific pair of actions. As an illustration, if the sequence $(a_c, a_d, a_e, a_f)$ yields a poor rating, we cannot determine which among $(a_d, a_e, a_f)$ deserves attention without additional analysis.

For each pair of successive actions we compute the proportion of good ratings $Z(a_u, a_w)$, separately for journeys that include a purchase and for those that do not. This is important because we want to avoid the potential confound between ratings and whether customers end up with a purchase.

Table 3: Actions impacting experience
Source      Target            Purch.   Z(a_u, a_w)
Customize   ProductCategory   No       … ± …
Customize   ProductCategory   Yes      … ± …
Customize   ProductDetail     No       … ± …
Customize   ProductDetail     Yes      … ± …
Home        ProductCategory   No       … ± …
Home        ProductCategory   Yes      … ± …
Home        ProductDetail     No       … ± …
Home        ProductDetail     Yes      … ± …
Thus, for each of the purchase and no-purchase conditions, with 46 unique actions in our data, we have a 46x46 matrix, or 2,116 cells. However, since not all action pairs are observed in the data (e.g., (Home, Add To Cart)), empirically we have 1,861 populated cells. Note that the successive actions in a pair can be identical, e.g., (ProductDetail, ProductDetail). Most action-pairs are not popularly traversed by users; a site can choose to focus on action-pairs traversed above a threshold value.

Figure 4: Distribution of correlations between y^(k)_{A_t} and probability of purchase, by journey length (L)

Figure 5: Distribution of correlations between Z^(k) and probability of purchase, by journey length (L)

Action Identification Results
Table 3 conveys results for selected sample pairs, as a way to illustrate the types of interpretation and insight these proxy ratings can provide. The results are presented in pairs of rows, corresponding to (No, Yes) in the Purchase column, for every pair of successive actions (Source, Target). Reviewing the last column Z(a_u, a_w), note that a proportion close to 0.50 is not discernible for action identification, since the good and poor proportions are similar. We make two types of comparison: (a) for each successive pair of actions (Source, Target), compare proportions for those who purchase and those who do not; (b) for each type of customer group, non-purchasers and purchasers, compare proportions across action-pairs.

Considering (a), the proportions do not show systematic differences between the purchase and no-purchase groups, although the differences are statistically significant at 5%. In three action-pairs purchasers show a lower score, while in one pair (Customize, ProductDetail) they show a higher score. Across all 1,861 pairs in the data, we find no systematic differences between the two groups, supporting that our approach does not necessarily associate the proportion of good ratings with purchase. This is consistent with the literature that satisfaction and experience are not simply about purchase (Lemon and Verhoef 2016), and it shows that our approach to proxy ratings is not biased toward purchase.

Focusing on (b), first consider the action-pairs (Customize, ProductCategory) and (Customize, ProductDetail) for Purchase = Yes. Moving from Customize to ProductDetail generates significantly more good proxy ratings than moving from Customize to ProductCategory (mean values 0.77 versus 0.44). This can be interpreted using the well-known concept of the purchase funnel (Hoban and Bucklin 2015), in which transitions along the funnel move consumers from one stage to the next; e.g., Home to ProductCategory to Customize to ProductDetail. If a transition moves to a stage further ahead by skipping a stage, or moves backward, that works against a smooth progression. To wit, Customize is a more detail-oriented activity: a move from Customize to ProductCategory is a regressive step, while moving to ProductDetail is a progressive one. Thus, the latter is associated with a higher proportion of good proxy ratings. Next, consider the action-pairs (Home, ProductCategory) and (Home, ProductDetail). Within each of these action-pairs, the no-purchase and purchase conditions show mean values that are numerically close, namely (0.68, 0.63) and (0.22, 0.16), respectively. Across action-pairs, for both the no-purchase and purchase conditions, the values differ. Going from Home directly to ProductDetail, skipping ProductCategory, is interpreted as less fulfilling (0.22, 0.16), because customers often backtrack to ProductCategory and restart the funnel. The pair (Home, ProductCategory) is a natural progression, and its favorable scores (0.68, 0.63) support that interpretation. Thus, from (b), the differences show that proxy scores can be interpreted to yield useful insights associated with customer behavior in a marketing funnel.
We have validated proxy ratings against actual survey scores and identified relevant actions using proportions. As an auxiliary goal, we now address purchase prediction, since it is of common interest. In the following paragraphs we discuss two purchase prediction tasks: one, predicting purchase at each timestep of click action for every customer; and two, predicting whether a journey ends in a purchase.

The benchmark model for task one is an LSTM trained in a supervised manner to predict the probability of purchase at every timestep. For each timestep, our model yields a proxy rating as well as a proportion of good ratings. For every customer journey, we compute the correlation across timesteps between the probability of purchase and the proxy rating y^(k)_{A_t}. Figure 4 shows the distributions of these correlations by journey length. The correlations are centered around zero as uni-modal distributions; for journeys of length up to 25 the variance is large, but it decreases for longer journeys. We also compute, for every customer journey, the correlation across timesteps between the probability of purchase and the proportion of good ratings Z^(k). Figure 5 shows that these correlations form bi-modal distributions, a pattern different from that of Figure 4. This suggests that the proportion Z^(k) is a better discriminator than the proxy rating y^(k)_{A_t} in relating to the probability of purchase at each timestep. Additionally, our metrics are distinct from purchase probabilities at every timestep and capture information about customer interactions that purchase probability models do not.

Coming to task two, we know which journeys end in a purchase. We compute Z^(k) for every customer across the person's whole journey and use it to predict whether or not a purchase occurs at the end of the journey, obtaining an AUC of 0.73. If we instead use actual customer survey scores to predict purchase, we obtain an AUC of 0.51, no better than random prediction.
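Task two can be sketched end to end: compute Z^(k) as the proportion of successive proxy-rating pairs that increase, then score it against purchase outcomes with AUC. The rating sequences and labels below are made up, and the AUC uses the rank-comparison (Mann-Whitney) formulation rather than any particular library.

```python
# Sketch: using the customer-level proportion Z(k) to predict whether a
# journey ends in a purchase, evaluated by AUC. All data are hypothetical.

def journey_Z(ratings):
    """Z(k): proportion of successive rating pairs that increase."""
    pairs = list(zip(ratings, ratings[1:]))
    return sum(1 for a, b in pairs if b > a) / len(pairs)

def auc(scores, labels):
    """AUC via pairwise comparison of positive vs. negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical journeys: proxy-rating sequences and purchase outcomes.
journeys = [([0.1, 0.3, 0.2, 0.5], 1), ([0.4, 0.2, 0.1], 0),
            ([0.2, 0.4, 0.6], 1), ([0.3, 0.3, 0.1], 0)]
scores = [journey_Z(ratings) for ratings, _ in journeys]
labels = [purchased for _, purchased in journeys]
print(auc(scores, labels))  # 1.0 on this toy data
```

On real data the same pipeline yields the AUC of 0.73 reported above; the toy data here are chosen merely to make the computation transparent.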
Note that for the latter we confine ourselves to customers who provide survey ratings, most of whom do not purchase. The metric Z^(k) thus appears useful for predicting eventual purchase in a journey. Last but not least, another value-add of our model is that proxy ratings can be computed every time a customer browses a site, so her satisfaction and experience over sessions and journeys can be ascertained. This unobtrusive measure can provide early warning through a downward trend in her proxy ratings. Surveys cannot do this.

In conclusion, we show that proxy ratings and the derived proportion metrics are interpretable and insightful, and can serve as reasonably good alternatives to surveys. Despite much advancement in firms' online presence, surveys have remained prevalent in spite of their well-known deficiencies. We offer a way out of this situation through a reinforcement learning based extraction of proxy ratings from readily available clickstream data. This research takes RL in a new direction, toward better understanding interactions in customer data, and brings RL a step closer to realizing its potential in the online firm-customer interaction domain.

References

[Hoban and Bucklin 2015] Hoban, P. R., and Bucklin, R. E. 2015. Effects of internet display advertising in the purchase funnel: Model-based insights from a randomized field experiment. Journal of Marketing Research.

[Kim et al. 2014] Kim, Y.; Hassan, A.; White, R. W.; and Zitouni, I. 2014. Modeling dwell time to predict click-level satisfaction. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM '14, 193–202. New York, NY, USA: ACM.

[Korpusik et al. 2016] Korpusik, M., et al. 2016. Recurrent neural networks for customer purchase prediction on twitter. In ACM Conference on Recommender Systems.

[Krosnick and Presser 2009] Krosnick, J. A., and Presser, S. 2009. Question and questionnaire design.

[Lang and Rettenmeier 2017] Lang, T., and Rettenmeier, M. 2017. Understanding consumer behavior with recurrent neural networks. In International Workshop on Machine Learning Methods for Recommender Systems.

[Lemon and Verhoef 2016] Lemon, K. N., and Verhoef, P. C. 2016. Understanding customer experience throughout the customer journey. Journal of Marketing.

[Liu et al. 2017] Liu, Z.; Wang, Y.; Dontcheva, M.; Hoffman, M.; Walker, S.; and Wilson, A. 2017. Patterns and sequences: Interactive exploration of clickstreams to understand common visitor paths. IEEE Transactions on Visualization and Computer Graphics.

[Parasuraman, Zeithaml, and Berry 1985] Parasuraman, A.; Zeithaml, V. A.; and Berry, L. L. 1985. A conceptual model of service quality and its implications for future research. Journal of Marketing.

[Sutton and Barto 2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 2nd edition.

[Sutton 1988] Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning.

[Yu, Mahmood, and Sutton 2017] Yu, H.; Mahmood, A. R.; and Sutton, R. S. 2017. On generalized bellman equations and temporal-difference learning. In Canadian Conference on Artificial Intelligence, 3–14. Springer.

[Zhang et al. 2014] Zhang, Y.; Dai, H.; Xu, C.; Feng, J.; Wang, T.; Bian, J.; Wang, B.; and Liu, T.-Y. 2014. Sequential click prediction for sponsored search with recurrent neural networks. In AAAI, 1369–1375.

[Zhang, Brown, and Shankar 2016] Zhang, X.; Brown, H.-F.; and Shankar, A. 2016. Data-driven personas: Constructing archetypal users with clickstreams and user telemetry.