Novelty and Primacy: A Long-Term Estimator for Online Experiments
Soheil Sadeghi, Somit Gupta, Stefan Gramatovici, Jiannan Lu, Hao Ai, Ruhan Zhang
Microsoft
February 26, 2021

ABSTRACT

Online experiments are the gold standard for evaluating impact on user experience and accelerating innovation in software. However, since experiments are typically limited in duration, observed treatment effects are not always permanently stable, sometimes revealing increasing or decreasing patterns over time. There are multiple causes for a treatment effect to change over time. In this paper, we focus on a particular cause, user-learning, which is primarily associated with novelty or primacy. Novelty describes the desire to use new technology that tends to diminish over time. Primacy describes the growing engagement with technology as a result of adoption of the innovation. User-learning estimation is critical because it holds experimentation responsible for trustworthiness, empowers organizations to make better decisions by providing a long-term view of expected impact, and prevents user dissatisfaction. In this paper, we propose an observational approach, based on the difference-in-differences technique, to estimate user-learning at scale. We use this approach to test and estimate user-learning in many experiments at Microsoft. We compare our approach with the existing experimental method to show its benefits in terms of ease of use and higher statistical power, and to discuss its limitations in the presence of other forms of treatment interaction with time.
Keywords: A/B testing; difference-in-differences; user-learning; user experience; trustworthiness

1. INTRODUCTION

Online experiments (e.g., A/B tests) are the gold standard for evaluating impact on user experience in websites, mobile and desktop applications, services, and operating systems (Kohavi & Round 2004; Scott 2010; Tang, Agarwal, O'Brien & Meyer 2010; Scott 2015; Urban, Sreenivasan & Kannan 2016; Kohavi & Longbotham 2017; Kaufman, Pitchforth & Vermeer 2017; Li, Dmitriev, Hu, Chai, Dimov, Paddock, Li, Kirshenbaum, Niculescu & Thoresen 2019; Kohavi, Tang & Xu 2020). Tech giants such as Amazon, Facebook, and Google invest in in-house experimentation systems, while multiple start-ups like Optimizely help other companies run A/B testing. At Microsoft, the experimentation system provides A/B testing solutions to many products including Bing, Cortana, Microsoft News, Office, Skype, Windows, and Xbox, running thousands of experiments per year.

The usefulness of controlled experiments comes from their ability to establish a causal relationship between the features being tested and the changes in user response. In the simplest controlled experiment, or A/B test, users are randomly assigned to one of two variants: control (A) or treatment (B). Usually control is the existing system, and treatment is the existing system with a new feature X. User interactions with the system are recorded and metrics are computed. If the experiment was designed and executed correctly, the only thing consistently different between the two variants is the feature X. External factors such as seasonality, the impact of other feature launches, competitor moves, etc. are distributed randomly between control and treatment, and therefore do not impact the results of the experiment. Hence, any difference in metrics between the two groups must be due to the feature X. For online experiments where multiple features are tested simultaneously, more complicated designs are used to establish a causal relationship between the changes made to the product and changes in user response (Haizler & Steinberg 2020; Sadeghi, Chien & Arora 2020). This is the key reason for the widespread use of controlled experiments in evaluating the impact of new software features on user experience.

Having the right metrics is critical to successfully executing and evaluating an experiment (Deng & Shi 2016; Machmouchi & Buscher 2016). The overall evaluation criteria metric plays a key role in the experiment to make a ship/no-ship decision (Kohavi, Longbotham, Sommerfield & Henne 2009). The metric changes observed during the experiment (typically a few weeks or a few months) are not always permanently stable, sometimes revealing increasing or decreasing patterns over time. There are multiple causes for a treatment effect to change over time. In this paper, we focus on one particular cause, user-learning, which was first proposed in Thorndike's Law of Effect (Thorndike 1898). According to this law, positive and negative outcomes reinforce the behaviors that caused them. User-learning and statistical modeling first came together in the 1950s (Estes 1950; Bush & Mosteller 1951). In online experiments with a treatment effect that changes over time, user-learning is primarily associated with the novelty or primacy effect. The novelty effect describes the desire to use new technology that tends to diminish over time. On the contrary, the primacy effect describes the growing engagement with technology as a result of adoption of the innovation.
These effects have been discussed in multiple fields by many studies (Anderson & Barrios 1961; Peterson & DuCharme 1967; Jones, Rock, Shaver, Goethals & Ward 1968; Bartolomeo 1997; Tan & Ward 2000; Howard & Crompton 2003; Feddersen, Maennig, Borcherding et al. 2006; Poppenk, Köhler & Moscovitch 2010; Li 2010; Kohavi, Deng, Frasca, Longbotham, Walker & Xu 2012; Mutsuddi & Connelly 2012; Hohnhold, O'Brien & Tang 2015; Van Erkel & Thijssen 2016; Dmitriev, Frasca, Gupta, Kohavi & Vaz 2016; Belton & Sugden 2018; Chen, Liu & Xu 2019).

User-learning estimation and understanding the sustained impact of the treatment effect is critical for many reasons. First, it holds experimentation responsible for preventing overestimation or underestimation in the case of novelty or primacy. Second, it empowers organizations to make better decisions by providing them a long-term view of expected changes in the key metrics. Often, experiments show a gain in one key product metric and a loss in another. In this case, the product owners need to trade off the two metrics to make a ship decision. This can lead to a wrong decision if the sustained treatment effect is different from the observed treatment effect. Third, it ensures that the experiment is not causing user dissatisfaction even though the key metrics might have moved in the positive direction. At times, undesirable treatments that cause distraction or confusion among users may initially lead to an increase in some metrics indicative of higher engagement. For instance, in an experiment shared in Dmitriev, Gupta, Kim & Vaz (2017), there was a bug in treatment which led to users seeing a blank page. This resulted in a huge spike in the number of impressions from that page, as users tried to refresh the page multiple times to see if that would help them see any contents on the page.

To motivate this paper, let us consider an experiment from Dmitriev et al. (2017) on the Microsoft News homepage where the treatment replaced the Outlook.com button with the Mail app button on the top stripe (msn.com experiment in Figure 1). The experiment showed a 4.7% increase in overall clicks on the page, a 28% increase in the number of clicks on the button, and a 27% increase in the number of clicks on the button adjacent to the Mail app button. Ignoring any concerns about the novelty effect, this would seem like a great result.

However, a novelty effect likely exists in this experiment. Looking at each day segment, we found that the difference between the number of clicks on the Mail app (in treatment) and Outlook.com (in control) was decreasing rapidly day over day (see Figure 2). We believe that the treatment caused a lot of confusion among the users who were used to navigating to Outlook.com from the msn.com page. When the button instead started opening the Mail app, some users continued to click on the button expecting it to work like it used to. They may have also clicked on the button adjacent to the Mail app button to check if other buttons work. Over time, users learned that the button had changed and stopped clicking on it. Had this treatment been shipped to all users, it could have caused a lot of user dissatisfaction.
In fact, we shut down the experiment mid-way to avoid user dissatisfaction.

Figure 1: Screenshots of treatment with the Mail app button (left), and control with the Outlook.com button (right) in the msn.com experiment.

Figure 2: The percentage difference in the number of clicks on the Mail/Outlook.com button, each day, between treatment and control in the msn.com experiment.

In this case, we had a sound hypothesis for the cause of the novelty effect. This hypothesis fits well with many observations made above. However, we may not be that fortunate for the hundreds of other experiments, or for many other key business metrics. At Microsoft, we run experimentation at a large scale. Typically, more than 1,000 online experiments with tens (sometimes hundreds) of metrics are run at Microsoft each month (Gupta, Kohavi, Deng, Omhover & Janowski 2019). Therefore, we need methods for estimating, testing, and utilizing user-learning to estimate the long-term impact of feature changes at scale.

The methodology currently used in industry for user-learning estimation is based on an experimental approach first proposed by Hohnhold et al. (2015). This approach provides an unbiased estimate of user-learning, but adds significant operational changes to the experimentation system. It also requires a large pool of experimental units at the beginning of the experiment to be randomly divided into multiple cohorts and to be assigned to treatment in a ladder form. This approach is usually reserved for the select few experiments where the feature team suspects user-learning a priori and is willing to adopt a complex experimental design to estimate it. It is practically more effective to estimate user-learning without any changes to the experimentation system.

In this paper, we propose an observational approach, based on the well-known difference-in-differences technique (Abadie 2005; Athey & Imbens 2006; Donald & Lang 2007; Conley & Taber 2011; Dimick & Ryan 2014), to estimate user-learning at scale. We use this approach to detect user-learning in many experiments at Microsoft. Our formulation is powerful in quickly testing for the presence of user-learning even in short-duration experiments. The main advantage of our proposed methodology is that it provides a practically more effective way to estimate user-learning by eliminating the need for the experimental design setting required by Hohnhold et al. (2015). Additionally, our proposed approach provides more statistical power for testing the significance of user-learning compared to the existing approach. We further illustrate this with a simulation study. The main disadvantage of our proposed methodology is that, although it provides an unbiased estimate of the long-term treatment effect, its user-learning estimate is more susceptible to other forms of treatment interaction with time (e.g., seasonality). Practically, in controlled experiments, a treatment-seasonality interaction large enough to significantly bias user-learning estimation is rare. Further, more advanced techniques such as time series decomposition of seasonality can be used to reduce the bias in user-learning estimation.

In general, we recommend using the observational approach to test for the presence of user-learning. In the case where user-learning is significant, we usually recommend running the experiment longer to allow for the treatment effect to stabilize (Dmitriev et al. 2016).
If the user-learning is gradually changing over time, we recommend running the experiment long enough and utilizing the observational approach to construct a user-learning time series that can be extrapolated to estimate the long-term treatment effect. In cases where we suspect a strong seasonality interaction with the treatment effect, and the feature team is willing to use a larger sample size with a more complicated setting, the experimental approach can be useful.

The remainder of the paper is organized as follows. In Section 2, we first formulate the problem and discuss a natural way to visually check for the presence of user-learning. In Section 3, we review the existing experimental approach for user-learning estimation. Next, we propose a new observational approach, based on the difference-in-differences technique, and compare our methodology to the existing method in Section 4. In Section 5, we illustrate user-learning estimation using another Microsoft experiment and a simulation study. We conclude in Section 6.

2. FORMULATION

Without loss of generality, let us consider an A/B test in which $n$ experimental units (e.g., browser cookies, devices, etc.) are randomly divided into two cohorts based on the hash of the experimental unit id (Kohavi & Longbotham 2017). We assign one cohort to control and the other to treatment. Let the experiment duration consist of $k-1$ time windows (e.g., days, weeks, months, etc.). We usually use one or multiple weeks to account for day-of-the-week effects (Kohavi et al. 2020; Dmitriev et al. 2016). For a metric of interest $y$, we define $C^j$ and $T^j$ to be the sample means of $y$ for the control and treatment cohorts in time window $j$, $j = 1, \cdots, k-1$, respectively. For a given $t$, where $t = 1, \cdots, k-1$, we define the estimated treatment effect, $\hat{\tau}_t$, as follows:

$$\hat{\tau}_t = \frac{1}{t} \sum_{j \le t} (T^j - C^j). \quad (1)$$

Figure 3 visually displays the aforementioned A/B test.

Figure 3: A/B test during $k-1$ time windows.

Note that different experimental units may get exposed to the experiment at different time windows. Further, after being exposed, there may be time windows where these units do not use the product. For the experimental units that are missing for some time windows, we can take two approaches in treatment effect estimation. The first approach is to impute zero for missing values in metrics where missing can be interpreted as zero, e.g., number of clicks, time spent, and dollar amount. This approach allows us to use all experimental units in all time windows for the analysis, which benefits from high statistical power in testing effect significance. However, for situations where imputation is not feasible, in each time window we only include the experimental units for which we observe the metric. In this case, we assume that the missing value distributions are consistent between the control and treatment cohorts. Practically, this assumption holds for the majority of randomized experiments. Whether it holds depends on the metric being computed: metrics that do not include all users are more likely to be affected by sample ratio mismatch, and there may also be cases where the treatment impacts the propensity of a unit to return to the product more (or less) often, leading to a sample ratio mismatch in a time window (Fabijan, Gupchup, Gupta, Omhover, Qin, Vermeer & Dmitriev 2019).

The A/B test in Figure 3 is limited in duration (typically a few weeks or a few months), and the observed treatment effect $\hat{\tau}_t$ is not always permanently stable, sometimes revealing increasing or decreasing patterns over time. There are multiple causes for a treatment effect to change over time; user-learning produces the novelty and primacy patterns depicted in Figure 4.
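To make Equation (1) concrete, the following is a minimal sketch (ours, not code from the paper) of the running-average treatment-effect series, assuming window-level sample means are available as arrays with missing values already zero-imputed; all function and variable names are illustrative.

```python
import numpy as np

def tau_hat_series(T: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Estimated treatment-effect series per Equation (1).

    T[j - 1] and C[j - 1] hold the treatment and control sample means of
    the metric in time window j, with missing values already imputed by
    zero where the metric allows it (e.g., clicks, time spent, revenue).
    Returns tau_hat_t = (1/t) * sum_{j <= t} (T^j - C^j) for t = 1..k-1.
    """
    diffs = T - C                                            # per-window differences T^j - C^j
    return np.cumsum(diffs) / np.arange(1, len(diffs) + 1)   # running average over windows

# Toy example with a novelty-like pattern: the per-window lift shrinks,
# so the cumulative estimate tau_hat_t drifts downward.
T = np.array([5.0, 4.6, 4.4, 4.3, 4.25])
C = np.full(5, 4.0)
print(tau_hat_series(T, C))   # [1.0, 0.8, 0.667, 0.575, 0.51]
```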
Figure 4: A/B test with novelty or primacy effect: (a) novelty effect; (b) primacy effect.
For an experimental unit that gets exposed to treatment at time window $t_0 + 1$, define $\delta_{t-t_0}$ to be the user-learned effect from the $(t_0+1)$-st time window to the $t$-th time window. By this definition, $\delta_t$ is the user-learned effect from the first time window ($t_0 = 0$) to the $t$-th time window, and $\delta_1 = 0$ (no learning has yet accumulated in a unit's first window of exposure). Let us assume that $\mathrm{E}(y_t) = \mu_t$ has a linear form,

$$\mu_t = \alpha + \beta_t + (\tau + \delta_{t-t_0}) I_\tau, \quad (2)$$

where $\alpha$ is the intercept, $\beta_t$ is the $t$-th time window main effect, $\tau$ is the treatment main effect, $\delta_{t-t_0}$ is user-learning, and $I_\tau$ is an indicator which equals 1 if the metric is measured in the treatment cohort. This is a reasonable assumption: in cases where there are nonlinear effects, this model can be considered a first-order approximation of the Taylor series expansion.
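As a rough illustration of model (2) (our sketch, not the paper's code), the snippet below draws a balanced two-cohort panel in which the treated cohort carries an exponentially decaying learned effect; all names and parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_panel(n=10_000, k=8, alpha=5.0, tau=1.0, sigma=2.0):
    """Draw (units x windows) metric panels from the linear model (2).

    Every unit is exposed from the first window (t_0 = 0). The learned
    effect decays like a novelty effect and is zero in the first exposure
    window, matching the convention delta_1 = 0.
    """
    w = k - 1                                            # number of time windows
    beta = rng.normal(0.0, 0.3, size=w)                  # window main effects beta_t
    delta = -0.8 * (1 - np.exp(-np.arange(w) / 2.0))     # delta_1 = 0 at index 0
    u_t = rng.normal(0.0, 1.0, size=(n // 2, 1))         # unit random effects: induce
    u_c = rng.normal(0.0, 1.0, size=(n // 2, 1))         # within-unit correlation rho > 0
    y_treat = alpha + beta + tau + delta + u_t + rng.normal(0.0, sigma, (n // 2, w))
    y_ctrl = alpha + beta + u_c + rng.normal(0.0, sigma, (n // 2, w))
    return y_treat, y_ctrl

y_treat, y_ctrl = simulate_panel()
print(y_treat.mean(axis=0) - y_ctrl.mean(axis=0))  # per-window lift shrinking from ~1.0
```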
Detecting user-learning and understanding the long-term treatment effect is critical when making a ship decision. The most intuitive approach is to look at the time series $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_{k-1}$ and see if there exists an increasing or decreasing pattern (Chen et al. 2019). If the time series of the treatment effect is permanently stable (see Figure 5a), any $\hat{\tau}_t$, where $t = 1, \cdots, k-1$, can be viewed as the long-term impact of the feature change. If there exists an increasing or decreasing pattern and the pattern has converged at time $t < k-1$ (see Figure 5b), then $\hat{\tau}_{k-1}$ can be viewed as the long-term impact of the feature change. However, for the situations where there exists an increasing or decreasing pattern and it has not converged during the experiment, we cannot quantify the long-term effect simply by looking at the time series $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_{k-1}$. To illustrate this, we revisit the msn.com experiment in Figure 1.

Figure 5: Behavior of the treatment effect estimate during the A/B test: (a) permanently stable; (b) decreasing pattern with convergence.

Figure 6a shows the time series $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_{k-1}$ and the corresponding confidence interval over each day of the experiment for the total number of clicks. In this figure, the confidence interval on the first day overlaps with the confidence interval on almost every other day. Further, the treatment effect is significant on the first day (0.244, [0.129, 0.360]), tends to decline and become insignificant in the next few days, and becomes significant again on the last day. Therefore, there is a need for a more rigorous statistical approach to test the significance of user-learning.

One of the factors in the treatment effect fluctuation of Figure 6a is that users are exposed to the experiment on different dates. Thus, it may seem more appropriate to provide a visualization based on days of exposure. Figure 6b shows the estimated treatment effect for the total number of clicks based on exposure days. Although this graph better visually conveys the presence of novelty, it does not yet provide a statistical test for its significance. First, the confidence interval on the first exposure day overlaps with the confidence interval on almost all other exposure days. Second, the number of users with $t$ exposure days decreases as $t$ increases. This leads to wider confidence intervals for higher exposure days, which makes it harder to detect novelty. Third, the set of users who have 5 exposure days are likely the most engaged/loyal users.

Figure 6: Estimated treatment effect time series for the total number of clicks in the msn.com experiment: (a) calendar date; (b) exposure days.

The presence of any heterogeneous treatment effect that interacts with the loyalty level of users can lead to an increasing or decreasing trend which is not related to user-learning (Wang, Gupta, Lu, Mahmoudzadeh & Liu 2019). Studying sub-population estimates of the treatment effect comes with its own pitfalls: it is not representative of the entire user base (Wang et al. 2019), and it will have lower statistical power to detect a change due to the decrease in sample size (Kohavi et al. 2009). Therefore, we need a methodology to estimate user-learning and to statistically test for its significance. In the next section, we review an existing experimental approach used in industry to tackle user-learning.

3. EXISTING METHODOLOGY

In this section, we review the experimental design approach that was proposed by Hohnhold et al. (2015) for user-learning estimation in the A/B test of Figure 3 (recall that the A/B test duration consists of $k-1$ time windows). The purpose of this approach is to create a time series that provides an unbiased estimate of $\delta_t$. For simplicity, we first develop the concept in the case where the experiment duration is divided into two time windows ($k = 3$) and show why the estimated user-learning is unbiased. We then expand the concept to the more general case with $k > 3$.

3.1 Experimental Approach to User-Learning with k = 3

For developing the concept in the simplest setup where the experiment duration consists of two time windows ($k = 3$), we randomly divide the $n$ experimental units into three cohorts. We assign the first cohort to control, and the second cohort to treatment. The third cohort's assignment switches from control to treatment in the second time window. We use index $i$ to refer to the cohort, index $j$ to refer to the time window, and $T_i^j$ or $C_i^j$ to refer to the sample mean of $y$ for cohort $i$ in time window $j$ if it is assigned to treatment or control, respectively (see Figure 7). Following Equation (1), we can estimate the treatment effect time series with $\hat{\tau}_1 = T_2^1 - C_1^1$ and $\hat{\tau}_2 = \frac{1}{2}[(T_2^1 - C_1^1) + (T_2^2 - C_1^2)]$. To estimate $\delta_2$, define $\hat{\delta}_2$ as follows:

$$\hat{\delta}_2 = T_2^2 - T_3^2. \quad (3)$$

Figure 7: A/B test with three cohorts and two time windows.

Next, we show that $\hat{\delta}_2$ in Equation (3) is an unbiased estimate of $\delta_2$. Recall that Equation (2) poses a linear form on $\mu_t$, where $\mu_t = \alpha + \beta_t + (\tau + \delta_{t-t_0}) I_\tau$. Since the second cohort is assigned to treatment in the first time window ($t_0 = 0$),

$$\mathrm{E}(T_2^2) = \alpha + \beta_2 + \tau + \delta_2. \quad (4)$$

In addition, since the third cohort is assigned to treatment in the second time window ($t_0 = 1$, so its learned effect in window 2 is $\delta_1 = 0$),

$$\mathrm{E}(T_3^2) = \alpha + \beta_2 + \tau. \quad (5)$$

Therefore,

$$\mathrm{E}(\hat{\delta}_2) = \mathrm{E}(T_2^2 - T_3^2) = \delta_2. \quad (6)$$

In other words, $T_2^2$ and $T_3^2$ are similar in all respects except for the fact that the experimental units in $T_2^2$ have been exposed to treatment for a longer time period compared to those of $T_3^2$. If there is a statistically significant difference between $T_2^2$ and $T_3^2$, then we can attribute the difference to user-learning.

3.2 Experimental Approach to User-Learning with k > 3

To expand the concept to the more general case with $k > 3$, we randomly divide the $n$ experimental units into $k$ cohorts (the intent is to have as many cohorts as needed to cover all $k-1$ time windows). We assign the first cohort to control, the second cohort to treatment, and denote by $T_i^j$ or $C_i^j$ the sample mean of $y$ for cohort $i$ in time window $j$ if it is assigned to treatment or control, respectively. Following Equation (1), we can estimate the treatment effect time series with $\hat{\tau}_t = \frac{1}{t} \sum_{j \le t} (T_2^j - C_1^j)$, where $t = 1, \cdots, k-1$.
To estimate the user-learning, we switch the assignment of cohort $i \ge 3$ from control to treatment in a ladder form, with cohort $i$ switching at the beginning of time window $i - 1$ (see Figure 8). We then estimate $\delta_t$ as follows:

$$\hat{\delta}_t = T_2^t - T_{t+1}^t. \quad (7)$$

Figure 8: A/B test with $k$ cohorts and $k-1$ time windows.

Using Equation (2), a similar argument shows that $\hat{\delta}_t$ in Equation (7) is an unbiased estimate of $\delta_t$: in time window $t$, cohort 2 carries the learned effect $\delta_t$, while the newly switched cohort $t+1$ carries $\delta_1 = 0$. However, this estimator is not unique, and other calculations can also provide an unbiased estimate of $\delta_t$. For example, we can use a cross-sectional approach in the same setup to construct the estimated user-learning time series. In this approach, $\delta_t$ is estimated with

$$\hat{\delta}_t = T_{k-t+1}^{k-1} - T_k^{k-1}. \quad (8)$$

In addition to the time series $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_{k-1}$ for the treatment effect, this process also provides a time series $\hat{\delta}_2, \hat{\delta}_3, \cdots, \hat{\delta}_{k-1}$ to estimate user-learning. However, it adds significant operational changes to the experimentation system. It also requires a large pool of experimental units at the beginning of the experiment to be randomly divided into multiple cohorts and to be assigned to treatment in a ladder form. This approach is usually reserved for the select few experiments where the feature team suspects user-learning a priori and is willing to adopt a complex experimental design to estimate it.
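As an illustration (our sketch, not code from Hohnhold et al. 2015), the ladder estimator of Equation (7) can be computed from a cohort-by-window matrix of sample means; the names below are invented for the example.

```python
import numpy as np

def ladder_user_learning(means: np.ndarray) -> np.ndarray:
    """User-learning series from the ladder design, per Equation (7).

    means[i - 1, j - 1] is the sample mean of the metric for cohort i in
    time window j (k cohorts, k - 1 windows). Cohort 1 stays in control,
    cohort 2 is in treatment throughout, and cohort i >= 3 switches from
    control to treatment at the beginning of window i - 1.
    Returns delta_hat_t = T_2^t - T_{t+1}^t for t = 2, ..., k - 1.
    """
    k = means.shape[0]
    # In window t, cohort 2 has accumulated learning delta_t while cohort
    # t + 1 is newly exposed (delta_1 = 0), so the difference of the two
    # treated cohorts isolates user-learning at the same calendar time.
    return np.array([means[1, t - 1] - means[t, t - 1] for t in range(2, k)])
```

Because each $\hat{\delta}_t$ compares two treated cohorts within the same calendar window, the window main effect $\beta_t$ cancels by design.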
It is practically more effective to estimate user-learning without any changes to the experimentation system. In the next section, we propose an observational approach, based on difference-in-differences, to estimate user-learning at scale. The purpose of our approach is to eliminate the need for the aforementioned experimental design setting. We use this approach to estimate user-learning for experiments at Microsoft.

4. PROPOSED METHODOLOGY

In this section, we propose an observational approach for user-learning estimation in the A/B test of Figure 3 (recall that the A/B test duration consists of $k-1$ time windows). Our proposed approach is based on the well-known difference-in-differences (DID) technique (Abadie 2005; Athey & Imbens 2006; Donald & Lang 2007; Conley & Taber 2011; Dimick & Ryan 2014). We use DID to detect user-learning in many experiments at Microsoft. Our formulation is powerful in quickly testing for the presence of user-learning even in short-duration experiments. Similar to the prior section, we first develop the concept in the case where the experiment duration is divided into two time windows ($k = 3$) and show why the estimated user-learning is unbiased. We then expand the concept to the more general case with $k > 3$. The purpose of this approach is to estimate user-learning without the need for the experimental design setting we discussed in the prior section. Later in this section, we show that our proposed approach provides more statistical power for testing the significance of user-learning compared to the existing approach. We also discuss its drawback, and provide guidance on how to mitigate it in practice.

4.1 Observational Approach to User-Learning with k = 3

For developing the concept in the simplest setup where the experiment duration consists of two time windows ($k = 3$), we randomly divide the $n$ experimental units into two cohorts. We assign the first cohort to control, and the second cohort to treatment. Following Equation (1), we can estimate the treatment effect time series with $\hat{\tau}_1 = T^1 - C^1$ and $\hat{\tau}_2 = \frac{1}{2}[(T^1 - C^1) + (T^2 - C^2)]$. To estimate $\delta_2$, define $\hat{\delta}_2$ as follows:

$$\hat{\delta}_2 = (T^2 - T^1) - (C^2 - C^1). \quad (9)$$

Figure 9: A/B test with two cohorts and two time windows.

Next, we show that $\hat{\delta}_2$ in Equation (9) is an unbiased estimate of $\delta_2$. Recall that Equation (2) poses a linear form on $\mu_t$, where $\mu_t = \alpha + \beta_t + (\tau + \delta_{t-t_0}) I_\tau$. Since the second cohort is assigned to treatment in the first time window ($t_0 = 0$, and $\delta_1 = 0$),

$$\mathrm{E}(T^2) = \alpha + \beta_2 + \tau + \delta_2, \qquad \mathrm{E}(T^1) = \alpha + \beta_1 + \tau. \quad (10)$$

Further, since the first cohort is assigned to control in both time windows,

$$\mathrm{E}(C^2) = \alpha + \beta_2, \qquad \mathrm{E}(C^1) = \alpha + \beta_1. \quad (11)$$

Therefore,

$$\mathrm{E}(\hat{\delta}_2) = \mathrm{E}(T^2 - T^1) - \mathrm{E}(C^2 - C^1) = \delta_2. \quad (12)$$

4.2 Observational Approach to User-Learning with k > 3

Here we use the exact setup of the A/B test in Figure 3, and estimate $\delta_t$ with

$$\hat{\delta}_t = (T^t - T^1) - (C^t - C^1). \quad (13)$$

Using Equation (2), a similar argument shows that $\hat{\delta}_t$ in Equation (13) is an unbiased estimate of $\delta_t$: differencing against the first window removes the cohort-specific terms, and differencing treatment against control removes the common window effect $\beta_t - \beta_1$. In addition to the time series $\hat{\tau}_1, \hat{\tau}_2, \cdots, \hat{\tau}_{k-1}$ for the treatment effect, this approach also provides a time series $\hat{\delta}_2, \hat{\delta}_3, \cdots, \hat{\delta}_{k-1}$ to estimate user-learning. Notably, this approach provides the estimate without any changes to the experimentation system.
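For concreteness, here is a minimal sketch (ours) of the DID estimator in Equation (13) on unit-level panels with zero-imputed missing values, so that every unit appears in every window; the names are illustrative.

```python
import numpy as np

def did_user_learning(y_treat: np.ndarray, y_ctrl: np.ndarray) -> np.ndarray:
    """DID user-learning series per Equation (13).

    y_treat and y_ctrl are (units x windows) panels of the metric with
    missing values already imputed by zero. With balanced panels, the
    difference of cohort means equals the mean of within-unit changes.
    Returns delta_hat_t = (T^t - T^1) - (C^t - C^1) for t = 2, ..., k - 1.
    """
    dT = y_treat[:, 1:] - y_treat[:, [0]]   # within-unit change vs. window 1
    dC = y_ctrl[:, 1:] - y_ctrl[:, [0]]
    return dT.mean(axis=0) - dC.mean(axis=0)
```

Computing the estimate from within-unit changes also makes the variance reduction coming from the within-unit correlation $\rho$ explicit, which is what the comparison in Section 4.3 formalizes.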
4.3 Statistical Test and Long-Term Estimation

The advantage of creating a time series to estimate user-learning is that we can statistically test its significance in the face of new technology, and utilize it to estimate the long-term impact of the treatment effect. To statistically test for the significance of user-learning, under a Gaussian assumption, we can construct the $(1-\alpha)$ confidence interval $\hat{\delta}_t \pm \Phi^{-1}(1-\alpha/2) \sqrt{\widehat{\mathrm{Var}}(\hat{\delta}_t)}$, where $\Phi^{-1}$ is the normal inverse cumulative distribution function. User-learning is statistically significant if zero is outside the confidence interval. Next, we compare $\mathrm{Var}(\hat{\delta}_t)$ from the experimental and observational approaches to see which approach provides more statistical power. For the metric of interest $y$, let us assume that the variance of each experimental unit within each time window is $\sigma^2$. Let us also assume that the correlation of the metric for each experimental unit across two time windows is $\rho > 0$. To include all the experimental units in the analysis, we impute the missing values with zero (assuming the metric of interest allows for such an interpretation of missing values). For simplicity, let us also assume that the $n$ experimental units are equally divided between cohorts in both the experimental and observational approaches. For the experimental approach, since each cohort includes $n/k$ experimental units,

$$\mathrm{Var}(T_2^t) = k\sigma^2/n, \qquad \mathrm{Var}(T_{t+1}^t) = k\sigma^2/n. \quad (14)$$

Thus, the variance of $\hat{\delta}_t$ in Equation (7) equals

$$\mathrm{Var}(\hat{\delta}_t) = \mathrm{Var}(T_2^t - T_{t+1}^t) = 2k\sigma^2/n. \quad (15)$$

For the observational approach, since each difference $T^t - T^1$ or $C^t - C^1$ is calculated within the same cohort of $n/2$ experimental units,

$$\mathrm{Var}(T^t - T^1) = 2\sigma^2/n + 2\sigma^2/n - 4\rho\sigma^2/n = 4(1-\rho)\sigma^2/n,$$
$$\mathrm{Var}(C^t - C^1) = 2\sigma^2/n + 2\sigma^2/n - 4\rho\sigma^2/n = 4(1-\rho)\sigma^2/n. \quad (16)$$

Thus, the variance of $\hat{\delta}_t$ in Equation (13) equals

$$\mathrm{Var}(\hat{\delta}_t) = \mathrm{Var}\big((T^t - T^1) - (C^t - C^1)\big) = 8(1-\rho)\sigma^2/n. \quad (17)$$

Therefore, the observational approach provides more statistical power if $\rho > 1 - k/4$, which is guaranteed whenever $k > 4$ (since $\rho > 0$).

The long-term estimate of the treatment effect is the combination of the observed treatment effect during the experiment and the limit of the estimated user-learning as $t \to \infty$. Sometimes, it may take a very long time for user-learning to converge. Therefore, we can fit an exponential model to gauge how long it takes for user-learning to converge and to estimate its limit as $t \to \infty$. In Section 5, we illustrate this with a simulation study.
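The test and the power comparison above reduce to a few lines; the following sketch (ours, with illustrative names) builds the Section 4.3 confidence interval and checks the $\rho > 1 - k/4$ condition numerically.

```python
import numpy as np
from scipy import stats

def delta_ci(delta_hat: float, var_hat: float, alpha: float = 0.05):
    """(1 - alpha) confidence interval for user-learning, per Section 4.3.
    User-learning is statistically significant if 0 falls outside it."""
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var_hat)
    return delta_hat - half, delta_hat + half

def observational_more_powerful(rho: float, k: int) -> bool:
    """Compare Equation (17) with Equation (15); sigma^2/n cancels,
    leaving 8 * (1 - rho) < 2 * k, i.e., rho > 1 - k / 4."""
    return 8 * (1 - rho) < 2 * k

print(delta_ci(-0.05, 0.0004))               # (-0.089, -0.011): significant
print(observational_more_powerful(0.2, 8))   # True: k > 4 suffices when rho > 0
```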
4.4 Quick Detection of User-Learning

In this section, we present a slightly modified version of our proposed approach to user-learning. The formulation discussed so far defines a global time window for all users in estimating user-learning. This implies that the analysis is restricted to users that are assigned to cohorts at the beginning of the experiment. To include users that are assigned to the experiment after it has started, a similar formulation can be applied to user-level time windows: the user-level time window begins when the user is exposed to the experiment and ends when the user leaves the experiment. In the calculation of $\hat{\delta}$ in Equation (9), we then split each user-level time window into two halves. This formulation results in a larger sample size and higher statistical power, which leads to better inference about the population. This approach can quickly detect the significant presence of user-learning; while it is suitable for cookie-based experiments and benefits from higher statistical power, it is limited to the case with $k = 3$ and may not be easily extended to the more general case with $k > 3$.

We use this approach to detect user-learning in many experiments at Microsoft. This formulation is powerful in quickly testing for the presence of user-learning even in short-duration experiments. In the case where user-learning is significant, we usually recommend running the experiment longer to allow for the treatment effect to stabilize (Dmitriev et al. 2016). If the user-learning is gradually changing over time, we recommend running the experiment long enough and utilizing the observational approach to construct a user-learning time series that can be extrapolated to estimate the long-term treatment effect.

For illustration, we now revisit the msn.com experiment in Figure 1. Following the aforementioned formulation and the calculation of $\hat{\delta}$ in Equation (9), we are able to successfully detect the presence of the novelty effect in the total number of clicks in Figure 6 with a statistically significant $\hat{\delta} < 0$.
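A minimal sketch of this split-half test (our illustration; function and variable names are invented): each user's own exposure window is split into two halves, aggregated per user, and Equation (9) is applied to the per-user halves.

```python
import numpy as np
from scipy import stats

def quick_learning_test(t_first, t_second, c_first, c_second):
    """Split-half DID of Section 4.4 (the k = 3 case on user-level windows).

    t_first/t_second (treatment) and c_first/c_second (control) hold each
    user's metric aggregated over the first and second half of that user's
    own exposure window, so users who join the experiment late are included.
    Returns delta_hat and a two-sided p-value for H0: no user-learning.
    """
    dT = np.asarray(t_second) - np.asarray(t_first)   # within-user change, treatment
    dC = np.asarray(c_second) - np.asarray(c_first)   # within-user change, control
    delta_hat = dT.mean() - dC.mean()
    se = np.sqrt(dT.var(ddof=1) / dT.size + dC.var(ddof=1) / dC.size)
    return delta_hat, 2 * stats.norm.sf(abs(delta_hat / se))
```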
4.5 Discussion

In this section, we discuss the advantages and disadvantages of both the existing and proposed methodologies. The main advantage of our proposed methodology is that it provides a practically more effective way to estimate user-learning without any changes to the experimentation system. Further, as discussed above, this approach provides more statistical power for testing the significance of user-learning compared to the existing approach. The main disadvantage of our proposed methodology is that it is susceptible to treatment effect interaction with other external factors (e.g., seasonality) that are not related to user-learning. When there is a significant interaction between the treatment effect and an external factor (e.g., the treatment effect on weekends differs from that on weekdays), our estimates are biased representations of user-learning. In this case, although our approach provides an unbiased estimate of the long-term treatment effect, it does not distinguish between user-learning and other forms of treatment interaction with time. Practically, in controlled experiments, a treatment-seasonality interaction large enough to significantly bias user-learning estimation is rare. Further, such effects can be avoided by running the experiment for a longer period or re-running the experiment at a different time. More complicated techniques, such as time series decomposition of seasonality, cyclicality, and trend on the user-learning estimates, can also be used to reduce the bias in these situations. Note that while we assume a linear form for $\mu_t$ in Equation (2), the time series $\hat{\delta}_t$ can have any functional form. Thus, complex interactions of treatment effects with external factors can be modeled by more advanced methods, e.g., piecewise linear regression.

There are some known limitations for both methodologies. First, the estimates of user-learning are biased when experimental units are not durable for the period of analysis (e.g., cookie churn) and can be reset by users, leading to their random movement between the cohorts (Hohnhold et al. 2015; Dmitriev et al. 2016). This issue can be mitigated by restricting the analysis to experimental units that are assigned to cohorts at the beginning of the experiment (Hohnhold et al. 2015). Second, the long-term estimate of the treatment effect can be biased due to violations of the stable unit treatment value assumption (Imbens & Rubin 2015). For example, the behavior of one experimental unit may be influenced by another because of social network effects, or because the same user shows up as multiple experimental units in the experiment (e.g., cookies from different browsers or multiple devices). Third, the current feature change may interact with other feature changes of the future product or competing products, which could impact the treatment effect (Czitrom 1999; Wu & Hamada 2011). Lastly, there can be external factors that cause user behavior changes that are not captured in experiments, e.g., changes in user behavior due to COVID-19.

5. APPLICATION AND SIMULATION

In this section, we first illustrate our proposed observational approach for user-learning estimation in another real-world Microsoft experiment. We then provide a comparison with the existing experimental approach using a simulation study. The empirical examples shared in this paper are just a small selection of cases where we have observed significant user-learning.

5.1 Empirical Example

We share another example of an experiment in a Microsoft application. The feature change of this experiment impacts the first-launch experience after the application is updated. The treatment shows a special page informing users about the changes in the application, while control shows the regular page with a notification about the update in the corner. Subsequent launches of the application show the regular page for both treatment and control. The experiment runs for a little less than a month, with equally sized control and treatment cohorts. The metric of interest is page views of the regular page, where we expect a significant decrease as a result of the treatment. Table 1 includes the results of this experiment. We utilize the simplest DID setting, discussed in Section 4.4, to report $\hat{\delta}$.

Table 1: Empirical results of the Microsoft application experiment ($\hat{\tau}$ and $\hat{\delta}$ in %, with p-values; an asterisk marks statistically significant user-learning)

Time Period | $\hat{\tau}$ in % (p-value) | $\hat{\delta}$ in % (p-value)
First 3 days | negative, significant | significant *
Week 1 | negative, significant | positive, significant *
Week 2 | negative, not significant | negative, not significant
Week 3 | negative, not significant | positive, not significant

We observe a significant drop in the number of page views in the first week of the experiment. However, the magnitude of the treatment effect is decreasing rapidly over time, and by the end of the third week, it is not statistically significant. Indeed, $\hat{\delta}$ detects user-learning in the first week of the experiment, indicating that the rate of change in the treatment effect is statistically significant. By the end of the third week, neither the treatment effect nor user-learning is significant, which means the feature change did not have any long-term impact on page views.

5.2 Simulation Study

To compare the observational and experimental approaches, we conduct a simulation study where we use the data from the aforementioned Microsoft application experiment and split it randomly into treatment and control cohorts. We then use a Gaussian distribution with mean $\alpha e^{-\beta t}$ (with $\alpha = 1$ and $\beta = 1/3$) and standard deviation 2 to inject a treatment effect into the treatment cohort. We then apply both the observational and experimental approaches to estimate $\alpha$ and $\beta$, and we bootstrap to estimate the standard errors of these estimates. We run the simulation 1000 times, with a random splitting of users into different groups each time. Table 2 shows the estimated values and the standard errors. As discussed in Section 4, the standard error is higher in the experimental approach because it requires 14 cohorts for estimation, compared to the 2 cohorts required in the observational approach.

Table 2: Estimates of $\alpha$ and $\beta$ in the simulation study

Method | $\alpha$ | $\beta$
True Value | 1 | 1/3
Observational | 1.001 (std. err.: 0.049) | 0.339 (std. err.: 0.052)
Experimental | 1.013 (std. err.: 0.105) | 0.581 (std. err.: 2.107)
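As a rough sketch of this study (ours, not the authors' code; sample sizes, seeds, and function names are invented, and we fit the per-window lift rather than each approach's user-learning series), one can inject the decaying effect, fit $\alpha e^{-\beta t}$, and bootstrap the standard errors:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)

def run_once(n=20_000, windows=13, a=1.0, b=1.0 / 3.0, sigma=2.0):
    """Simulate one experiment: control is pure noise, treatment carries an
    injected effect with mean a * exp(-b * t) in window t, per Section 5.2."""
    t = np.arange(1, windows + 1)
    y_c = rng.normal(0.0, sigma, size=(n // 2, windows))
    y_t = rng.normal(a * np.exp(-b * t), sigma, size=(n // 2, windows))
    return t, y_t, y_c

def fit_decay(t, y_t, y_c):
    """Fit alpha * exp(-beta * t) to the estimated per-window lift (a
    simplification of fitting the user-learning series itself)."""
    lift = y_t.mean(axis=0) - y_c.mean(axis=0)
    (a_hat, b_hat), _ = curve_fit(lambda t, a, b: a * np.exp(-b * t),
                                  t, lift, p0=(1.0, 0.5))
    return a_hat, b_hat

t, y_t, y_c = run_once()
print(fit_decay(t, y_t, y_c))    # point estimates of (alpha, beta)

# Bootstrap standard errors by resampling units with replacement.
m = y_t.shape[0]
boots = np.array([fit_decay(t, y_t[rng.integers(0, m, m)],
                            y_c[rng.integers(0, m, m)]) for _ in range(200)])
print(boots.std(axis=0))         # bootstrap std. errors of (alpha, beta)
```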
6. CONCLUSION

Online experiments (e.g., A/B tests) are the gold standard for evaluating impact on user experience in websites, mobile and desktop applications, services, and operating systems. At Microsoft, the experimentation system supports A/B testing across many products including Bing, Cortana, Microsoft News, Office, Skype, Windows, and Xbox, running thousands of experiments per year. The metric changes observed during these experiments (typically a few weeks or a few months) are not always permanently stable, sometimes revealing increasing or decreasing patterns over time. There are multiple causes for a treatment effect to change over time. In this paper, we focus on one particular cause, user-learning, which is primarily associated with novelty or primacy. Novelty describes the desire to use new technology that tends to diminish over time. On the contrary, primacy describes the growing engagement with technology as a result of adoption of the innovation. User-learning estimation and understanding the sustained impact of the treatment effect is critical for many reasons. First, it holds experimentation responsible for preventing overestimation or underestimation in the case of novelty or primacy. Second, it empowers organizations to make better decisions by providing them a long-term view of expected changes in the key metrics. Third, it ensures that the experiment is not causing user dissatisfaction even though the key metrics might have moved in the positive direction.

In this paper, we first formulate the problem and discuss a natural way to visually check for the presence of user-learning. We then review the existing experimental approach used in industry for user-learning estimation. This approach provides an unbiased estimate of user-learning, but adds significant operational changes to the experimentation system. It also requires a large pool of experimental units at the beginning of the experiment to be randomly divided into multiple cohorts and to be assigned to treatment in a ladder form. This approach is usually reserved for the select few experiments where the feature team suspects user-learning a priori and is willing to adopt a complex experimental design to estimate it. It is practically more effective to estimate user-learning without any changes to the experimentation system.

We propose an observational approach, based on difference-in-differences, to estimate user-learning at scale. We use this approach to detect user-learning in many experiments at Microsoft. Our formulation is powerful in quickly testing for the presence of user-learning even in short-duration experiments. The main advantage of our proposed methodology is that it provides a practically more effective way to estimate user-learning by eliminating the need for the aforementioned experimental design setting. Additionally, our proposed approach provides more statistical power for testing the significance of user-learning compared to the existing approach. We further illustrate this with a simulation study. The main disadvantage of our proposed methodology is that, although it provides an unbiased estimate of the long-term treatment effect, its user-learning estimate is more susceptible to other forms of treatment interaction with time (e.g., seasonality). Practically, in controlled experiments, a treatment-seasonality interaction large enough to significantly bias user-learning estimation is rare. Further, more advanced techniques such as time series decomposition of seasonality can be used to reduce the bias in user-learning estimation.

In general, we recommend using the observational approach to test for the presence of user-learning. In the case where user-learning is significant, we usually recommend running the experiment longer to allow for the treatment effect to stabilize. If the user-learning is gradually changing over time, we recommend running the experiment long enough and utilizing the observational approach to construct a user-learning time series that can be extrapolated to estimate the long-term treatment effect. In cases where we suspect a strong seasonality interaction with the treatment effect, and the feature team is willing to use a larger sample size with a more complicated setting, the experimental approach can be useful.

ACKNOWLEDGEMENTS

We want to acknowledge our colleagues within Microsoft who have reviewed our work and given valuable feedback. We also want to thank our colleagues in the Microsoft Experimentation Platform team, the Windows Experimentation team, and the Microsoft News team for supporting our work.

REFERENCES
Abadie, A. (2005), "Semiparametric Difference-in-Differences Estimators," The Review of Economic Studies, 72(1), 1–19.

Anderson, N. H., & Barrios, A. A. (1961), "Primacy Effects in Personality Impression Formation," The Journal of Abnormal and Social Psychology, 63(2), 346.

Athey, S., & Imbens, G. W. (2006), "Identification and Inference in Nonlinear Difference-in-Differences Models," Econometrica, 74(2), 431–497.

Bartolomeo, P. (1997), "The Novelty Effect in Recovered Hemineglect," Cortex, 33(2), 323–333.

Belton, C. A., & Sugden, R. (2018), "Attention and Novelty: An Experimental Investigation of Order Effects in Multiple Valuation Tasks," Journal of Economic Psychology, 67, 103–115.

Bush, R. R., & Mosteller, F. (1951), "A Mathematical Model for Simple Learning," Psychological Review, 58(5), 313.

Chen, N., Liu, M., & Xu, Y. (2019), "How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments," in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 501–509.

Conley, T. G., & Taber, C. R. (2011), "Inference with Difference-in-Differences with a Small Number of Policy Changes," The Review of Economics and Statistics, 93(1), 113–125.

Czitrom, V. (1999), "One-Factor-at-a-Time versus Designed Experiments," The American Statistician, 53(2), 126–131.

Deng, A., & Shi, X. (2016), "Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 77–86.

Dimick, J. B., & Ryan, A. M. (2014), "Methods for Evaluating Changes in Health Care Policy: The Difference-in-Differences Approach," JAMA, 312(22), 2401–2402.

Dmitriev, P., Frasca, B., Gupta, S., Kohavi, R., & Vaz, G. (2016), "Pitfalls of Long-Term Online Controlled Experiments," in 2016 IEEE International Conference on Big Data (Big Data), IEEE, pp. 1367–1376.

Dmitriev, P., Gupta, S., Kim, D. W., & Vaz, G. (2017), "A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1427–1436.

Donald, S. G., & Lang, K. (2007), "Inference with Difference-in-Differences and Other Panel Data," The Review of Economics and Statistics, 89(2), 221–233.

Estes, W. K. (1950), "Toward a Statistical Theory of Learning," Psychological Review, 57(2), 94.

Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L., & Dmitriev, P. (2019), "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2156–2164.

Feddersen, A., Maennig, W., Borcherding, M., et al. (2006), "The Novelty Effect of New Soccer Stadia: The Case of Germany," International Journal of Sport Finance, 1(3), 174–188.

Gupta, S., Kohavi, R., Deng, A., Omhover, J., & Janowski, P. (2019), "A/B Testing at Scale: Accelerating Software Innovation," in Companion Proceedings of The 2019 World Wide Web Conference, pp. 1299–1300.

Haizler, T., & Steinberg, D. M. (2020), "Factorial Designs for Online Experiments," Technometrics, pp. 1–12.

Hohnhold, H., O'Brien, D., & Tang, D. (2015), "Focusing on the Long-Term: It's Good for Users and Business," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1849–1858.

Howard, D. R., & Crompton, J. L. (2003), "An Empirical Review of the Stadium Novelty Effect," Sport Marketing Quarterly, 12(2).

Imbens, G. W., & Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge: Cambridge University Press.

Jones, E. E., Rock, L., Shaver, K. G., Goethals, G. R., & Ward, L. M. (1968), "Pattern of Performance and Ability Attribution: An Unexpected Primacy Effect," Journal of Personality and Social Psychology, 10(4), 317.

Kaufman, R. L., Pitchforth, J., & Vermeer, L. (2017), "Democratizing Online Controlled Experiments at Booking.com," arXiv preprint arXiv:1710.08217.

Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T., & Xu, Y. (2012), "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 786–794.

Kohavi, R., & Longbotham, R. (2017), "Online Controlled Experiments and A/B Testing," Encyclopedia of Machine Learning and Data Mining, 7(8), 922–929.

Kohavi, R., Longbotham, R., Sommerfield, D., & Henne, R. M. (2009), "Controlled Experiments on the Web: Survey and Practical Guide," Data Mining and Knowledge Discovery, 18(1), 140–181.

Kohavi, R., & Round, M. (2004), "Emetrics Summit 2004: Front Line Internet Analytics at Amazon.com," Amazon.com, Copyright.

Kohavi, R., Tang, D., & Xu, Y. (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge: Cambridge University Press.

Li, C. (2010), "Primacy Effect or Recency Effect? A Long-Term Memory Test of Super Bowl Commercials," Journal of Consumer Behaviour: An International Research Review, 9(1), 32–44.

Li, P. L., Dmitriev, P., Hu, H. M., Chai, X., Dimov, Z., Paddock, B., Li, Y., Kirshenbaum, A., Niculescu, I., & Thoresen, T. (2019), "Experimentation in the Operating System: The Windows Experimentation Platform," in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), IEEE, pp. 21–30.

Machmouchi, W., & Buscher, G. (2016), "Principles for the Design of Online A/B Metrics," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 589–590.

Mutsuddi, A. U., & Connelly, K. (2012), "Text Messages for Encouraging Physical Activity: Are They Effective After the Novelty Effect Wears Off?," IEEE, pp. 33–40.

Peterson, C. R., & DuCharme, W. M. (1967), "A Primacy Effect in Subjective Probability Revision," Journal of Experimental Psychology, 73(1), 61.

Poppenk, J., Köhler, S., & Moscovitch, M. (2010), "Revisiting the Novelty Effect: When Familiarity, Not Novelty, Enhances Memory," Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(5), 1321.

Sadeghi, S., Chien, P., & Arora, N. (2020), "Sliced Designs for Multi-Platform Online Experiments," Technometrics, 62(3), 387–402.

Scott, S. L. (2010), "A Modern Bayesian Look at the Multi-Armed Bandit," Applied Stochastic Models in Business and Industry, 26(6), 639–658.

Scott, S. L. (2015), "Multi-Armed Bandit Experiments in the Online Service Economy," Applied Stochastic Models in Business and Industry, 31(1), 37–45.

Tan, L., & Ward, G. (2000), "A Recency-Based Account of the Primacy Effect in Free Recall," Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(6), 1589.

Tang, D., Agarwal, A., O'Brien, D., & Meyer, M. (2010), "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 17–26.

Thorndike, E. L. (1898), "Animal Intelligence: An Experimental Study of the Associative Processes in Animals," The Psychological Review: Monograph Supplements, 2(4), i.

Urban, S., Sreenivasan, R., & Kannan, V. (2016), "It Is All A/Bout Testing: The Netflix Experimentation Platform," URL: https://medium.com/netflix-techblog/its-all-a-bout-testing-the-netflixexperimentation-platform-4e1ca458c15.

Van Erkel, P. F., & Thijssen, P. (2016), "The First One Wins: Distilling the Primacy Effect," Electoral Studies, 44, 245–254.

Wang, Y., Gupta, S., Lu, J., Mahmoudzadeh, A., & Liu, S. (2019), "On Heavy-User Bias in A/B Testing," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2425–2428.

Wu, C. J., & Hamada, M. S. (2011), Experiments: Planning, Analysis, and Optimization, Hoboken: John Wiley & Sons.