A Time To Event Framework For Multi-touch Attribution
Dinah Shender∗, Ali Nasiri Amini∗, Xinlong Bao∗, Mert Dikmen∗, Amy Richardson†, Jing Wang∗

∗ Google
† At Google when this work was done

Abstract
Multi-touch attribution (MTA) estimates the relative contributions of the multiple ads a user may see prior to any observed conversions. Increasingly, advertisers also want to base budget and bidding decisions on these attributions, spending more on ads that drive more conversions. We describe two requirements for an MTA system to be suitable for this application: First, it must be able to handle continuously updated and incomplete data. Second, it must be sufficiently flexible to capture that an ad's effect will change over time. We describe an MTA system, consisting of a model for user conversion behavior and a credit assignment algorithm, that satisfies these requirements. Our model for user conversion behavior treats conversions as occurrences in an inhomogeneous Poisson process, while our attribution algorithm is based on iteratively removing the last ad in the path.
1 Introduction

One of the promises of online advertising has been the ability to tie together ad views or clicks with actual outcomes (e.g. purchases, website visits, etc.), also known as conversions. This gives advertisers insight into both the effectiveness of their ads overall, and the relative effectiveness of different types of ads. Multi-touch attribution (MTA) achieves this second goal by estimating the relative contributions of the multiple ads a user may see prior to any observed conversions. With the right data, MTA can quantify which ads contributed to which types of conversions.

Increasingly, advertisers also want to base budget and bidding decisions on these attributions, increasing spend on ads that drive more conversions. Bidding models are constantly updated with the latest data, in order to respond to changes in ad effectiveness, business fluctuations, etc. Incorporating MTA results into a bidding system requires an MTA system that is also capable of ingesting continuously updated data. This data is by its nature incomplete: we don't know how many users who saw ads yesterday will ultimately convert, only how many have converted so far. Therefore our first requirement is that our MTA system be capable of handling incomplete data.

A second requirement is that an MTA system be sufficiently flexible to capture that an ad's immediate effect on a user is likely to be different than its effect several weeks later. In practice, this tends to mean that conversions occurring soon after an ad are more likely to have been caused by the ad (and therefore the ad tends to receive more credit) than conversions occurring long after the ad.

To handle both requirements, we propose an MTA system that models how ads affect a user's conversion rate over time. Ad credit can then be distributed according to how ads change the estimated conversion rate.
We propose a modeling approach that treats conversions as occurrences in an inhomogeneous Poisson process, similar to many survival models used in epidemiology and biostatistics, also known as a time-to-event model. In the remainder of this paper, we describe our approach in detail, as well as how to use the resulting model for attribution. The time-to-event methodology inspires the name TEDDA (Time to Event Data Driven Attribution) for our modelling approach.

Our model can be used with either observational data or data from randomized experiments set up to measure the causal (or incremental) effects of ads. Both types of data can be useful, depending on the goals and context. Observational data can be interpreted in terms of the correlation between showing ads and conversions, while data from randomized experiments allows for a causal estimate of the number of conversions caused by ads. The latter is the gold standard, but these experiments are often difficult or impossible to run, and may not be available on an ongoing basis. In these cases, advertisers may prefer to do MTA on correlational data over not having any sort of MTA, particularly if they have evidence, perhaps from previous randomized experiments, that the relative credit assigned to the ad types of interest is unchanged after accounting for incrementality, even if the absolute credit differs. These are choices that the modeller must make based on their knowledge of the application area, media types, and brands involved. Rather than discussing the pros and cons of observational and experimental data, this paper will focus on the overall system that can be used to model either type of data.

The remainder of this paper is organized as follows: In Section 2 we discuss previous work in this area. Section 3 goes into further detail about the key issues an MTA system must solve. In Section 4 we present our system, describing both the model for conversion occurrences and the attribution credit assignment methodology.
Section 5 discusses how to evaluate the quality and accuracy of the system. Section 6 discusses possible future improvements to this system.

2 Previous Work

Traditionally marketers have evaluated their advertising results using Media Mix Modeling (MMM), which typically fits time series models to a few years of aggregated conversion data in order to compare broad classes of ads (e.g. [5, 8]). While this can help in allocating overall marketing spend between channels, it cannot provide the kind of granular cross- and within-channel insight into specific types of ads that MTA obtains by leveraging individual user paths.

This more recent line of work on modelling individual user paths can be divided into models that rely on the sequence of user events and those that incorporate the timestamps of these events. The former category includes Markov chain methods such as [10, 1], which treat both conversions and each ad channel as states in a kth-order Markov chain and estimate the probability of a user moving from one state to the next. Attribution credit in these models is based on the change in conversion probabilities when a channel is removed from the chain. [4] and [11] are similar in depending only on the sequence of ad events, but they each fit logistic regression models for a binary conversion outcome. They then use Shapley values to distribute attribution credit, using the probability of a conversion as the value function in the Shapley algorithm.

There are also several previous papers falling into the latter category, i.e. models that incorporate event timestamps to fit continuous time models for user conversion activity. This includes [6], which fits a recurrent neural network with the entire path as an input to estimate the conversion probability each day. This is then used as input to the Shapley value algorithm to distribute attribution credit. [9] uses exponentially decaying ad effects to build a model to estimate the incremental effect of ads using experimental data.
They use an attribution methodology similar to the one we will describe to then bid optimally based on that model. Similar to our model, [13] and [12] treat conversions as occurrences in a Poisson process. Both assume an exponentially decaying ad effect. [13] considers a more restricted setting where users can convert at most once, similar to traditional survival analysis, and then fits an additive model that assumes there are no interactions between ads. [12] models both conversion events and ad events in different channels as a set of mutually exciting Poisson processes, allowing ad events to affect not just the conversion probability, but also the probability of future ads. However, they consider only the aggregate conversion credit over the data set, not the credit per ad.
3 Requirements

MTA provides critical insights about the relative value of ads, which should affect an advertiser's willingness to spend on different types of ads. In the context of digital advertising, where advertisers compete in an auction to determine whose ad is shown, the fastest way to incorporate this information is to allow it to affect the advertiser's bid. In order to do this, an advertiser must have access to an MTA system with nearly real-time (e.g. daily) updates. This allows the advertiser to bid in accordance with the attribution results without sacrificing their responsiveness to business fluctuations, changes in their ads' effectiveness, or other novel trends.

We propose two key requirements that this type of attribution system should satisfy, both related to how the system should consider the effects of time.
Requirement 1: The system must be able to handle incomplete or censored data.

Real-time user path data is fundamentally incomplete: If an ad was shown this morning at 10am, and it's now 6pm and there was no conversion, that does not mean that this ad resulted in 0 conversions. Rather, it resulted in 0 conversions over the first 8 hours. We do not know what will happen over the next day or week. Instead of treating this observation as a 0 or negative response, we should treat this observation as being right-censored.

To illustrate the importance of this principle, consider the toy example in Figure 1. If our attribution system treats the data as being complete, so that the conversion outcome is binary for each user, with no censoring, and we run it now, it will likely find that having ad type 2 does not lead to additional conversions: paths with and without ad type 2 convert at equal rates.¹ However, in four hours, it would find that ad type 2 drives 50% more conversions. Such strongly conflicting results can cause unstable and problematic behavior when used in bidding algorithms.

¹Note that this is not a causal claim about ad type 2 causing the conversions, but rather about the associations between ads and conversions. With observational data, we cannot know if ad type 2 causes more conversions or if users who see ad type 2 are simply more likely to convert anyways. If instead this is experimental data where users 2 and 4 didn't see ad type 2 because it was ablated or withheld, then we could turn this into a causal claim. The implications for our model are similar.

Figure 1: An example of the effects of incomplete or censored data. (Four user paths of Ad Type 1 and Ad Type 2 events, with some conversions observed by "Now" and others only "4 hours later".)

To avoid this, our system needs to recognize that after 4 hours, we not only have new values for the response, but we also have more overall information about the rate of conversions because we have observed the user's post-ad paths for longer.
In other words, we need to recognize that we have incomplete observations due to censoring.

This leads us to consider modeling the number of conversions over time, which would naturally capture that our estimates are more uncertain "now" than in 4 hours. The next requirement leads us further in this direction.
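To make the censoring point concrete, here is a minimal numerical sketch (ours, not part of the paper's system): under a constant-rate Poisson model, the likelihood of a "no conversion yet" observation depends on how long the user has been watched, whereas a binary label throws that information away. The rate value below is purely illustrative.

```python
import math

def p_no_conversion(rate_per_hour: float, observed_hours: float) -> float:
    """P(0 conversions in [0, T]) under a constant-rate Poisson process.

    Treating the path as right-censored at `observed_hours` means its
    likelihood contribution is exp(-rate * T), which weakens as the
    observation window grows -- unlike a fixed binary "no conversion" label.
    """
    return math.exp(-rate_per_hour * observed_hours)

# The same "no conversion yet" path is far weaker evidence of a low
# conversion rate after 8 hours than after a week (168 hours):
print(p_no_conversion(0.01, 8))    # ~0.92
print(p_no_conversion(0.01, 168))  # ~0.19
```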
Requirement 2: An ad's contribution to any potential conversion should be allowed to differ depending on the time between the ad exposure and the conversion, typically decreasing as the time between the ad exposure and the conversion increases. Similarly, ad credit shouldn't just depend on the order of the ads in the path.

Many advertisers already recognize this in an informal way, setting "lookback windows" of N = 7 or 30 days and only distributing conversion credit to ads in the N days before the conversion. However, an ad's influence likely decays more continuously, so that even within the lookback window, ads further in time from the conversion deserve less credit than those close to it, if all else is equal. While this implies that the order in which ads occur should affect the relative conversion credit that ads receive, the order is not sufficient: if two identical ads occur within hours of each other, we'd expect them to receive similar levels of credit for any subsequent conversions, whereas if they are separated by 30 days, we'd expect the later ad to receive more credit. How fast or steeply this decay occurs will depend on the individual advertiser and the type of ad, and should be learned from the data, but our model should be flexible enough to treat ads differently based on not just their order, but when they occurred.

We illustrate this with another toy example in Figure 2. To distinguish this issue from Requirement 1, all conversions in this example are observed, each occurring after the ads are shown.²

Figure 2: An example of the importance of considering event timestamps. (Four user paths of Ad Type 1 and Ad Type 2 events and their conversions.)
This is a simplified example, but it nevertheless illustrates that a model that looks only at the order of the ads and the binary conversion label can reach a very different conclusion compared to a model that takes into account the times at which ads and conversions occur.

Together these two requirements imply that we should model not just the sequence of ad events and binary (or even integer) conversion outcomes for each path, but rather the times when conversions occur given when ads occur. A natural option for this type of modeling is survival analysis. In this framework, conversions are viewed as occurrences (or failures) of a Poisson process. The conversion intensity function, also known as the instantaneous occurrence rate or the hazard rate or just the intensity function, is allowed to vary with time, and in our case depends on the times at which a user has previously seen ads. Attribution credit is then based on the effect of an ad on a user's instantaneous conversion rate at the time of the conversion.

Survival analysis techniques, including variations allowing for multiple occurrences, are widely used, particularly in biostatistics (e.g. [3]). Indeed, in the case where a user can convert at most once, our model reduces to the classic Cox proportional hazards model with time-varying covariates [2]. Similar applications of survival analysis to multi-touch ad attribution have been considered by [13, 12] among others. In the next section we detail our model for conversions and how we use it for attribution.

4 Our Attribution System

Our attribution system has two parts: a model for users' conversion behavior, and an attribution credit assignment algorithm that assigns credit in accordance with this model. We will focus the bulk of our exposition on the former. Given a model for user conversion behavior, there are many reasonable credit assignment algorithms. While we will discuss options and the algorithm that we use, the best choice depends on the goals of the system.
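As a toy illustration of the counting-process view above (our sketch, not part of the paper's system), conversion times can be simulated from an arbitrary time-varying intensity λ(t) by thinning: propose candidates at a constant rate that upper-bounds λ(t), and keep each with probability λ(t) divided by that bound. The example intensity and rates are made up.

```python
import random

def simulate_conversions(intensity, t_max, lam_max, seed=0):
    """Sample event times on [0, t_max] from an inhomogeneous Poisson
    process with rate `intensity(t)`, via thinning. `lam_max` must be
    an upper bound on intensity(t) over the whole interval."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam_max)  # next candidate event
        if t > t_max:
            return events
        if rng.random() * lam_max < intensity(t):
            events.append(t)  # kept with prob intensity(t) / lam_max

# Example: baseline rate 0.02/hour, doubled for 24h after an ad at t = 100.
ad_time = 100.0
rate = lambda t: 0.02 * (2.0 if ad_time < t <= ad_time + 24 else 1.0)
conversions = simulate_conversions(rate, t_max=1000.0, lam_max=0.04)
```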
²As with the first example, these claims are all correlational. If instead Ad Type 2 were held back or ablated for users 2 and 4, then these claims would become causal, but as before, the implications for our model are similar.

4.1 Modeling User Conversion Behavior

As discussed in the previous section, in order to capture how ad effects vary over time, as well as handle incomplete data, we will model conversions as a realization of a Poisson counting process with a time-varying intensity function, λ(t). In particular, if we define Y_i(t) as the number of occurrences (conversions) for user i up until time t, then

Y_i(t) − Y_i(s) ∼ Poisson( ∫_s^t λ(u) du )    (1)

for any 0 ≤ s ≤ t. We will use a log-linear model for the intensity, and allow it to depend on user features (e.g. the country the user is located in), the time since previous ads were seen, as well as other ad features (e.g., format of the ad). We will start with an overly simplified model for λ(t) and gradually add these complexities. This model formulation treats data from randomized experiments in the same way as observational data, but with an additional feature for treatment group assignment. We will discuss specifics in a section towards the end. We will end the conversion modeling section with a brief discussion on options for estimating this model.

4.1.1 A Single Ad

To start with, consider a model that has no user features, no ad features (other than time), and where we assume that each user sees at most one ad. This is an overly simplified view of the world, particularly the assumption of just a single ad per user, but it is useful for exposition. Suppose that the single ad occurs at t_1; in principle this should also be indexed by user, but we will drop the user subscripts for brevity.
In this set-up, our model becomes

log(λ(t)) = α + f(t − t_1)    (2)

Here α represents the log of the conversion rate before ads are shown, while f(t − t_1) represents the effect of an ad on the user's conversion rate. A user's conversion intensity over time might then look as in Figure 3. The baseline rate before the ad is exp(α), while after the ad the intensity is exp(α) · exp(f(t − t_1)). Since this model is log-linear, ads have a multiplicative effect on users' conversion rates.

In general, f is a function of time that we would like to estimate. We restrict ourselves to functions such that f(x) = 0 whenever x ≤ 0, i.e. there is no ad effect before the ad is actually seen. There are many ways to parameterize f in order to do estimation; we will review a few concrete options, but many other parameterizations are also possible.

One option is to treat f as a continuous function. Since we often expect the ad effect to decay rapidly, modeling f as a mixture of exponentials is a natural choice. We can consider a basis of exponential decay functions {exp(−θ_l t)}_l, where the choices of θ_l determine the span of the basis and the speed of the decay. Then we can parameterize the intensity as

log(λ(t)) = α + Σ_{l=1}^{L} β_l exp(−θ_l (t − t_1))    (3)

where β_l are parameters to be estimated. The above equation is consistent with a proportional hazards model and essentially implies a doubly exponential decay for λ(t).

Figure 3: Conversion intensity over time for a user with a single ad event.

[9] and [13] proceed in a similar fashion, but set the right-hand side of the above equation equal to λ(t) rather than log(λ(t)), essentially assuming an additive hazards model.

Another option for estimating f as a continuous function is to use splines. Letting b_1, ..., b_L be the functions in the spline basis, the model becomes

log(λ(t)) = α + Σ_{l=1}^{L} β_l b_l(t − t_1)    (4)

We can then optimize for α and β_1, ..., β_L, possibly with a regularization constraint or prior on the β_l values. Splines are commonly used for approximating continuous functions. However, if the true intensity does decay very rapidly, then for a fixed basis size, an exponential function basis might perform better than a spline basis. We do not make a specific recommendation here; rather, this is an area for future study.

A third option might be to approximate f as a step function, essentially estimating a separate ad effect for each step. For example, we might choose to estimate the jump in conversions in the first 24 hours, the next 24 hours, and the subsequent 28 days.
Then if t is measured in hours, and using I to denote indicator functions, the log-intensity becomes:

log(λ(t)) = α + β_1 I{0 < t − t_1 ≤ 24} + β_2 I{24 < t − t_1 ≤ 48} + β_3 I{48 < t − t_1 ≤ 48 + 24·28}    (5)

Again, we can now optimize for α, β_1, β_2, and β_3, perhaps placing a prior on the coefficients or applying other types of regularization. As we add additional features to the model or if we wish to pool data across advertisers, treating these coefficients as random effects may also be attractive.

4.1.2 Ad Features

Staying for now in the single ad setting, we can also consider how to add ad features, such as the format of the ad, whether it was shown on a mobile device, and so on, to the model. For example, perhaps certain ad formats or campaigns are associated with a larger change in conversions than others. To capture this, we want to allow the ad effect to vary depending on these features. One way to do this with K total features is to write:

log(λ(t)) = α + f(t − t_1) + Σ_{k=1}^{K} g_k(t − t_1, x_k)    (6)

Here k indexes the features in our model and x_k is the value of the kth feature for the first ad. Like f, g_k is a function of time to be estimated. One option is to constrain g_k to be constant over time, so that changing the feature value simply shifts the intercept for f. Conceptually, this corresponds to a change in an ad's initial effect, but not its decay rate. In the example above where f is a linear combination of spline basis functions, we could take g_k to be a linear combination of lower-order spline basis functions. This would change both the ad's initial effect and the effect's decay rate. While we would generally expect to choose g_k to be simpler than f, even this is not required.

As written, if for each level of each feature k we allow a non-zero value for g_k, the model is overparameterized. This can be handled in the usual ways, e.g. setting g_k = 0 for some reference level and interpreting f as the ad effect for when all ad features are at their reference level, or by adding a constraint so that the g_k average to 0 for each k and interpreting f as the average ad effect. Alternatively, if we are applying regularization when we fit the model, then that may be enough for all the parameters to be well-defined. Our model is flexible with respect to these choices and so we leave the notation general.

4.1.3 Multiple Ads

The motivation for MTA is the case where users see multiple ads, making the models so far of expository, rather than practical, interest. Suppose now that a user sees multiple ads at times t_1, ..., t_J. As a starting point, we consider the following model:

log(λ(t)) = α + Σ_j f(t − t_j) + Σ_{j,k} g_k(t − t_j, x_{jk})    (7)

where x_{jk} is the feature value for the kth feature for the jth ad. At a high level, this model says that ads with identical features have the same decay curve, but ads with different feature values may still have different decay curves. This is illustrated in Figure 4, which shows what an intensity curve for a user who sees three ads, each with different feature values, might look like.

While this is not easily illustrated in a figure, notice that our model formulation also encodes that the effects of multiple ads are multiplicative (and therefore additive on a log-scale). This does not entirely match with our intuition: if showing a user an ad doubles their instantaneous conversion rate in the short-term, we wouldn't expect showing them 50 ads within 10 minutes (perhaps with multiple ads per site) to increase their conversion rate by a factor of 2^50. Simultaneously, it's intuitive that there might be a synergy between ads, such that seeing ad A might increase conversions by 2x, seeing ad B might increase conversions by 1.5x, but that seeing both ads increases conversions by 5x rather than 3x.
In other words, interactions between ads are possible and our model needs to be flexible enough for the user to fit interaction effects, should they choose to.

Figure 4: Conversion intensity over time for a user with multiple ad events.

One way to accomplish this is to generalize our notation slightly and allow x_{jk} to depend not only on ad j, but also on ads j′ < j, as well as the subscript value j itself. In other words, the features can depend not only on the current ad, but on previous ads and the ad index. This allows us to add an effect for being the jth ad. For example, assuming for simplicity that we have no other ad features, we could define a single ad feature g(t − t_j, j) = f_j(t − t_j) − f(t − t_j), where f_j is the effect function for the jth ad and f is the "default" or average ad effect function. Then our model formulation is equivalent to letting

log(λ(t)) = α + Σ_j f_j(t − t_j)    (8)

i.e. a model where the ad effect differs for each ad.

Allowing the feature vector for an ad, x_j = (x_{j1}, ..., x_{jK}), to depend on previous ads also allows us to encode interactions between different ad formats or the timing of different ads. For example, we could add a feature I(t_j − t_{j′} < ∆ for some j′ < j) as an indicator for whether there was a preceding ad in the previous ∆ hours. This can be desirable if we only want to consider ad interactions if the two ads are within ∆ hours of each other. Assuming for notational simplicity that this is the only feature we model, then we have

log(λ(t)) = α + Σ_j f(t − t_j) + Σ_j g(t − t_j) I(t_j − t_{j′} < ∆ for some j′ < j)    (9)

where, without loss of generality, we have set I(t_j − t_{j′} < ∆ for some j′ < j) = 0 as the "default" or reference level.
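A small sketch of how an Eq. (9)-style log-intensity might be evaluated (our illustration only: the decay curve f, interaction term g, window ∆, and parameter values are placeholders, not fitted quantities):

```python
import math

DELTA = 24.0  # interaction window in hours (illustrative)

# Placeholder effect curves: an exponentially decaying ad effect f, and a
# negative interaction g for ads closely preceded by another ad.
f = lambda dt: 0.7 * math.exp(-dt / 48.0)
g = lambda dt: -0.3 * math.exp(-dt / 48.0)

def log_intensity(t, ad_times, alpha=-4.0):
    """log lambda(t) = alpha + sum_j f(t - t_j)
                             + sum_j g(t - t_j) I(t_j - t_j' < DELTA for some j' < j).
    `ad_times` is assumed sorted in increasing order."""
    out = alpha
    for j, tj in enumerate(ad_times):
        if t <= tj:
            continue  # no effect before the ad is seen
        out += f(t - tj)
        if any(0.0 <= tj - tp < DELTA for tp in ad_times[:j]):
            out += g(t - tj)  # this ad was preceded by another within DELTA
    return out

# Two ads 6 hours apart trigger the interaction term; the same two ads
# 10 days apart do not.
close = log_intensity(t=30.0, ad_times=[0.0, 6.0])
far = log_intensity(t=250.0, ad_times=[0.0, 240.0])
```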
The complexity of the model and any included interactions are necessarily application-dependent. In the case of an advertiser with a large amount of data and many ads shown per user, we may be able to make detailed estimates of the marginal effect of successive ads, as well as their interactions. On the other hand, when data is scarcer, it may be more practical to assume all ads have the same effect. The key point here is that this framework is flexible enough to allow for many different specifications of the factors that affect the conversion rate.

4.1.4 User Features

User features that affect conversion rates can be handled in one of two ways. If the feature only changes the user's overall conversion rate, but not how they react to ads, then we can treat it as a shift in α. Taking as an example the case of a model with a single user feature, a user's bucketized age, we write

log(λ(t)) = α + α_{age bucket} + Σ_j f(t − t_j) + Σ_{j,k} g_k(t − t_j, x_{jk})    (10)

If instead we believe that the user feature changes the ad effect, then we can incorporate it into the model in the same way as any other ad feature, x_{jk}, thinking of the feature as "age when this ad was shown." Of course, unlike other ad features, which may vary amongst ads on the same path, this one will not, but that does not change how we write our model. As long as this feature varies across users, it will still be estimable.

4.1.5 Experimental Data

So far we have considered a general model, which is applicable to all data and which, absent additional assumptions or prior experimental data, estimates the correlational effect of ads on conversion intensity. However, as mentioned in the introduction, with experimental data we can measure the causal effect of ads on conversion rates. These can be incorporated into our model. The design of experiments measuring the causal effects of ads is highly dependent on the details of the media type and ad serving environment.
For our discussion, we assume a generic design in which some ads are shown ("exposed"), while others are withheld ("unexposed"), but where we still log when an advertiser's ad would have been shown had we not withheld it. We call the event where we receive a request for an ad a query event, and an event where the ad is actually returned an ad event. Thus when ads are shown, there are two simultaneous events, an ad event and a query event, while when ads are not shown there is a single query event. Then consider the model

log(λ(t)) = α + α_{age bucket} + Σ_j f(t − t_j) I{ad j shown} + Σ_{j,k} g_k(t − t_j, x_{jk}) I{ad j shown} + Σ_j m(t − t_j) + Σ_{j,k} n_k(t − t_j, x_{jk})    (11)

Here m_j(t − t_j) = m(t − t_j) + Σ_k n_k(t − t_j, x_{jk}) represents the observed change in a user's conversion rate (on a log scale) after the jth query. Note that as with the previous observational data, this query effect is not necessarily a causal effect: being targeted for an ad does not lead to you then making a purchase. If you use a search engine to search for "sneakers", you're more likely to buy sneakers after the query than a user who doesn't do that search, but it probably wasn't the query that caused that difference. Rather, you were interested in sneakers, and then you did the search.

The ad effect, separate from any query effects, for the jth ad is given by f_j(t − t_j) = f(t − t_j) + Σ_k g_k(t − t_j, x_{jk}). This is the additional increase (on a log scale) in a user's conversion intensity if the ad is actually shown. Figure 5 shows the intensity curves for a user who saw ads (solid line) and a user who had the same queries but for whom ads were withheld. Their difference would represent the ad effect.
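Under this generic design, Eq. (11) separates query effects from ad effects. A minimal sketch (ours; the curves m and f are placeholders, and the feature terms g_k, n_k are omitted for brevity):

```python
import math

# Placeholder curves: m is the query effect, f the additional ad effect
# when the ad was actually shown; both are zero before the event.
m = lambda dt: 0.4 * math.exp(-dt / 12.0)
f = lambda dt: 0.6 * math.exp(-dt / 48.0)

def log_intensity(t, queries, alpha=-4.0):
    """`queries` is a list of (time, ad_shown) pairs. Every query
    contributes m(.); queries where the ad was shown also contribute f(.)."""
    out = alpha
    for tj, shown in queries:
        if t <= tj:
            continue
        out += m(t - tj)
        if shown:
            out += f(t - tj)
    return out

# Exposed vs. unexposed user with identical query times: the difference in
# log-intensity is exactly the ad effect, sum_j f(t - t_j).
exposed = log_intensity(10.0, [(0.0, True), (5.0, True)])
control = log_intensity(10.0, [(0.0, False), (5.0, False)])
```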
Figure 5: Conversion intensity over time for a user who saw ads (solid line) vs. a user with the same queries for whom ads were held back (dashed line).

Whether our experiment is such that f_j, the ad effect for a user's jth query, is causally identifiable depends on the details of our experiment, as well as how ads are served in our setting. In the generic design described above, the overall causal effect of all the ads subject to ablation, i.e. the number of incremental conversions, is clearly identifiable. However, if there are multiple types of ads (e.g. comparing effects of first and second ad or search ads and display ads), then either changes in the experimental design (e.g. randomly ablating subsets of ads for each user) or additional assumptions (e.g. order of ads doesn't matter) may be needed for all the effects to be causally identifiable. The details of this kind of experimental design are out of scope for this paper. However, in our experience, quite often some function of the f_j has a valid causal estimator that can be derived from our model.

4.1.6 Other Factors

So far the model has assumed that only ads, queries, and user features change the conversion rate. However, there are other factors which can also change conversion rates. In general, our model is flexible enough to accommodate many of these.

One common factor is seasonal effects, such as holidays, or more granular effects due to time of day or day of week. Modeling these can be as simple as adding indicators for each hour of day, or as complex as fitting additional spline functions to capture the pattern.

Another factor is that conversions themselves affect future rates of conversion. At one extreme are cases where only a single conversion is possible for each user, such as if the conversion represents signing up for a service or downloading an app. In those cases we can treat the model as a single-occurrence survival model and estimate it accordingly.
In less extreme cases a conversion might decrease the likelihood of future conversions (e.g. someone buying a vacation package is probably less likely to buy a second one right away), but it can also increase the likelihood of a conversion (e.g. someone buying clothes from a retailer might be more likely to purchase from them again in the future). Thus treating conversions themselves as events that can change the post-conversion intensity may improve the model. This is done by [12], which models both ad clicks and conversions as mutually exciting Poisson processes, where the intensity of each process depends on both itself and the other processes. While we may not want to model the ad clicks explicitly in our setting, we may consider a similar approach of letting the conversion intensity depend on the number or timing of past conversions.
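One simple way to let the intensity depend on past conversions, sketched here under our own assumptions (the paper does not specify a form), is an additional term h over past conversion times, negative for inhibition or positive for excitation:

```python
import math

f = lambda dt: 0.5 * math.exp(-dt / 48.0)   # ad effect (placeholder)
h = lambda dt: -1.0 * math.exp(-dt / 24.0)  # post-conversion inhibition (placeholder)

def log_intensity(t, ad_times, conversion_times, alpha=-4.0):
    """Conversion intensity that also reacts to the user's own past
    conversions: log lambda(t) = alpha + sum_j f(t - t_j) + sum_c h(t - t_c).
    A negative h suppresses conversions right after a purchase (vacation
    packages); a positive h would instead model repeat-purchase excitation."""
    out = alpha
    out += sum(f(t - tj) for tj in ad_times if t > tj)
    out += sum(h(t - tc) for tc in conversion_times if t > tc)
    return out
```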
There are different ways to estimate the intensity function of a Poisson process. If f, g k are eitherpiecewise constant or can be approximated as such, then by breaking each user’s path into intervalswhere the intensity, λ ( t ), is constant, we can treat each interval as an observation in a Poissonregression problem. The number of conversions in the interval is then the response, and the lengthof the interval is the offset.There is a large literature on fitting these types of large-scale regression problems, Poisson andotherwise. The details are outside the scope of this paper and we will not attempt to survey theexisting technologies, but point to [7] as one example of a practical solution. There the authors placea prior on the parameters and use Bayesian machine learning methods to estimate the parametersand their prior variances, essentially treating the parameters as random effects.If we do not want to assume that f, g k are (approximately) piecewise constant, then we can usethe likelihood for an inhomogeneous Poisson process directly. Suppose we observe users for the timeinterval [0 , τ ]. Let λ i ( t ) be the conversion intensity for user i at time t , and let T ij , j = 1 , ..., C i bethe conversion times for user i . The log-likelihood is N (cid:88) i =1 − (cid:90) τ λ i ( t )d t + C i (cid:88) j =1 log( λ i ( T ij )) (12)See [2] for a detailed derivation. In the case where f, g k (and therefore λ ) are piecewise constant,this reduces to the log-likelihood for the Poisson regression approach above. Without the piecewiseconstant assumption, one could instead try standard optimization techniques, such as gradientdescent, to estimate the parameters for f, g k . Even given a model for conversions, there are still many ways to distribute credit for the observedconversions to preceding ads. Different methodologies with different properties may be desirable12epending on the context. 
The method we propose, which we call backwards elimination, differs from existing methods in how it distributes credit for conversions (or changes in conversion intensity) that only occur because users were shown multiple ads. Our algorithm tends to give this credit to later ads, while Shapley value-based methods, commonly used in the existing literature, divide this credit evenly amongst the ads. Both methods are reasonable, but they solve different problems, as we will discuss later.

We will first introduce backwards elimination in detail, including considerations for attribution with experimental data. We then examine more closely how backwards elimination distributes credit due to synergies between ads and compare this to Shapley values.
To illustrate the backwards elimination algorithm, consider a user path with three ads followed by a conversion at time t*, and whose estimated conversion intensity is plotted in Figure 6.
Figure 6: Attributed contribution of each ad to a user conversion occurring at t*.

We define the contribution from the last ad before the conversion to be the difference in the estimated conversion intensity at t* with all ads minus the estimated intensity at t* if the last ad is dropped. The contribution from the second-to-last ad is the difference in conversion intensity at time t* if the last ad is dropped minus the intensity if the last two ads are dropped. More generally, we proceed backwards through the path, removing an additional ad and attributing credit to the removed ad in proportion to the resulting change in intensity.

More formally, let A(n) = {(X_j, t_j) : j = 1, ..., n} denote the first n ads on user i's path, where we have dropped the index i for convenience, and where X_j = (x_j1, ..., x_jk, ..., x_jK) is the vector of ad features in our model. Without loss of generality, suppose that t_n < t* < t_{n+1}, i.e. that the conversion of interest occurs after the nth ad but before the (n+1)st ad. Let λ̂(t, A(j)) be the estimated conversion intensity at time t for a user who sees the ads in A(j). Then the raw credit from our algorithm is

    RawCredit(j) = λ̂(t*, A(j)) − λ̂(t*, A(j−1))    (13)

where A(0) = ∅. We can also define the baseline credit as RawCredit(baseline) = λ̂(t*, ∅). Notice that the total raw credit given to ads equals λ̂(t*, A(n)) − λ̂(t*, ∅), which in turn equals the difference between the instantaneous conversion rate at t* and the rate if all ads are dropped.
This follows straightforwardly from the telescoping nature of the formula above.

As an example, suppose the estimated model for λ̂(t) is of the form

    log(λ̂(t)) = α̂ + α̂_{age bucket} + Σ_j f̂(t − t_j) + Σ_{j,k} ĝ_k(t − t_j, x_jk)    (14)

Then for the lth ad, we have

    log(λ̂(t*, A(l))) = α̂ + α̂_{age bucket} + Σ_{j=1}^{l} f̂(t* − t_j) + Σ_{j=1}^{l} Σ_k ĝ_k(t* − t_j, x_jk)    (15)

With this model, for successive ads λ̂(t*, A(l)) will differ from λ̂(t*, A(l−1)) by a factor of exp(f̂(t* − t_l) + Σ_k ĝ_k(t* − t_l, x_lk)), so that

    RawCredit(l) = λ̂(t*, A(l)) − λ̂(t*, A(l−1)) = λ̂(t*, A(l−1)) ( exp( f̂(t* − t_l) + Σ_k ĝ_k(t* − t_l, x_lk) ) − 1 )    (16)

There are two natural ways to normalize the raw credit. We can consider either

    NormalizedCredit(j) = ( λ̂(t*, A(j)) − λ̂(t*, A(j−1)) ) / λ̂(t*, A(n))    (17)

or

    NonBaselineNormalizedCredit(j) = ( λ̂(t*, A(j)) − λ̂(t*, A(j−1)) ) / ( λ̂(t*, A(n)) − λ̂(t*, ∅) )    (18)

If we use NormalizedCredit(j), and normalize RawCredit(baseline) in a similar way, then the total ad credit plus the baseline credit will equal the number of conversions.
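For a log-linear model of this kind, the backwards elimination credits can be sketched in a few lines (Python; the names are ours, and for brevity the g_k terms of equation 14 are folded into a single function f̂):

```python
import math

def lambda_hat(t, ads, alpha, f):
    """Estimated conversion intensity at time t for a path `ads` of
    (time, features) pairs, under a simplified log-linear model
    log lambda = alpha + sum_j f(t - t_j, X_j)."""
    return math.exp(alpha + sum(f(t - tj, xj) for tj, xj in ads))

def backwards_elimination(t_star, ads, alpha, f):
    """Raw credit (equation 13) and normalized credit (equation 17)
    for a conversion at t_star, removing one ad at a time from the
    end of the path. Returns [(ad index or 'baseline', raw, normalized)]."""
    full = lambda_hat(t_star, ads, alpha, f)
    out = []
    for j in range(len(ads), 0, -1):  # last ad first
        raw = (lambda_hat(t_star, ads[:j], alpha, f)
               - lambda_hat(t_star, ads[:j - 1], alpha, f))
        out.append((j, raw, raw / full))
    base = lambda_hat(t_star, [], alpha, f)
    out.append(("baseline", base, base / full))
    return out
```

Because the raw credits telescope to λ̂(t*, A(n)), the normalized credits (including the baseline) sum to 1 for each conversion, as the text notes.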
If we instead use NonBaselineNormalizedCredit(j), then the total ad credit will equal the total number of conversions.

It can be shown that, assuming our estimated intensity is correct, the total expected credit from all conversion occurrences in the path is

    E[NormalizedCredit(j)] = ∫ ( λ̂(t, A(j)) − λ̂(t, A(j−1)) ) dt

or the difference in the expected number of conversions for a user seeing the ads in A(j) versus a user seeing the ads in A(j−1). In other words, the expected credit of ad j is the expected number of additional conversions gained when it was added to the end of the path, without considering gains due to combining ad j with later ads.

One implication of this is that for a path with multiple ads, credit for "extra" conversions that occur because the ads are shown to the same user, rather than each ad being shown to different users, goes to the later ads, at least in expectation. In fact, this holds not just in expectation and is actually a property of the raw credit, not just the normalized credit. The details are in Section 4.2.3, where we give an example and compare how backwards elimination and Shapley values divide these "extra" conversions.

Suppose that we have built a model for λ̂(t) with experimental data and separate ad and query effects. In this case we want to attribute credit to ads based only on the increase in conversion intensity caused by ads and not the query effect. For concreteness, suppose the model is the same as equation 11, i.e.:

    log(λ(t)) = α + α_{age bucket} + Σ_j f(t − t_j) I{ad j shown} + Σ_{j,k} g_k(t − t_j, x_jk) I{ad j shown} + Σ_j m(t − t_j) + Σ_{j,k} n_k(t − t_j, x_jk)    (19)

We define the attribution credit in the same way as before, removing a single ad at a time and observing the change in λ̂(t*), but now we keep all the query effects throughout.
Assuming, as before, that there are exactly n ads before the conversion, we have

    log(λ̂(t*, A(l))) = α̂ + α̂_{age bucket} + Σ_{j=1}^{l} f̂(t* − t_j) I{ad j shown} + Σ_{j=1}^{l} Σ_k ĝ_k(t* − t_j, x_jk) I{ad j shown} + Σ_{j=1}^{n} m̂(t* − t_j) + Σ_{j=1}^{n} Σ_k n̂_k(t* − t_j, x_jk)    (20)

That is, the summations for f̂ and ĝ_k, the ad effects, consider only the ads in A(l), while the summations for m̂ and n̂_k, the query effects, consider all of the n events preceding the conversion. The overall raw credit given to ads is therefore the difference between the estimated conversion intensity at t* and the estimated conversion intensity for a counterfactual path with the same queries, but where ads were always withheld. As with the non-incremental credit, it can be shown that the expected value of the total normalized ad credit for each user equals the expected number of incremental conversions (i.e. the expected difference in conversions with and without ads) for that path. See the appendix for details. Thus while it may seem strange to have an ad's credit depend on features of later queries, this actually leads to the most intuitive value of the individual and total ad credit.

We mentioned previously that backwards elimination assigns credit for conversions requiring multiple ads to the last ad in the group. In this section, we make that more precise and compare how backwards elimination and Shapley values divide this credit. We start by defining what we mean by an ad's marginal effect as well as "conversions requiring multiple ads."

Define the marginal credit of a set of one or more ads A to be m(A) = λ̂(A) − λ̂(∅), where we have dropped the t* for brevity. This marginal credit is exactly the difference in conversion intensity at conversion time with exactly these ads versus without any ads.
When our algorithm starts, it has m(A(n)) units of raw credit to distribute, and when it gets to the jth ad it has m(A(j)) units of credit left to distribute, by definition. If we think of the algorithm as splitting credit between two groups of ads, A(j−1) and the singleton jth ad, {A_j} = {(t_j, X_j)}, then the credit given to the earlier ads, A(j−1), is m(A(j−1)), while the credit given to {A_j} is m(A(j)) − m(A(j−1)). We define the synergy between A(j−1) and {A_j} as

    S(A(j−1), A_j) = m(A(j)) − m(A(j−1)) − m(A_j)    (21)

i.e. the difference between the credit received by A_j when seen after the ads in A(j−1) and the marginal ad credit for a path containing only {A_j}. When the conversion intensity is superadditive, so that showing multiple ads leads to more conversions than the sum of the marginal increase in conversions from showing each ad individually, S(A(j−1), A_j) is positive and backwards elimination distributes the extra synergy credit to the later ad. When the conversion intensity is subadditive, so that showing multiple ads leads to fewer conversions than the sum of the marginal increase in conversions from showing each ad individually, S(A(j−1), A_j) is negative, and the later ad gets credit smaller than its marginal credit. Example:
Suppose that a user path has 2 ads, A_1 and A_2, prior to a conversion, and suppose that λ̂(t*, {A_1, A_2}) = exp(f_1(t* − t_1) + f_2(t* − t_2)), i.e. with no other interactions and with the baseline rate equal to 1. Suppose also that λ̂(t*, {A_1}) = exp(f_1(t* − t_1)) = 2 and λ̂(t*, {A_2}) = exp(f_2(t* − t_2)) = 3; in other words, the contribution from each ad is the same as if it were the only ad in the path. Then the overall raw ad credit for the 2-ad path will be m({A_1, A_2}) = λ̂(t*) − λ̂(∅) = 2 · 3 − 1 = 5, while m(A_1) = 1 and m(A_2) = 2. The raw credit of these ads is RawCredit(A_1) = 2 − 1 = 1 and RawCredit(A_2) = 2 · 3 − 2 = 4. Thus the credit for the second ad equals its marginal credit (m(A_2) = 2), plus the synergy between the first and second ads, S({A_1}, A_2) = m({A_1, A_2}) − m(A_1) − m(A_2) = 5 − 1 − 2 = 2. In this example the synergy can be written as

    S({A_1}, A_2) = ( exp(f_1(t* − t_1)) − 1 ) ( exp(f_2(t* − t_2)) − 1 )    (24)

Assuming the f_i are decreasing functions, and fixing both the conversion time, t*, as well as the time between the last ad and the conversion, t* − t_2, then as the time between the ads, t_2 − t_1, increases, the synergy will generally decrease, since t* − t_1 will increase, causing the first term to decrease. In other words, if the ads are far apart, their synergy will be smaller. In fact, the synergy can be defined more generally for a single ad A and a set of ads A' as S(A', A) = m(A' ∪ {A}) − m(A') − m(A), but we will not require this more general definition.
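The arithmetic in this example can be checked mechanically. The sketch below (our own notation, not the paper's) computes marginal credits and the synergy from any intensity function over sets of ads:

```python
def marginal_credit(lam, ads):
    """m(A) = lambda_hat(A) - lambda_hat(empty set), t* suppressed as in the text."""
    return lam(ads) - lam(frozenset())

def synergy(lam, earlier, ad):
    """S(A', A) = m(A' union {A}) - m(A') - m({A})."""
    return (marginal_credit(lam, earlier | {ad})
            - marginal_credit(lam, earlier)
            - marginal_credit(lam, frozenset({ad})))

# The worked example: baseline 1, lambda_hat({A1}) = 2, lambda_hat({A2}) = 3,
# and no interaction term, so the multipliers simply multiply.
MULTIPLIER = {"A1": 2.0, "A2": 3.0}

def lam(ads):
    out = 1.0  # baseline rate
    for a in ads:
        out *= MULTIPLIER[a]
    return out
```

Running `synergy(lam, frozenset({"A1"}), "A2")` reproduces the value 2 derived above, matching (2 − 1)(3 − 1) from equation 24.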
While the synergy in the example is positive, negative synergy can occur when there is diminishing or even zero marginal benefit to showing multiple ads, perhaps due to user ad fatigue. As a numerical example, consider the case where λ̂(t*, {A_1}) = 2 and λ̂(t*, {A_2}) = 3 as before, but λ̂(t*, {A_1, A_2}) = 3. In other words, the fact that the user has seen multiple ads doesn't increase the intensity; only the timing of the last ad matters. This can be parameterized by allowing a dependence between ads, as described in Equation 9. In this case RawCredit(A_1) = 1 as before, but RawCredit(A_2) = 3 − 2 = 1 < 2 = m(A_2), implying that the synergy is negative and that we would be better off showing the ads to different users rather than the same user.

Giving all of this synergy or interaction credit to the last ad may seem unfair, since the additional increase or decrease in the conversion intensity only happens if all of the ads occur. However, recall that one potential use case for attribution is as an input to bidding, leading to higher bids on ads that drive more conversions. Giving the synergy to the last ad rather than the earlier ad reflects our knowledge when we are bidding on the earlier ad. At that time, without either additional assumptions or modeling, we do not know if this user will have future ad impressions for us to bid on; therefore one reasonable approach is to bid based on the marginal effect of this current ad, together with its synergies with past ads, while ignoring any synergies between the ad being bid on and future ads (which may or may not occur). Thus, in this application, backwards elimination might be desirable precisely because it gives all the synergy credit to the last ad.

Other methodologies, such as the popular Shapley value method used in [4, 11, 6], split this synergy evenly amongst the ads involved.
Shapley values come from game theory and try to fairly divide the payoff (increase in intensity) for a coalition (group of ads) amongst the players (individual ads). This is done by giving each player credit proportional to its average marginal contribution to all possible coalitions (subsets of ads). More precisely, in a game with player set Ω (|Ω| = N) and value function v, the payoff to player j is given by:

    φ_j(v) = Σ_{O ⊆ Ω \ {j}} [ |O|! (N − |O| − 1)! / N! ] [ v(O ∪ {j}) − v(O) ]    (25)

In the case of ad attribution, the value function is v(·) = λ̂(t*, ·) (or equivalently, v(A) = λ̂(t*, A) − λ̂(t*, ∅) for any set A), while the set of players equals the set of all ads before the conversion, i.e. Ω = A(n), and j is the index of the ad receiving credit.

For a path with only a single ad, both backwards elimination and Shapley values will result in the same credit assignment. Similarly, the total credit given to all ads (as opposed to the baseline) is also the same for both methods. However, if there are 2 or more ads with nonzero synergy, the credit assignment will differ.
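For short paths, equation 25 can be evaluated exactly by enumerating coalitions. A sketch (our own; `v` maps a frozenset of players to its value, and the cost grows as 2^N, so this is only practical for short paths):

```python
from itertools import combinations
from math import factorial

def shapley_credit(v, players):
    """Exact Shapley values (equation 25) by enumerating all coalitions
    O of the other players, weighted by |O|! (N - |O| - 1)! / N!."""
    n = len(players)
    credit = {}
    for j in players:
        others = [p for p in players if p != j]
        total = 0.0
        for size in range(n):
            for coal in combinations(others, size):
                o = frozenset(coal)
                weight = factorial(len(o)) * factorial(n - len(o) - 1) / factorial(n)
                total += weight * (v(o | {j}) - v(o))
        credit[j] = total
    return credit
```

With the baseline-subtracted values of the earlier two-ad example (v(∅) = 0, v({A_1}) = 1, v({A_2}) = 2, v({A_1, A_2}) = 5), this returns credits of 2 and 3, agreeing with the Shapley credits derived in the text below.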
Continuing with our previous example, it is straightforward that the ad credit for A_1 will be

    ShapleyRawCredit(A_1) = (1/2) ( λ̂(t*, {A_1}) − λ̂(t*, ∅) ) + (1/2) ( λ̂(t*, {A_1, A_2}) − λ̂(t*, {A_2}) ) = 2    (26)

Similarly, we will have

    ShapleyRawCredit(A_2) = 3    (27)

This gives more credit to the first ad compared to the backwards elimination method. In fact, its credit equals its marginal effect, 1, plus half of the synergy between the ads, (1/2) · 2 = 1. Similarly, the credit for the second ad equals its marginal effect plus half the synergy.

More generally, it can be shown that for synergy requiring k ads, Shapley values give each ad additional credit equal to 1/k of that synergy. The proof is similar to Theorem 1 in [14]. For some applications or media types, this strategy of dividing the synergy may be preferred. For example, when retrospectively examining data for a group of ad campaigns, splitting the synergy may be a more desirable way to compare the relative contributions of the campaigns.

While backwards elimination and Shapley values divide the synergy credit differently, both depend on an ad's contribution to λ(t*). In particular, this implies that given a suitable model for λ(t), both methods will satisfy the requirement. In that sense, both are reasonable choices. Given a model for λ(t), the two methods can be compared based on the specific goals of the application.

To evaluate this system, we use standard model fit metrics for a Poisson regression, such as the log-likelihood or Poisson loss. There is also the prediction bias: predicted conversions / observed conversions − 1. We can also consider sliced versions of these metrics to aid in model comparison for feature selection.

With experimental data, we can estimate the ground truth number of incremental conversions that are caused by ads without needing a model, by simply comparing the number of conversions in the exposed and unexposed groups. To evaluate our model, which is fit using both the exposed and unexposed users, we can compare this ground truth estimate to the predicted incremental conversions obtained by comparing the model's predicted number of conversions in the exposed and unexposed groups. If our goal is to correctly model incrementality, comparing the ground truth and predicted versions of incremental conversions (i.e. the difference between groups) may be of greater interest than comparing the actual and predicted conversions in each group.
As a way of evaluating both our model and credit assignment methodology, we can also compare the ground truth incremental conversions to the normalized version of the ad credit. To justify this comparison, recall that as discussed in Section 4.2.2, the expected value of the ad credit should equal the expected number of incremental conversions, assuming the model estimates are correct.

The exact estimators for these metrics depend on the details of how the experiment is run, which in turn depend on the details of the media type and ad serving environment. For simplicity and specificity, suppose again that the experiment splits users for the duration of the experiment into either an exposed group, which sees ads as normal, or an unexposed group, for whom ads are withheld, but that we are able to observe the conversions of both groups. Then we can consider the following ground truth metric, which we call incremental conversions per user:
    ICPU = (conversions in exposed group / users in exposed group) − (conversions in unexposed group / users in unexposed group)    (28)

This is essentially the observed difference in the average conversion rate between exposed and unexposed users. We can also consider a version normalized by observation time rather than by users, incremental conversions per unit time:

    ICPT = (conversions in exposed group / total observation time for users in exposed group) − (conversions in unexposed group / total observation time for users in unexposed group)    (29)

where the total observation time is defined as

    total observation time for users in group X = Σ_{i ∈ X} (total time that user i is observed in the experiment)    (30)

If all users are observed for the same length of time in the experiment, then ICPT differs from ICPU by a constant factor. However, if the average time that each user is observed for differs between exposed and unexposed, then the difference is more complicated and ICPT may be a more useful metric.

We can also consider ICPE, or incremental conversions per exposed conversion:

    ICPE = ICPU × (users in exposed group / conversions in exposed group)    (31)

This represents the proportion of conversions in the exposed group that are incremental, after normalizing for any differences in the sizes of the exposed and unexposed groups. As with ICPT compared to ICPU, if the average time that each user is observed for differs between exposed and unexposed, we may prefer a version of ICPE that accounts for this:

    ICPE′ = ICPT × (total observation time for users in exposed group / conversions in exposed group)    (32)

We can compare each of these with their counterparts predicted by the model, which we call PICPU (predicted incremental conversions per user) and PICPPE (predicted incremental conversions per predicted exposed conversion). Their definitions are the same, but with "conversions" replaced by "predicted conversions".
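These metrics are simple functions of experiment-level counts; a minimal sketch (function names ours) makes the relationships between equations 28, 29, and 31 explicit:

```python
def icpu(conv_exp, users_exp, conv_unexp, users_unexp):
    """Incremental conversions per user (equation 28)."""
    return conv_exp / users_exp - conv_unexp / users_unexp

def icpt(conv_exp, time_exp, conv_unexp, time_unexp):
    """Incremental conversions per unit of observation time (equation 29)."""
    return conv_exp / time_exp - conv_unexp / time_unexp

def icpe(conv_exp, users_exp, conv_unexp, users_unexp):
    """Proportion of exposed conversions that are incremental (equation 31)."""
    return icpu(conv_exp, users_exp, conv_unexp, users_unexp) * users_exp / conv_exp
```

For example, with 1,120 conversions among 10,000 exposed users and 1,000 conversions among 10,000 unexposed users, ICPU is 0.012 and ICPE is 0.012 × 10,000 / 1,120, i.e. roughly 10.7% of exposed conversions are incremental.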
We can also evaluate our attribution methodology by considering the attribution credit, normalized by λ̂(t*), given to ads for conversions that occur in the exposed group. This metric, which we call AICPE (attributed incremental conversions per exposed conversion), represents the proportion of credit for exposed conversions that is given to ads and is comparable to PICPPE and ICPE. Comparing PICPPE and AICPE against ICPE is therefore an appropriate validation metric for our model and attribution methodology, respectively. Confidence intervals for these metrics can be obtained by bootstrapping over users. Alternatively, we could incorporate the uncertainty over the model itself by bootstrapping, or by using a block jackknife to refit the model, computing the evaluation metrics per block, and then computing the variability of this estimate over the blocks.

For feature selection, it can be helpful to compare these metrics on various slices of users; however, we must be cautious in choosing features to slice on. In particular, we must avoid any confounding between a user's treatment assignment, slice value, and conversion outcome. For example, if ad exposure increases the length of a user's path, perhaps because seeing ads leads to increased activity, then slicing by the total number of queries in a path leads to incomparable sets of exposed and unexposed users. However, slicing on user features that are unaffected by the experiment, such as city or gender, is still valid and can be useful both for feature selection and understanding user behavior.

Simulation Study
We consider several simulation scenarios and show that our system performs well in these. In general, simulating user paths and conversion behavior requires strong assumptions about the generating process for both ad exposures and conversions. These assumptions are necessarily much simpler than actual user behavior. Nevertheless, simulations can still provide confidence by showing that the model is working as expected in these cases.

For each of the scenarios below, we simulate 500 distinct data sets, each with 1 million users and 30 days of data per user. We report the average coefficient estimates across these 500 data sets, and use the 0.025 and 0.975 quantiles across data sets to construct 95% confidence intervals.
Here we simulate exactly one ad per user, with the ad occurrence time t_1 being uniform in our 30 day observation window, i.e. [0, 30]. We simulate the conversion intensity as

    log(λ(t)) = α + β_1 I{0 < t − t_1 ≤ 1} + β_2 I{1 < t − t_1 ≤ 2} + β_3 I{2 < t − t_1 ≤ 30}    (33)

where we have rescaled time to be measured in days rather than hours.

Since we are assuming a piecewise constant ad effect, the conversion intensity is piecewise constant overall, and so we can simulate from it by simulating from a Poisson distribution with appropriate mean and offset. We then use this data to fit the model above and repeat this for each of the 500 simulated datasets, resulting in 500 model estimates, each fit on an independent set of 1 million users. The table shows the true values used in our simulations, together with the parameter estimates (averaged over the 500 repetitions) and CIs (from the observed quantiles across 500 repetitions).

    Parameter                     Ground Truth   Mean (across 500 datasets)   CI [p2.5, p97.5] (across 500 datasets)
    Baseline (per day) [exp(α)]   0.0333         0.0333                       [0.0332, 0.0334]
    Short term [exp(β_1)]         2.0            2.000                        [1.983, 2.017]
    Medium term [exp(β_2)]        1.5            1.498                        [1.485, 1.583]
    Long term [exp(β_3)]          1.2            1.200                        [1.195, 1.204]

Table 1: Ground truth and estimated model coefficients for Scenario 1.

Using our attribution methodology, and normalizing by λ̂(t*), we get a mean AICPE of 11.94% [11.76%, 12.21%], with the remaining credit going to the baseline. This model was not fit on simulated experimental data and no query effect was fit, so there is technically no ICPE to compare this to. However, we can compare it to what the ICPE would be if we assumed there was no query effect and we simulated additional users who have an ad query but don't see an ad, i.e. whose conversion intensity is equal to the baseline for the entire observation window. This is equivalent to an experimental model where there is no query effect: i.e. when the observed and incremental effect of ads is the same. The corresponding ICPE for these users is 11.95% [11.83%, 12.06%].
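Because the intensity in equation 33 is piecewise constant, one user's conversions can be simulated with one Poisson draw per constant piece. The sketch below is our own illustration of that procedure, not the authors' code; the Poisson sampler is Knuth's inversion method, adequate for the small per-piece means here.

```python
import math
import random

def poisson_draw(mean, rng):
    """Knuth's multiplicative Poisson sampler (fine for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_user(alpha, betas, tau=30.0, rng=random):
    """Simulate one user under the Scenario 1 model: one ad at a uniform
    time in [0, tau], with log-multipliers `betas` over lags (0,1], (1,2],
    (2,30] days. Returns (ad_time, conversion counts per constant piece)."""
    t_ad = rng.uniform(0.0, tau)
    base = math.exp(alpha)
    # piece boundaries, clipped to the observation window
    edges = [0.0, t_ad, min(t_ad + 1.0, tau), min(t_ad + 2.0, tau), tau]
    log_mults = [0.0, betas[0], betas[1], betas[2]]  # pre-ad piece: no ad effect
    counts = []
    for (a, b), beta in zip(zip(edges, edges[1:]), log_mults):
        length = max(b - a, 0.0)
        rate = base * math.exp(beta) * length  # Poisson mean for this piece
        counts.append(poisson_draw(rate, rng))
    return t_ad, counts
```

Summing the counts over many simulated users, bucketed by lag, recovers the interval/offset data that the Poisson regression in Section 3 is fit on.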
In this scenario we introduce a second type of ad for which there is no long-term effect. In particular, we assume that this second type of ad increases the conversion intensity by a factor of 1.5x on the first day, 1.2x on the second day, and 1.0x (or no increase) after that.

We will simulate each user as having exactly 2 ads, 1 of each type. The occurrence times for each ad are independent of each other, with each being uniform in [0, 30]. We simulate the conversion intensity as

    log(λ(t)) = α + β_1 I{0 < t − t_{ad type 1} ≤ 1} + β_2 I{1 < t − t_{ad type 1} ≤ 2} + β_3 I{2 < t − t_{ad type 1} ≤ 30}
              + β_4 I{0 < t − t_{ad type 2} ≤ 1} + β_5 I{1 < t − t_{ad type 2} ≤ 2} + β_6 I{2 < t − t_{ad type 2} ≤ 30}    (34)

where t_{ad type i} is the occurrence time for the ad of type i. The table shows the true values used in our simulations, together with the parameter estimates and CIs, derived as before.

    Parameter                           Ground Truth   Mean (across 500 datasets)   CI [p2.5, p97.5] (across 500 datasets)
    Baseline (per day) [exp(α)]         0.0333         0.0333                       [0.0332, 0.0334]
    Ad Type 1: Short term [exp(β_1)]    2.0            2.000                        [1.983, 2.016]
    Ad Type 1: Medium term [exp(β_2)]   1.5            1.498                        [1.483, 1.512]
    Ad Type 1: Long term [exp(β_3)]     1.2            1.200                        [1.195, 1.204]
    Ad Type 2: Short term [exp(β_4)]    1.5            1.498                        [1.485, 1.511]
    Ad Type 2: Medium term [exp(β_5)]   1.2            1.199                        [1.188, 1.210]
    Ad Type 2: Long term [exp(β_6)]     1.0            1.000                        [1.000, 1.002]

Table 2: Ground truth and estimated model coefficients for Scenario 2.

Using our attribution methodology, and normalizing by λ̂(t*), we get a mean AICPE of 13.90% [13.72%, 14.09%], with the remaining credit going to the baseline. As before, the model was not fit on simulated experimental data and no query effect was fit, so there is technically no ICPE to compare to. However, we can again compare to what the ICPE would be if we assumed there was no query effect and simulated additional users who have an ad query but don't see an ad, and whose conversion intensity is therefore equal to the baseline for the entire observation window. The corresponding ICPE is then 13.91% [13.79%, 14.02%].
In this scenario we simulate the second ad type as having no effect on users' conversion rates. This could occur for public service announcements or similar types of ads. The other simulation details and model formulation are as in Scenario 2. The table below shows the true values used in our simulations, together with the parameter estimates and CIs, derived as before.

    Parameter                           Ground Truth   Mean (across 500 datasets)   CI [p2.5, p97.5] (across 500 datasets)
    Baseline (per day) [exp(α)]         0.0333         0.0333                       [0.0332, 0.0334]
    Ad Type 1: Short term [exp(β_1)]    2.0            2.000                        [1.983, 2.017]
    Ad Type 1: Medium term [exp(β_2)]   1.5            1.498                        [1.483, 1.512]
    Ad Type 1: Long term [exp(β_3)]     1.2            1.200                        [1.195, 1.204]
    Ad Type 2: Short term [exp(β_4)]    1.0            1.000                        [0.992, 1.007]
    Ad Type 2: Medium term [exp(β_5)]   1.0            1.000                        [0.993, 1.007]
    Ad Type 2: Long term [exp(β_6)]     1.0            1.000                        [1.000, 1.003]

Table 3: Ground truth and estimated model coefficients for Scenario 3.

Using our attribution methodology, and normalizing by λ̂(t*), we get a mean AICPE of 11.94% [11.73%, 12.12%], with the remaining credit going to the baseline. This is nearly the same as in Scenario 1, which is expected since the second ad type has no effect. As before, the model was not fit on simulated experimental data, so there is technically no ICPE to compare to, but we can compare to what the ICPE would be if we assumed there was no query effect and simulated additional users who have an ad query but don't see an ad. The corresponding ICPE is then 11.95% [11.84%, 12.07%].

In this scenario we use the same two types of ads as in Scenario 2, but allow the number of ads per user to vary. We also allow more freedom in the model we fit than necessary to describe the data generating process for the simulations. The details are below.

We allow the number of ads per user to be either 1, 2, or 3.
The probability of each is proportional to the probability of the corresponding number of events for a Poisson random variable with mean 2. Equivalently, we can think of the number of ads for each user as being a Poisson(2) random variable that is clipped to be between 1 and 3. On average, 40.6% of the users have a single ad in the path, 27.1% have 2 ads, and 32.3% have 3 ads. Each ad is equally likely to be of either type. So conditional on a user seeing 3 ads total, they may see 0, 1, 2 or 3 ads of each type, as long as the overall number of ads is 3. The ad occurrence times are still independent and uniform on our 30 day observation window, i.e. on [0, 30].

Recall that ad type 1 increased the conversion intensity by a factor of 2x on the first day, 1.5x on the second day, and 1.2x after that, while ad type 2 increased the conversion intensity by a factor of 1.5x on the first day, 1.2x on the second day, and 1.0x (no increase) after that. We assume that the effect of a second or third ad of the same type is the same as the effect of the first ad. For example, if a user sees an ad of type 1 at time 0 and a second ad of type 1 one hour later, then the user's conversion intensity will be 2 · 2 = 4x higher than if they had seen no ads at all. More generally, we simulate the intensity as

    log(λ(t)) = α + β_1 #{i : 0 < t − t_i ≤ 1, ad type(i) = 1} + β_2 #{i : 1 < t − t_i ≤ 2, ad type(i) = 1} + β_3 #{i : 2 < t − t_i ≤ 30, ad type(i) = 1}
              + β_4 #{i : 0 < t − t_i ≤ 1, ad type(i) = 2} + β_5 #{i : 1 < t − t_i ≤ 2, ad type(i) = 2} + β_6 #{i : 2 < t − t_i ≤ 30, ad type(i) = 2}    (35)

where #{i : conditions} denotes the number of ads i satisfying the conditions.
Equivalently, we could write

    log(λ(t)) = α + Σ_{i : ad type(i) = 1} [ β_1 I{0 < t − t_i ≤ 1} + β_2 I{1 < t − t_i ≤ 2} + β_3 I{2 < t − t_i ≤ 30} ]
              + Σ_{i : ad type(i) = 2} [ β_4 I{0 < t − t_i ≤ 1} + β_5 I{1 < t − t_i ≤ 2} + β_6 I{2 < t − t_i ≤ 30} ]    (36)

However, when we fit the model, we do not assume that the second and third ad of the same type have the same effect as the first. Instead we allow for more degrees of freedom. For this section we will use γ to denote the coefficients estimated by the model, while continuing to use β for the true coefficients used to generate the simulation data. Then the model we fit is

    log(λ(t)) = α + Σ_{k=1}^{3} [ γ_{11k} I{exactly k type 1 ads with 0 < t − t_i ≤ 1} + γ_{12k} I{exactly k type 1 ads with 1 < t − t_i ≤ 2} + γ_{13k} I{exactly k type 1 ads with 2 < t − t_i ≤ 30} ]
              + Σ_{k=1}^{3} [ γ_{21k} I{exactly k type 2 ads with 0 < t − t_i ≤ 1} + γ_{22k} I{exactly k type 2 ads with 1 < t − t_i ≤ 2} + γ_{23k} I{exactly k type 2 ads with 2 < t − t_i ≤ 30} ]    (37)

For each γ_ijk, i indexes the ad type (1 or 2), j indexes the time interval or term (1 for (0, 1] (short term), 2 for (1, 2] (medium term), and 3 for (2, 30] (long term)), while k indexes the exact number of ads of type i in interval type j that is under consideration. In particular, it is straightforward to see that the ground truth value for γ_{1jk} is k · β_j, or on the original scale exp(γ_{1jk}) should equal exp(k · β_j). Similarly, γ_{2jk} corresponds to k · β_{j+3}, or on the original scale exp(γ_{2jk}) corresponds to exp(k · β_{j+3}).

As before, we simulate 500 independent data sets, each with 1 million users, and fit models independently on each data set. The table below shows the average parameter estimate across the 500 data sets as well as the CI, derived by taking the 2.5% and 97.5% quantiles of the estimates across the 500 data sets. The first three columns specify the parameter being estimated, corresponding to i, j, and k in γ_ijk.
The fourth column gives the ground truth, giving both the numerical value and the formulation in terms of the β_i. The last two columns are the mean and CI.

We notice that:

• For the short and medium term effects in the 3 ad case (3 ads in a term), the average parameter estimate can be quite far from the truth, and the confidence interval can be quite wide, especially relative to previous simulations.

• For the short and medium term effects in the 2 ad case, the average parameter estimate is reasonably accurate, but the CIs are wider than before.

• For the long term effects in the 2 and 3 ad cases, the average parameter estimates are reasonably accurate and the CIs are only slightly wider than before.

• For short, medium, and long term effects, performance in the 1 ad case is comparable to previous simulations.

These results can be explained by the relative sums of the interval lengths for which the ad effects are active. In the remainder of this section we will give numerical results from the simulation as well as intuition. Readers who are not interested in the details may skip to the next section.

The model fit in this scenario does not assume that the effect of the second or third ad is the same as the first. As a result, each γ_ijk is estimated based only on data from the sections of user paths where it is active (i.e. where the corresponding indicator function is not 0). The amount of data can be quantified by summing up the interval lengths (i.e. summing the offsets where the feature is active). The table below gives the average (across the 500 data sets) interval length, rounded to the nearest day, for which each γ_ijk is active. Since the two ad types are symmetric in this respect, their results are quite similar; we show them side by side to save space.

The interval lengths are in line with the observations we made earlier.
Intuitively, we can think of these interval lengths as depending on two factors: the frequency (number of users) for which γ_ijk is active, and the length of time per user that it is active.

First consider the frequency, or the number of users for which γ_ijk is active. Recall that users have a 32% chance of having 3 ads and a 27% chance of having 2 ads total. Moreover, given that users have 3 (resp. 2) ads total, the chance that they are all of the same type is only 1/4 (resp. 1/2). Consider first the long-term effects of having multiple ads of the same type. We focus on the effect of having 3 ads of the same type: it is straightforward to see that this is the rarest of the three, and the computations for the case of 2 ads of the same type are more involved, since they must consider paths with both 2 ads total and 3 ads total. For the 3 ad case, we might expect all of the users with 3 ads of the same type (or 1 million × 0.32 × 1/4 = 80K users) to eventually have the long-term effect of 3 ads of the same type be active.
In fact, the average number of users with either of the two 3-ad long-term coefficients active is 65K, since we must account for the cases where the third ad occurs in the last two days of the observation period, in which case there is no point in the user path where the long-term effect of three ads is active.

Parameter to Estimate                        Ground Truth     Mean    CI
Baseline: E[exp(α)]                          0.0333           0.0333  [0.0332, 0.0334]
Ad Type 1, Short Term,  1 ad:  E[exp(γ)]     2.0   (exp(β))   1.999   [1.983, 2.015]
Ad Type 1, Short Term,  2 ads: E[exp(γ)]     4.0   (exp(2β))  3.984   [3.784, 4.183]
Ad Type 1, Short Term,  3 ads: E[exp(γ)]     8.0   (exp(3β))  4.906   [2.523, 7.746]
Ad Type 1, Medium Term, 1 ad:  E[exp(γ)]     1.5   (exp(β))   1.498   [1.485, 1.512]
Ad Type 1, Medium Term, 2 ads: E[exp(γ)]     2.25  (exp(2β))  2.213   [2.059, 2.386]
Ad Type 1, Medium Term, 3 ads: E[exp(γ)]     3.375 (exp(3β))  1.850   [0.666, 3.676]
Ad Type 1, Long Term,   1 ad:  E[exp(γ)]     1.2   (exp(β))   1.200   [1.196, 1.205]
Ad Type 1, Long Term,   2 ads: E[exp(γ)]     1.44  (exp(2β))  1.438   [1.429, 1.449]
Ad Type 1, Long Term,   3 ads: E[exp(γ)]     1.728 (exp(3β))  1.724   [1.692, 1.755]
Ad Type 2, Short Term,  1 ad:  E[exp(γ)]     1.5   (exp(β))   1.502   [1.488, 1.516]
Ad Type 2, Short Term,  2 ads: E[exp(γ)]     2.25  (exp(2β))  2.270   [2.109, 2.430]
Ad Type 2, Short Term,  3 ads: E[exp(γ)]     3.375 (exp(3β))  2.138   [0.735, 4.346]
Ad Type 2, Medium Term, 1 ad:  E[exp(γ)]     1.2   (exp(β))   1.201   [1.189, 1.211]
Ad Type 2, Medium Term, 2 ads: E[exp(γ)]     1.44  (exp(2β))  1.427   [1.300, 1.548]
Ad Type 2, Medium Term, 3 ads: E[exp(γ)]     1.728 (exp(3β))  1.130   [0.876, 1.958]
Ad Type 2, Long Term,   1 ad:  E[exp(γ)]     1.0   (exp(β))   1.000   [0.998, 1.003]
Ad Type 2, Long Term,   2 ads: E[exp(γ)]     1.0   (exp(2β))  1.000   [0.998, 1.001]
Ad Type 2, Long Term,   3 ads: E[exp(γ)]     1.0   (exp(3β))  1.000   [0.998, 1.000]

Table 4: Ground truth and estimated model coefficients for Scenario 4.

Parameter to Estimate           Ad Type 1 Offsets (days)   Ad Type 2 Offsets (days)
Short Term,  1 ad:  E[exp(γ)]   922,567                    922,528
Short Term,  2 ads: E[exp(γ)]   9,972                      9,977
Short Term,  3 ads: E[exp(γ)]   44                         44
Medium Term, 1 ad:  E[exp(γ)]   891,290                    891,257
Medium Term, 2 ads: E[exp(γ)]   9,633                      9,636
Medium Term, 3 ads: E[exp(γ)]   43                         42
Long Term,   1 ad:  E[exp(γ)]   8,173,729                  8,172,853
Long Term,   2 ads: E[exp(γ)]   1,831,242                  1,831,566
Long Term,   3 ads: E[exp(γ)]   229,930                    230,007

Table 5: Average observed offset length for model coefficients in Scenario 4.
Nevertheless, the take-away is that a fairly large number of users will at some point in their path have the long term effects of having multiple ads active. By contrast, the frequency of short and medium term effects is much lower. Not only does the user need to have 2 or 3 ads of the same type, but they must occur within a 24 hour period. This turns out to be fairly rare: on average, out of 1 million users there are approximately 262 users with 3 ads of one type within 24 hours and approximately 80K users with 2 ads of one type within 24 hours of each other, again split evenly amongst the two types. (In general, users with 3 ads of the same type within 24 hours will also have a brief window where the short or medium term effect of having 2 ads of the same type is active; however, at 262 such users, this does not affect the frequency much.)

Now consider the length of time for which the ad effects are active. The short and medium term effects of ads, regardless of the number of ads, can last at most 24 hours per ad, by definition. Depending on the number of ads in a user's path, they can have at most 3 × 24-hour intervals where the short or medium term effects of a single ad are active, or 1 × 24-hour interval where the short or medium term effects of 2 or 3 ads are active. However, in the 2 (resp. 3) ad case, the interval will generally be shorter than 24 hours, since for e.g. the short term it will start at the time of the second (resp. third) ad and end 24 hours after the first ad occurred. (Unless a third ad of the same type occurs within 24 hours of the first, in which case the corresponding interval for the effect of two ads of the same type will be shorter.) The effect of having multiple ads in the medium term will similarly be based on the same interval but shifted forward by 24 hours, although it can be shorter if it is cut off by the end of the 30 day observation window. By contrast, the intervals for the long term effects can be much longer, up to 28 days, with the length depending on the position of the ads in the path. While we still expect intervals for the […]

[…] λ(t∗), we get a mean AICPE of 13.87% [13.71%, 14.02%], with the remaining credit going to the baseline. While this is similar to Scenario 2, this is mostly a numerical coincidence: for any particular user, they may have more or fewer ads of each type relative to Scenario 2, and so the attributed ad credit for a user could go in either direction. It just happens to average out to a similar value. As before, the model was not fit on simulated experimental data, so there is technically no ICPE to compare to, but we can compare to what the ICPE would be if we assumed there was no query effect and simulated additional users who have an ad query but don't see an ad. The corresponding ICPE is then 13.87% [13.75%, 13.99%]. We can see that despite the fact that we have trouble estimating some of the rarer coefficients, the overall ad credit is still quite close to the ICPE, i.e. its accuracy is not much worse on average.

We have presented a data-driven attribution system based on estimating the effect of ads on a user's conversion rate per unit of time. This system satisfies our previously outlined requirements, namely that it can handle incomplete or censored data as well as take into account the times at which ads occur, not just their order, when assigning credit. These features of our system make it appropriate for use in real-time bidding, although the details are beyond the scope of this paper given the many application-specific considerations. We have also discussed some examples of how to use covariates to model the conversion intensity over time, although again the detailed choice of covariates is highly application-dependent. In this section, we will discuss two areas of potential future work.
The first relates to the relationship between ads, while the second concerns the effect of modeling assumptions on attribution outcomes.

Thus far, we have assumed that ads are independent of each other, in addition to the standard Poisson process assumption that conversions are independent of each other. However, for some types of media, seeing an ad can lead to a user seeing more ads in the future. This can be a direct result of a user's interactions with an ad, such as a click on a display ad that then leads to the user seeing further related ads. It can also occur as a result of more indirect interactions, such as when seeing an ad prompts a user to search for a related term, leading to them seeing additional related search ads. In these cases, to the extent that the later ads are caused by the earlier ads, some of the credit for the later ads should arguably be redistributed to the earlier ads. Our system does not currently account for these effects when allocating credit. One way to remedy this might be a multi-stage model, where we first use all the ads to predict conversions. Consider the earlier example where display ads can cause search ads. Suppose that in our original model, the search ad gets 0.6 credit and the display ad gets 0.4. We can fit a second model where we use display ads as the events and search ads (rather than conversions) as the response. We can then allocate credit for the search ad between the preceding display ads and a baseline. If, for example, in this second model 30% of the credit for the search ad occurring goes to the display ad, with the remaining 70% going to the baseline, then we can reallocate 30% of the conversion credit attributed to the search ad in the first model to the display ad, keeping 70% of the credit with the search ad. As a result, the total credit for the display ad would be 0.4 + 0.3 × 0.6 = 0.58 and the credit for the search ad would be 0.7 × 0.6 = 0.42. In practice, however, this requires a rich dataset in order to be able to detect these interactions and thus may not always be possible.

While we have discussed a few examples of features and structure for modeling λ(t), we have not discussed in detail how the model structure influences attribution results. Consider the simplistic model from our first simulation, generalized for the case where there are multiple ads in the path:

log(λ(t)) = α + β_1 I{0 < t − t_i ≤ 1 for some i} + β_2 I{1 < t − t_i ≤ 2 for some i} + β_3 I{2 < t − t_i ≤ 30 for some i}   (38)

If there were two ads in the 24 hours before a conversion, the second ad would not change the estimated conversion intensity at conversion time and would therefore get 0 credit, i.e. if we assume there are only two ads in the path, λ̂(t∗, A(2)) − λ̂(t∗, A(1)) = 0. If we find it unrealistic to distribute all the credit to the first ad rather than the second, then in this particular case we can simply model the marginal effects of an additional ad in each time bucket, similar to the model in simulation scenario 4, except that rather than modeling the effect of k ads, we model the change in the intensity when we go from k − 1 to k ads:

log(λ(t)) = α + β_1 I{0 < t − t_i ≤ 1 for some i} + β_2 I{1 < t − t_i ≤ 2 for some i} + β_3 I{2 < t − t_i ≤ 30 for some i} + β_4 I{0 < t − t_i ≤ 1 for two distinct values of i} + β_5 I{1 < t − t_i ≤ 2 for two distinct values of i} + β_6 I{2 < t − t_i ≤
30 for two distinct values of i}   (39)

More generally, we can consider a model where we estimate a separate ad decay function for each ad as in Equation 8:

log(λ(t)) = α + Σ_j f_j(t − t_j)   (40)

However, we cannot model an infinite number of ads in each term. At some point either data sparsity or regularization will lead to some of the later ad effects being estimated as 0. As a result, in some cases the last ads before a conversion may get 0 credit. One alternative to avoid this is to pool data and estimate some ads as having the same effect. This can be done by assuming that all ads j > j′ have the same effect:

log(λ(t)) = α + Σ_{j=1}^{j′} f_j(t − t_j) + Σ_{j=j′+1}^{J} f_{j′}(t − t_j)   (41)

Or we could pool in a more sophisticated way, by e.g. supposing that an ad with no other ads in the preceding X days "resets" the counter and has the same effect as the very first ad. This pooling makes it more likely that the last ads before a conversion will get non-zero attribution credit.

The larger point here is that the model structure and the way in which we pool data to estimate the different ad decay curves has repercussions on the results of the attribution algorithm. We can of course use overall model fit metrics to guide our choices, but this points to the necessity of considering a wide range of models. Care may also be needed, since attribution results may differ most for long paths, where we most care about MTA, while overall model fit may be comparable if long paths are relatively rare. As with most modeling choices, the right approach depends on the context and requires careful consideration from the modeller.

Acknowledgements
The authors would like to thank both past and present members of the TEDDA team at Google for their contributions to developing this methodology, as well as members from other Ads teams for their tremendous support.
A Appendix: Expected Value of NormalizedCredit(j)

In Section 4.2.1 we claim that, assuming that our estimated intensity is correct, we have

E[NormalizedCredit(j)] = ∫_{t_j}^{∞} [λ̂(t, A(j)) − λ̂(t, A(j−1))] dt   (42)

where the expectation is over all conversion occurrences in the path. Recall that

NormalizedCredit(j) = [λ̂(t∗, A(j)) − λ̂(t∗, A(j−1))] / λ̂(t∗, A(n))   (43)

where, without loss of generality, n was taken to be the number of ads at or before time t∗. Summing over the conversion occurrences t∗ and noting that λ(t∗, A(n)) = λ(t∗), we have

E[NormalizedCredit(j)] = E[ ∫_0^∞ ([λ(t, A(j)) − λ(t, A(j−1))] / λ(t)) dY(t) ]   (44)

where Y(t) is the measure for the occurrences in the Poisson process (i.e. the conversion occurrences). Since the expected value of an integral against dY(t) is the corresponding integral against λ(t) dt, we then have

= ∫_0^∞ ([λ(t, A(j)) − λ(t, A(j−1))] / λ(t)) λ(t) dt   (45)
= ∫_{t_j}^∞ [λ(t, A(j)) − λ(t, A(j−1))] dt   (46)

as claimed. The last equality follows because the intensities of a path with the first j − 1 ads and a path with the first j ads only differ after the j-th ad has occurred, so the integrand is 0 before t_j.

For non-incrementality models, the integral above can be computed (replacing λ with its estimate λ̂) as long as the features and times of the first j ads are known, since λ(t, A(j)) does not depend on events after t_j. However, as discussed in Section 4.2.2, for incrementality models, λ(t, A(j)) can depend on all query (i.e. non-ad) events between t_j and t, in addition to all ad events up until t_j. Therefore the integral can only be computed if all query events (but not necessarily ad events) that ever occur are known. More practically, for t much larger than t_j, we would expect the contribution from the j-th ad to be effectively 0. In practice, it should suffice to assume that query events are known as long as λ(t, A(j)) − λ(t, A(j−1)) […]
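The identity above can be checked by simulation for a toy one-ad model with a piecewise-constant intensity. This is a minimal sketch, not the paper's implementation: the parameter values, the single ad, and the thinning-based simulation are all assumptions made for illustration. The expected total NormalizedCredit(1) collected over a path's conversions should match the integral of λ(t, A(1)) − λ(t, A(0)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: baseline rate exp(α) = 0.05 conversions/day,
# and a single ad at t_ad that multiplies the rate by exp(β) = 3 for one day.
alpha, beta = np.log(0.05), np.log(3.0)
t_ad, horizon = 5.0, 30.0
n_paths = 100_000

def lam(t, with_ad=True):
    """Intensity λ(t, A(1)) of the one-ad path, or λ(t, A(0)) if with_ad=False."""
    in_window = with_ad * ((t > t_ad) & (t <= t_ad + 1.0))
    return np.exp(alpha + beta * in_window)

# Superposing n_paths i.i.d. copies gives one Poisson process with rate
# n_paths * λ(t); simulate it by thinning a homogeneous process at the max rate.
lam_max = np.exp(alpha + beta)
n_cand = rng.poisson(n_paths * lam_max * horizon)
ts = rng.uniform(0.0, horizon, n_cand)
conv = ts[rng.uniform(0.0, 1.0, n_cand) * lam_max < lam(ts)]

# Average total NormalizedCredit(1) per path vs. the integral in (42)/(46).
mc = np.sum((lam(conv) - lam(conv, with_ad=False)) / lam(conv)) / n_paths
exact = np.exp(alpha) * (np.exp(beta) - 1.0)  # active window has length 1 day
print(mc, exact)  # the two agree up to Monte Carlo error (both near 0.10)
```

Here λ(t, A(0)) is the baseline intensity, so each conversion's credit is the relative lift at its time; averaging over many paths recovers the closed-form integral.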