Capturing Delayed Feedback in Conversion Rate Prediction via Elapsed-Time Sampling
Jia-Qi Yang, Xiang Li, Shuguang Han, Tao Zhuang, De-Chuan Zhan, Xiaoyi Zeng, Bin Tong
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Alibaba Group, Hangzhou, China
{yangjq,zhandc}@lamda.nju.edu.cn
{leo.lx,shuguang.sh,zhuangtao.zt,yuanhan,tongbin.tb}@alibaba-inc.com
* Jia-Qi Yang performed this work as an intern at Alibaba.

Abstract
Conversion rate (CVR) prediction is one of the most critical tasks for digital display advertising. Commercial systems often need to update models in an online learning manner to catch up with the evolving data distribution. However, conversions usually do not happen immediately after user clicks. This may result in inaccurate labeling, which is called the delayed feedback problem. In previous studies, the delayed feedback problem is handled either by waiting for the positive label for a long period of time, or by consuming a negative sample on its arrival and then inserting a positive duplicate when a conversion happens later. Indeed, there is a trade-off between waiting for more accurate labels and utilizing fresh data, which is not considered in existing works. To strike a balance in this trade-off, we propose the Elapsed-Time Sampling Delayed Feedback Model (ES-DFM), which models the relationship between the observed conversion distribution and the true conversion distribution. We then optimize the expectation of the true conversion distribution via importance sampling under the elapsed-time sampling distribution. We further estimate the importance weight for each instance, which is used as the weight of the loss function in CVR prediction. To demonstrate the effectiveness of ES-DFM, we conduct extensive experiments on a public dataset and a private industrial dataset. Experimental results confirm that our method consistently outperforms the previous state-of-the-art results.
Introduction
Digital display advertising has become the main business model for many online services, in which advertisers pay for placing ads on those platforms. Among the available payment options, paying per conversion (CPA) is usually the dominant mechanism, as conversions can directly bring profits. In the CPA model, advertisers pay only if users performed certain predefined conversion actions with the advertisement. To effectively display ads, machine learning models have been widely adopted to forecast the conversion rate (CVR), which is widely investigated in both academia and industry (Lee et al. 2012; Chapelle, Manavoglu, and Rosales 2014; Ma et al. 2018).

In order to capture the dynamic change of user needs, commercial systems often update learned models with up-to-date data within a short time, i.e., in an online training manner (Jugovac, Jannach, and Karimi 2018; Guo et al. 2019; Ktena et al. 2019). This further complicates CVR prediction, since conversions usually do not happen immediately after user clicks. The delayed feedback issue introduces a dilemma for streaming CVR prediction: on the one hand, we need to wait for a sufficiently long time so that the observed information can approximately reflect the true conversions; on the other hand, we also tend to update the learned models without much delay for model freshness.

Chapelle (2014) was among the early studies to address the delayed feedback problem. The proposed Delayed Feedback Model (DFM) optimizes CVR as a joint probability over the predicted CVR and the delay time distribution. This joint probability is estimated in the observed time interval, which may be biased from the true conversion distribution. The biased CVR is probably more inaccurate due to the delayed feedback problem in online learning settings.

To achieve unbiased CVR estimation under delayed feedback, recent studies have explored optimizing the expectation of the true conversion distribution via importance sampling (Bishop 2007). Ktena et al. (2019) proposed the Fake Negative Weighted (FNW) approach, in which each arriving instance is first labeled as negative, and then corrected upon its conversion at a later time. Each fake negative instance may have a side effect on the learned model until it is amended. This side effect is amplified if the data distribution frequently changes. For example, at the beginning of a promotion event, user clicks may increase dramatically while most conversions come after a certain time. Such overwhelming fake negatives may harm the predictive model. Instead of blindly labeling each incoming example as a negative instance, Yasui et al. (2020) proposed a Feedback Shift Importance Weighting (FSIW) algorithm, in which the model waits for the real conversion within a certain time interval. However, FSIW does not allow data correction even if a conversion event takes place afterward. We argue that positive examples are important for delayed feedback prediction, as the positive examples are always scarcer than the negative examples. Moreover, FSIW may lack model freshness due to the long waiting time. Therefore, either updating the model in nearly real time (Ktena et al. 2019), or waiting a sufficiently long time for conversions (Yasui et al. 2020), may not be able to address the delayed feedback problem in streaming CVR prediction.

For unbiased CVR estimation in the online setting, we propose to wait for a time interval which is modeled as a distribution. The readily available conversion information allows the model to trade off label correctness and online model freshness, which are achieved in FSIW and FNW, respectively. Due to the introduction of the observed time distribution, delayed positive samples can be better handled than in FNW via importance sampling techniques. Especially in scenarios of promotion events, FNW may fail to obtain an unbiased estimation because the distribution of positive samples within the observed time may be dramatically different from the routine. On the other hand, FSIW is able to guarantee label correctness but lacks model freshness. Furthermore, it is not able to correct an instance label even if the delayed positive instance comes later. The introduction of the time distribution in our proposal helps the model correct the label of an instance by downgrading the weight of negative instances and upgrading the weight of positive instances.

In this work, we propose the Elapsed-Time Sampling Delayed Feedback Model (ES-DFM), which models the relationship between the observed conversion distribution and the true conversion distribution. We optimize the expectation of the true conversion distribution via importance sampling under the elapsed-time sampling distribution. We further estimate the importance weight for each instance, which is used as the weight of the loss function in CVR prediction. To demonstrate the effectiveness of ES-DFM, we conduct extensive experiments on two widely-used datasets: a public ads conversion log provided by Criteo, and a private dataset provided by Taobao. Experimental results confirm that our method consistently outperforms the previous state-of-the-art results in most cases. Our main contributions can be summarized as follows:

• To the best of our knowledge, we are the first to study the trade-off between waiting for more accurate labels and exploiting fresher training data in the context of streaming CVR prediction.
• By explicitly modeling the elapsed time as a probability distribution, we achieve an unbiased estimation of the true conversion distribution. In particular, our model is shown to be robust even if the data distribution is different from the routine.
• We provide a set of rigorous experimental setups for streaming training and evaluation, which better aligns with industrial systems and can be easily applied to real-world applications.

Related Work
Delayed Feedback Models
The most cited work addressing the delayed feedback problem came from Chapelle (2014), in which the author stated that such a problem is related to survival time analysis (Kalbfleisch and Prentice 2002). The Delayed Feedback Model (DFM) assumed an exponential delay for the conversion time distribution and, based on that, proposed two models: one focusing on CVR prediction and the other on conversion delay prediction. Built on top of the DFM model, Yoshikawa and Imai (2018) further proposed a nonparametric delayed feedback model (NoDeF), in which the delay time was modeled without any parametric assumptions. One significant drawback of the above methods is that both of them only attempt to optimize the observed conversion information rather than the actual delayed conversion.
Importance Sampling
Using samples from one distribution to estimate an expectation with respect to another distribution can be achieved by the importance sampling method. Ktena et al. (2019) proposed the fake negative weighted method (FNW) to optimize the ground-truth CVR prediction objective based on importance sampling. Under the assumption that all samples are initially labeled as negative, the delayed feedback problem can be resolved by FNW in expectation. However, in the streaming setting, every fake negative will affect the model negatively until its corresponding positive duplicate arrives. This negative effect can be amplified drastically under distribution change. Yasui et al. (2020) proposed a feedback shift importance weighting method (FSIW), where the importance weight is estimated with the aid of waiting time information. However, FSIW does not allow duplicated samples, and thus cannot correct the mislabeled samples using the subsequent positive labels by inserting duplicates.
Delayed Bandits
Delayed feedback in bandit algorithms has also been studied (Joulani, György, and Szepesvári 2013; Mandel et al. 2015; Cesa-Bianchi, Gentile, and Mansour 2019). The aforementioned approaches often provide efficient and provably optimal algorithms for delayed feedback scenarios. However, such methods naturally wait until enough feedback has been received before actually learning anything, which may be quite unsuitable in a non-stationary environment. Vernade, György, and Mann (2020) defined a new stochastic bandit model and addressed the real-world modelling issues arising when dealing with non-stationary environments and delayed feedback. However, the objective in the bandit problem is to sequentially make decisions in order to minimize the cumulative regret, whereas our goal is to predict the CVR in order to derive a bid price in ad auctions.
Background
In this work, we focus on the CVR prediction task, which takes the user features $x_u$ and the item features $x_i$ as inputs (all features are denoted by $x$), and aims to learn the probability that the user converts on the item. $y \in \{0, 1\}$ indicates the conversion label, where $y = 1$ means a conversion and $y = 0$ otherwise. Ideally, the CVR model is trained on top of training data $(x, y)$ drawn from the ground-truth data distribution $p(x, y)$, thereby optimizing the ideal loss shown as follows:

$$\mathcal{L}_{\mathrm{ideal}} = \mathbb{E}_{(x,y)\sim p(x,y)}\,\ell(y, f_\theta(x)) \quad (1)$$

where $f$ is the CVR model function and $\theta$ is its parameter. $\ell$ is the classification loss, and the widely-used cross entropy is adopted. However, due to the delayed feedback problem, the observed distribution of the training data $q(x, y)$ often deviates from the ground-truth distribution $p(x, y)$. Therefore, the ideal loss $\mathcal{L}_{\mathrm{ideal}}$ is unavailable.

Figure 1: An illustration of the different kinds of time information for the delayed feedback task.

To formulate such a delayed feedback setting more precisely, we introduce three time points and the corresponding time intervals in Figure 1. These three time points are the click time $c_t$ when a user clicks an item, the conversion time $v_t$ when a conversion action happens, and the observation time $o_t$ when we extract the training samples. The time interval between $c_t$ and $o_t$ is denoted as the elapsed time $e$, and the time interval between $c_t$ and $v_t$ is denoted as the delayed feedback time $h$. Therefore, a sample is labeled as $y = 1$ (positive) in the training data when $e > h$; otherwise, some positive samples are mislabeled as $y = 0$ (fake negative) when $e < h$.
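For concreteness, the following minimal Python sketch shows how the observed label relates to these time points; the field names and the assumption that times are plain floats are ours, not the paper's.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Click:
    click_time: float                         # c_t
    conversion_time: Optional[float] = None   # v_t, None if no conversion (yet)

def observed_label(sample: Click, observe_time: float) -> int:
    """Label assigned when the sample is extracted at observation time o_t.

    e = o_t - c_t is the elapsed time, h = v_t - c_t the delayed feedback time.
    A conversion is only visible if it happened before observation (h <= e);
    otherwise the sample becomes a fake negative.
    """
    if sample.conversion_time is None:
        return 0                                         # no conversion seen so far
    elapsed = observe_time - sample.click_time           # e
    delay = sample.conversion_time - sample.click_time   # h
    return 1 if delay <= elapsed else 0                  # fake negative when h > e
```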
Proposed Method

In order to realize flexible control of the waiting time, we assume the elapsed time is drawn from an elapsed time distribution $p(e \mid x)$. We then develop a probabilistic model that combines the elapsed time distribution $p(e \mid x)$, the delay time distribution $p(h \mid x, y=1)$ and the conversion rate $p(y=1 \mid x)$ into a unified framework. To achieve an unbiased estimation of the actual CVR prediction objective, we propose an importance weighting method corresponding to our elapsed-time sampling method. We then provide a practical estimation of the importance weights, and give an analysis of the bias introduced by this estimation, which can guide us in designing an appropriate elapsed time distribution $p(e \mid x)$.

Elapsed-Time Sampling Delayed Feedback Model
To strike a balance between obtaining accurate feedback information and keeping the model fresh, a reasonable waiting time (elapsed time) should be integrated into the modeling process. Moreover, the elapsed time $e$ should follow a distribution that depends on $x$, i.e., $p(e \mid x)$. For example, users need more time to consider when buying high-priced products, so a longer waiting time is required. When a click $x_i$ arrives, an elapsed time $e_i$ is drawn from $p(e \mid x_i)$. We then wait for the sample $x_i$ for the time interval $e_i$ before assigning a label, and subsequently train on the data. By introducing this time distribution, we propose our Elapsed-Time Sampling Delayed Feedback Model (ES-DFM), which models the relationship between the observed conversion distribution $q(y \mid x)$ and the true conversion distribution $p(y \mid x)$ according to:

$$q(y=0 \mid x) = p(y=0 \mid x) + p(y=1 \mid x)\, p(h > e \mid x, y=1) \quad (2)$$
$$q(y=1 \mid x) = p(y=1 \mid x)\, p(h \le e \mid x, y=1) \quad (3)$$

where

$$p(h > e \mid x, y=1) = \int_0^\infty \left[ p(e \mid x) \int_e^\infty p(h \mid x, y=1)\, dh \right] de \quad (4)$$
$$p(h \le e \mid x, y=1) = \int_0^\infty \left[ p(e \mid x) \int_0^e p(h \mid x, y=1)\, dh \right] de \quad (5)$$

At the time of model training, some conversions that will eventually occur have not yet been observed, and previous methods like DFM and FSIW ignore these conversions. We argue that this matters for the delayed feedback task, as the positive examples are far scarcer than the negative examples, and the positives may define the direction of model optimization. Therefore, in this work, as soon as the user converts on the ad, the data will be sent (duplicated if there is already a fake negative) to the model with a positive label. Then, $q(y \mid x)$ should be re-normalized as follows:

$$q(y=0) = \frac{p(y=0) + p(y=1)\, p(h > e \mid y=1)}{1 + p(y=1)\, p(h > e \mid y=1)} \quad (6)$$
$$q(y=1) = \frac{p(y=1)}{1 + p(y=1)\, p(h > e \mid y=1)} \quad (7)$$

where the condition on $x$ is omitted for conciseness, i.e., $q(y=0) = q(y=0 \mid x)$, $p(y=0) = p(y=0 \mid x)$, etc. Since we have inserted delayed positives, the total number of samples increases by $p(y=1)\, p(h > e \mid y=1)$, so we should normalize by dividing by $1 + p(y=1)\, p(h > e \mid y=1)$. The number of negatives does not change, so dividing Eq. (2) by this normalizing factor yields Eq. (6). The number of positives increases by $p(y=1)\, p(h > e \mid y=1)$, so the numerator of $q(y=1)$ is $p(y=1)\, p(h \le e \mid y=1) + p(y=1)\, p(h > e \mid y=1)$. Using the fact that $p(h \le e \mid y=1) + p(h > e \mid y=1) = 1$ yields Eq. (7).
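To make Eqs. (6)-(7) concrete, the small numeric sanity check below (our own sketch, with an illustrative function name and example numbers) verifies that the re-normalized observed distribution still sums to one.

```python
def observed_distribution(p_pos: float, p_delay: float):
    """Observed label distribution q(y|x) under elapsed-time sampling with
    delayed positives re-inserted, following Eqs. (6)-(7).

    p_pos   = p(y=1|x), the true conversion rate.
    p_delay = p(h > e | x, y=1), the chance the conversion is not yet observed.
    """
    z = 1.0 + p_pos * p_delay                        # total mass after inserting duplicates
    q_neg = (1.0 - p_pos + p_pos * p_delay) / z      # Eq. (6)
    q_pos = p_pos / z                                # Eq. (7)
    return q_neg, q_pos

# Example: 10% true CVR, 40% of conversions arrive after the elapsed time.
q_neg, q_pos = observed_distribution(0.10, 0.40)
assert abs(q_neg + q_pos - 1.0) < 1e-12              # q is a proper distribution
```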
Importance Weight of ES-DFM

To obtain an unbiased CVR estimation under the delayed feedback problem, we optimize the expectation of $p(y \mid x)$ via importance sampling (Bishop 2007). First, we provide the theoretical background of importance sampling, as follows:

$$\mathcal{L}_{\mathrm{ideal}} = \mathbb{E}_{(x,y)\sim p(x,y)}\,\ell(y, f_\theta(x)) \quad (8)$$
$$= \int p(x)\, dx \int p(y \mid x)\, \ell(y, f_\theta(x))\, dy \quad (9)$$
$$= \int p(x)\, dx \int q(y \mid x)\, \frac{p(y \mid x)}{q(y \mid x)}\, \ell(y, f_\theta(x))\, dy \quad (10)$$
$$\approx \mathbb{E}_{(x,y)\sim q(x,y)}\, \frac{p(y \mid x)}{q(y \mid x)}\, \ell(y, f_\theta(x)) \quad (11)$$
$$= \mathcal{L}_{\mathrm{iw}} \quad (12)$$

where $f$ is the CVR model function, $\theta$ is the parameter, and $\ell$ is the classification loss, for which the widely-used cross entropy is adopted. Notice that we assume $p(x) \approx q(x)$ to obtain (11) from (10), which is reasonable since the proportion of delayed positives is small; this approximation is also used by Ktena et al. (2019). According to (11), we can optimize the ideal objective with an appropriate weight $w(x, y) = \frac{p(y \mid x)}{q(y \mid x)}$. Second, we further provide the importance weight under the proposed elapsed-time sampling distribution. From Equations (6) and (7), we can obtain:

$$\frac{p(y=0 \mid x)}{q(y=0 \mid x)} = [1 + p_{dp}(x)]\, p_{rn}(x) \quad (13)$$
$$\frac{p(y=1 \mid x)}{q(y=1 \mid x)} = 1 + p_{dp}(x) \quad (14)$$

where

$$p_{dp} = p(y=1)\, p(h > e) \quad (15)$$
$$p_{rn} = \frac{p(y=0)}{p(y=0) + p(y=1)\, p(h > e)} \quad (16)$$

$p_{dp}(x)$ is the delayed positive probability, denoting the probability that a sample is a duplicated positive; $p_{rn}(x)$ is the real negative probability, denoting the probability that an observed negative is a ground-truth negative and will not convert.

Finally, combining Eq. (8) to Eq. (14), the importance-weighted CVR loss function is:

$$\mathcal{L}_{\mathrm{niw}} = -\sum_{(x_i, y_i) \in \tilde{D}} \Big\{ y_i\, [1 + p_{dp}(x_i)]\, \log f_\theta(x_i) + (1 - y_i)\, [1 + p_{dp}(x_i)]\, p_{rn}(x_i)\, \log\big(1 - f_\theta(x_i)\big) \Big\} \quad (17)$$

where $\tilde{D}$ is the training data drawn from the elapsed-time sampling distribution $q(x, y)$.
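The weighted loss of Eq. (17) can be expressed in a few lines of NumPy. The sketch below is our own illustration and assumes the weights $p_{dp}(x)$ and $p_{rn}(x)$ are already available; in ES-DFM they come from the auxiliary models described in the next subsection.

```python
import numpy as np

def es_dfm_loss(y, f, p_dp, p_rn, eps=1e-7):
    """Importance-weighted cross entropy of Eq. (17).

    y    : observed labels in the elapsed-time-sampled stream (0/1, duplicates allowed)
    f    : predicted CVR f_theta(x)
    p_dp : estimated delayed-positive probability p_dp(x)
    p_rn : estimated real-negative probability p_rn(x)
    """
    y, f, p_dp, p_rn = map(np.asarray, (y, f, p_dp, p_rn))
    f = np.clip(f, eps, 1.0 - eps)                       # numerical stability
    pos_term = y * (1.0 + p_dp) * np.log(f)
    neg_term = (1.0 - y) * (1.0 + p_dp) * p_rn * np.log(1.0 - f)
    return -np.sum(pos_term + neg_term)
```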
Estimation of Importance Weight (IW)

The challenge of resolving the delayed feedback problem through importance sampling is that we need to estimate the importance weights $w(x, y)$. In this work, we decompose $w(x, y)$ into two parts, $p_{dp}(x)$ and $p_{rn}(x)$, according to Eq. (13) and Eq. (14). More precisely, we estimate these two probabilities with two binary classifiers. Namely, we train a classifier $f_{dp}$ to predict the probability of being a delayed positive (Eq. (15)), and a classifier $f_{rn}$ to predict the probability of being a real negative (Eq. (16)). The model architecture of $f_{dp}(x)$ and $f_{rn}(x)$ is the same as that of the CVR prediction model. To construct the training dataset, for each sample $(x_i, y_i)$, an elapsed time $e$ is drawn from $p(e \mid x_i)$. Then, for the $f_{dp}$ model, the delayed positives are labeled as 1 and the others are labeled as 0. For the $f_{rn}$ model, the observed positives are excluded; then the negatives are labeled as 1 and the delayed positives are labeled as 0. In practice, all these labels are available in a delayed data stream (for example, delayed by 30 days to ensure label correctness), and the data selection can be achieved by a mask on the loss function, so we train the $f_{rn}$ and $f_{dp}$ models jointly with a shared network in streaming training.

Importance sampling methods usually suffer from high variance due to the division of two probabilities. Our method is less likely to introduce a large variance. The key lies in how the importance weight $\frac{p(y \mid x)}{q(y \mid x)}$ is calculated. The high variance of importance sampling is mainly introduced by the large value of $\frac{p(y \mid x)}{q(y \mid x)}$ when $q(y \mid x) \ll p(y \mid x)$ at some $(x, y)$. However, we estimate the importance weight using the delayed positive probability $p_{dp}$ and the real negative probability $p_{rn}$ (in Eq. (17)), and these two values are probabilities bounded within $[0, 1]$.
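A minimal sketch of this label construction (our own illustration, with hypothetical array names) is given below; in the paper the same selection is realized as a mask on the loss so that $f_{dp}$ and $f_{rn}$ share one network in streaming training.

```python
import numpy as np

def importance_weight_labels(observed_pos, eventually_pos):
    """Construct labels and loss masks for the two auxiliary classifiers.

    observed_pos   : 1 if the conversion was observed within the sampled elapsed time e
    eventually_pos : 1 if a conversion appears in a long delayed log (e.g., 30 days)

    f_dp predicts whether a sample is a delayed positive (converts, but after e).
    f_rn predicts whether an observed negative is a real negative; observed
    positives are excluded from its training data via the mask.
    """
    observed_pos = np.asarray(observed_pos)
    eventually_pos = np.asarray(eventually_pos)
    delayed_pos = (eventually_pos == 1) & (observed_pos == 0)

    dp_label = delayed_pos.astype(np.float32)            # delayed positives -> 1, others -> 0
    dp_mask = np.ones_like(dp_label)                      # f_dp is trained on every sample

    rn_label = (eventually_pos == 0).astype(np.float32)   # real negatives -> 1, delayed positives -> 0
    rn_mask = (observed_pos == 0).astype(np.float32)       # observed positives are excluded
    return dp_label, dp_mask, rn_label, rn_mask
```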
Bias Analysis of Estimated IW

The importance-weighted loss function Eq. (17) is unbiased when using the ideal $p_{dp}$ and $p_{rn}$. However, a bias may be introduced by the estimated importance weights $f_{dp}$ and $f_{rn}$. When optimizing the loss function Eq. (17) with the estimated $f_{dp}$, $f_{rn}$ instead of the ideal $p_{dp}$, $p_{rn}$, the predicted probability $f(x)$ converges to:

$$f(x) = \frac{p(y=1 \mid x)}{p(y=1 \mid x) + p_{neg}(x)\, f_{rn}(x)} \quad (18)$$
$$p_{neg}(x) = p(y=0 \mid x) + p(y=1 \mid x)\, p(h > e \mid x) \quad (19)$$

Proof sketch.
Take the partial derivative of Eq. (17) with respect to $f$, and set the derivative to zero. A detailed proof is given in the supplementary material (https://github.com/ThyrixYang/es_dfm/blob/master/aaai21_sup.pdf).

From Eq. (18) and Eq. (19), we can draw the following observations, which can guide us in designing an appropriate elapsed-time sampling distribution $p(e \mid x)$:

• First, if $f_{rn}$ is perfectly correct, we have $f_{rn} = p_{rn}$, and then $f(x) = p(y=1 \mid x)$, thus leading to no bias. However, in practice, $f_{rn}$ is learned from historical data, so a bias always exists.
• Second, the bias is also related to $p(y=1 \mid x)$ according to Eq. (18) and Eq. (19). Therefore, if the absolute value of the conversion rate is large, the bias introduced by $f_{rn}$ may be larger.
• Last, the sampling distribution $p(e \mid x)$ can be used to control the bias. If $e$ is long, $p(h > e)$ will be smaller. Thus $p(y=0) + p(y=1)\, p(h > e)$ will be close to $p(y=0 \mid x)$, and $f_{rn}$ will be closer to 1 since there are few fake negatives. Therefore $p_{neg}(x)\, f_{rn}(x)$ will be closer to $p(y=0 \mid x)$.

Therefore, we can control the waiting time (elapsed time) distribution $p(e \mid x)$ to reduce bias, which is the core of realizing the aforementioned trade-off and is the missing part of existing methods.

Experiments

The code for reproducing our results on the public dataset is available at https://github.com/ThyrixYang/es_dfm. To evaluate the proposed model, we conduct a set of experiments to answer the following research questions:
RQ1
How does ES-DFM perform, compared to the state-of-the-art models for the streaming CVR prediction task?
RQ2
How do different choices of elapsed time affect the performance? What is the best elapsed time for the dataset?
RQ3
How do mislabeled samples affect importance weighting methods in streaming training?
RQ4
How does ES-DFM perform in online recommender systems?
Datasets
Public Dataset
We use the Criteo dataset used in Chapelle (2014) to evaluate the proposed method; it is publicly available at https://labs.criteo.com/2013/12/conversion-logs-dataset/. This dataset is formed from Criteo live traffic data over a period of 60 days and records conversions occurring after a click. Each sample is described by a set of hashed categorical features and a few continuous features. It also includes the timestamps of the clicks and those of the conversions, if any. The statistics of the Criteo dataset are shown in Table 1.
Taobao Dataset

We collect … samples over a period of 14 days from the daily click and conversion logs in the Taobao system, which consist of the user and item features with the labels (i.e., click or conversion) for the CVR task. The feature set of an item contains several categorical features and continuous features. The statistics of the Taobao dataset are shown in Table 1.
Dataset Preprocessing

We divide both the public and the anonymized datasets into two parts evenly. We use the first part for model pre-training to obtain a well-initialized CVR prediction model. We use the second part for streaming data simulation to evaluate the compared methods.
Evaluation Metrics
We adopt three widely-used metrics for the CVR prediction task (Ni et al. 2018; Zhou et al. 2019; Ktena et al. 2019; Yasui et al. 2020), which show a model's performance from different perspectives. The first metric is the area under the ROC curve (AUC), which assesses the pairwise ranking performance of the classification results between the conversion and non-conversion samples. The second metric is the area under the precision-recall curve (PR-AUC), which is more sensitive than AUC on skewed data such as the CVR prediction task (Yasui et al. 2020). The last metric is the negative log likelihood (NLL), which is sensitive to the absolute value of the predicted CVR (Chapelle 2014). In a CPA model, the predicted probabilities are important since they are directly used to compute the value of an impression.
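A possible per-hour implementation of these metrics with scikit-learn is sketched below; the paper does not specify the exact implementation, and average precision is used here as a common estimate of PR-AUC.

```python
from sklearn.metrics import roc_auc_score, average_precision_score, log_loss

def cvr_metrics(y_true, y_pred):
    """Evaluation metrics computed on the original (uncorrected) labels."""
    return {
        "AUC": roc_auc_score(y_true, y_pred),
        # average precision is used here as the PR-AUC estimate
        "PR-AUC": average_precision_score(y_true, y_pred),
        "NLL": log_loss(y_true, y_pred),
    }
```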
Streaming Experimental Protocol

We design an experimental evaluation method for streaming CVR prediction, which can fully verify the performance of different methods in the online learning setting. In this work, we divide the streaming dataset into multiple datasets according to the click timestamp, each of which contains one hour of data. Following the online training manner of industrial systems, the model is trained on the data of hour $t$ and tested on the data of hour $t+1$, then trained on the data of hour $t+1$ and tested on the data of hour $t+2$, and so on. Note that the training data is re-constructed with fake negatives, while the evaluation data is the original data. Therefore, we report the weighted metrics over the evaluation datasets of different hours to verify the overall performance of different methods on streaming data.
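The protocol can be summarized by the following sketch; the helper names (build_training_stream, evaluate) and the model interface are hypothetical placeholders, not the authors' code.

```python
def streaming_evaluation(hourly_data, model, build_training_stream, evaluate):
    """Streaming protocol: train on hour t, then test on hour t+1.

    hourly_data           : list of per-hour datasets ordered by click timestamp
    build_training_stream : rewrites an hour of data with elapsed-time sampling
                            (fake negatives plus inserted delayed positives)
    evaluate              : computes AUC / PR-AUC / NLL on the original labels
    """
    results = []
    for t in range(len(hourly_data) - 1):
        train_hour = build_training_stream(hourly_data[t])   # corrected stream
        model.fit(train_hour)                                 # online update
        results.append(evaluate(model, hourly_data[t + 1]))   # untouched labels
    return results                                            # aggregate (weighted) afterwards
```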
Compared Methods

We compare our method with the following state-of-the-art methods:

– Pre-trained: A CVR model without any finetuning.
– Vanilla Finetune Model: A model finetuned on top of the pre-trained model using the streaming data, which is the baseline method.
– Delayed Feedback Model (DFM) (Chapelle 2014): A model finetuned on top of the pre-trained model using the delayed feedback loss.
– Fake Negative Weighted (FNW) (Ktena et al. 2019): A model finetuned on top of the pre-trained model using the fake negative weighted loss.
– Fake Negative Calibration (FNC) (Ktena et al. 2019): A model finetuned on top of the pre-trained model using the fake negative calibration loss.
– Feedback Shift Importance Weighting (FSIW) (Yasui et al. 2020): The pre-trained model is finetuned using the FSIW loss and pre-trained auxiliary models.
– Elapsed-Time Sampling Delayed Feedback Model (ES-DFM): Our proposed method, which tries to keep the model fresh while introducing low bias.

We also report the performance of an Oracle* model: a model finetuned using the ground-truth labels instead of the observed labels, assuming the conversion label is known at click time. This is the upper bound of possible improvements, where the delayed feedback problem does not exist. The asterisk (*) denotes that it is not a baseline method.
Parameter Settings
For a fair comparison, all hyper-parameters are tuned carefully for all compared methods. The feature engineering of the numerical features and the categorical features follows the settings of Chapelle (2014). Since we mainly discuss the delayed feedback issue in this paper, the model architecture is a simple MLP with the hidden units fixed for all models to [256, …]. The activation functions are Leaky ReLU and every hidden layer is followed by a BatchNorm layer (Ioffe and Szegedy 2015) to accelerate learning. Adam (Kingma and Ba 2015) is used as the optimizer with a learning rate of … and an L2 regularization strength of …. We describe the detailed settings of the compared methods in the supplementary material due to the page limit.

Choice of p(e|x)

The sampling elapsed time distribution $p(e \mid x)$ can be designed based on expert knowledge and the aforementioned bias analysis. For example, users need more time to consider when buying high-priced products, so a longer waiting time is required. However, the public dataset is anonymized, where information like price level is unavailable. To verify the effectiveness of introducing $p(e \mid x)$ in the streaming setting, we adopt a simplified implementation of $p(e \mid x)$. More precisely, we set $p(e = c \mid x) = 1$ where $c$ is a constant, which means $p(e \mid x)$ degenerates to a Dirac distribution. This brings two advantages. First, we can strike the balance between obtaining accurate feedback information and keeping the model fresh with a single parameter $c$. Second, we conducted experiments with different $c$ on the public dataset, and the experimental results show that choosing the best $c$ can significantly improve performance. The $c$ is also tuned on the private dataset, and we report the best result, which is achieved using $c = 1$.

Table 1: Statistics of the Criteo and Taobao datasets.

Table 2: Performance comparisons of the proposed model with baseline models on the AUC, PR-AUC and NLL metrics. The bold value marks the best one in each column, while the underlined value corresponds to the best one among all baselines. Here, * indicates a statistically significant improvement compared to the best baseline measured by a t-test at a p-value of 0.05. R-AUC, R-PR-AUC and R-NLL are relative metrics indicating the improvements within the delayed feedback gap.

Criteo Dataset:
Method       AUC      PR-AUC   NLL      R-AUC     R-PR-AUC   R-NLL
Pre-trained  0.8307   0.6251   0.4009   -0.9212   -0.2058    0.2139
Vanilla      0.8376   0.6288   0.4047    0.0000    0.0000    0.0000
Oracle*      0.8450   0.6469   0.3868    1.0000    1.0000    1.0000
DFM          0.8132   0.5784   1.2599   -3.2581   -2.7833   -47.645
FSIW         0.8290   0.6189   0.4099   -1.1432   -0.5479   -0.2891
FNC          0.8373   0.6267   0.4382   -0.0393   -0.1147   -1.8646
FNW          0.8373   0.6313   0.4033   -0.0308    0.1400    0.0773
ES-DFM       0.8402*  0.6393*  0.3924*   0.3560    0.5799    0.6831

Taobao Dataset:
Method       AUC      PR-AUC   NLL      R-AUC     R-PR-AUC   R-NLL
Pre-trained  0.8731   0.6525   0.1156   -1.0374   -0.5217   -0.2419
Vanilla      0.8842   0.6645   0.1141    0.0000    0.0000    0.0000
Oracle*      0.8949   0.6875   0.1079    1.0000    1.0000    1.0000
DFM          0.8702   0.6471   0.1271   -1.3084   -0.7565   -2.0968
FSIW         0.8735   0.6591   0.1149   -0.9971   -0.2348   -0.1290
FNC          0.8851   0.6669   0.1142    0.0841    0.1043   -0.0161
FNW          0.8845   0.6672   0.1137    0.0280    0.1174    0.0645
ES-DFM       0.8895*  0.6762*  0.1112*   0.4953    0.5087    0.4677

Standard Streaming Experiments: RQ1
From Table 2, we can see that our proposed method improves the performance significantly against all the baselines and achieves state-of-the-art performance. Moreover, some further observations can be made. First, the performance of DFM and FSIW is worse than the vanilla baseline on both the public and the Taobao dataset. This is because DFM is difficult to converge, thus failing to achieve a good performance in streaming CVR prediction, and FSIW does not allow data correction once a conversion takes place afterwards, which is important for delayed feedback. Second, in most cases, FNC and FNW perform better than the vanilla baseline. Specifically, FNW outperforms the baseline in both PR-AUC and NLL, which is consistent with the results reported in Ktena et al. (2019). Third, existing methods show little superior performance in terms of AUC, while our method outperforms the best baseline by 0.26% and 0.44% AUC on the Criteo and Taobao datasets, respectively. As reported in Zhou et al. (2018), DIN improves the AUC score by 1.13% and the improvement of online CTR is 10.0%, which means a small improvement in offline AUC is likely to lead to a significant increase in online CTR. In our practice, for cutting-edge CVR prediction models, even a 0.1% AUC improvement is substantial and achieves significant online promotion.

We further analyze the maximum benefit that can be achieved by resolving the delayed feedback problem. The maximum benefit is defined as the performance gap between the oracle model and the baseline. Therefore, the goal of any method tackling the delayed feedback problem is to narrow this gap. We report three relative metrics within the performance gap, i.e., Relative-AUC (R-AUC), Relative-PR-AUC (R-PR-AUC) and Relative-NLL (R-NLL). As shown in Table 2, our method narrows the delayed feedback gap significantly compared to other methods, and the absolute improvement is larger when the delayed feedback gap is larger.

Figure 2: Experiments on the effect of elapsed time on performance. We control the elapsed time by a parameter c, which is the value on the x axis.
Figure 3: The experiment on resistance to disturbance. The x axis is the disturbance strength, which controls the portion of positive samples to be flipped.

Influence of Elapsed Time: RQ2
To verify the performance of different choices of elapsed time, we conducted experiments using different values of $c$ on the Criteo dataset. As shown in Figure 2, the best $c$ on the Criteo dataset is around 15 minutes, where about 35% of conversions can be observed. Moreover, a larger or smaller $c$ reduces the performance. The performance decreases slowly for smaller $c$, which indicates that the bias introduced by the importance weighting model is small. The performance decreases faster for larger $c$, which indicates that data freshness matters more as $c$ increases, and a $c$ larger than 1 hour will significantly harm the performance.

Experiment on Robustness: RQ3
In the delayed feedback setting, the same sample may be labeled as negative or positive. This is closely related to learning with noisy labels (Natarajan et al. 2013), where some of the labels are randomly flipped. We hypothesize that a method dealing with the delayed feedback problem should not only correct incorrect labels, but also reduce the negative effect of the incorrect labels before they can be corrected or when the correction fails (for example, if the weighting model deviates a lot, the bias will be large and the correction will fail). Thus we conducted a robustness experiment. We randomly select a portion $d$ of all the positive samples in the streaming dataset, then swap their labels (and click time and pay time) with randomly selected negative ones. Note that we do not disturb the pre-training dataset, so the initial CVR model and the pre-trained importance weighting models are not disturbed. We conducted experiments with different disturbance strengths $d$; the results are shown in Figure 3. We can see that our method is more resistant to the disturbance than FNW and FSIW, and the performance gap is larger when the disturbance increases (especially on NLL). We give an intuitive analysis of the weak robustness of FNW and FSIW in the supplementary material.
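A sketch of this disturbance procedure is given below, assuming the streaming data is held in a pandas DataFrame with hypothetical column names (label, click_time, pay_time); it is our own illustration, not the authors' code.

```python
import numpy as np
import pandas as pd

def disturb_stream(df: pd.DataFrame, d: float, seed: int = 0) -> pd.DataFrame:
    """Label-flip disturbance used in the robustness experiment (RQ3).

    A fraction d of the positive samples in the streaming data swaps its label
    (together with click time and pay time) with randomly chosen negatives.
    The pre-training data is left untouched.
    """
    rng = np.random.default_rng(seed)
    df = df.copy()
    pos_idx = df.index[df["label"] == 1].to_numpy()
    neg_idx = df.index[df["label"] == 0].to_numpy()
    n_swap = int(d * len(pos_idx))
    pos_sel = rng.choice(pos_idx, size=n_swap, replace=False)
    neg_sel = rng.choice(neg_idx, size=n_swap, replace=False)
    cols = ["label", "click_time", "pay_time"]
    pos_vals = df.loc[pos_sel, cols].to_numpy()
    df.loc[pos_sel, cols] = df.loc[neg_sel, cols].to_numpy()   # swap the two groups
    df.loc[neg_sel, cols] = pos_vals
    return df
```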
Online Evaluation: RQ4

We conducted an A/B test in our online evaluation framework. We observed a steady performance improvement: within a 7-day window, AUC increases by 0.3% compared with the best baseline, CVR increases by 0.7%, and GMV (Gross Merchandise Volume) increases by 1.8%, where GMV is computed as the number of transacted items multiplied by the price of each item. The online A/B testing results align with our offline streaming evaluation and show the effectiveness of ES-DFM in industrial systems.
Conclusion
The trade-off between label accuracy and model freshness in the streaming training setting has not been considered in previous work; it is an active decision of the method rather than a passive feature of the offline setting. In this paper, we propose an elapsed-time distribution to balance label accuracy and model freshness in order to address the delayed feedback problem in streaming CVR prediction. We optimize the expectation of the true conversion distribution via importance sampling under the elapsed-time sampling distribution. Moreover, we propose a rigorous streaming training and testing experimental protocol, which aligns better with real industrial applications. Finally, extensive experiments show the superiority of our approach.

References
Bishop, C. M. 2007. Pattern Recognition and Machine Learning. Springer.
Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2019. Delay and cooperation in nonstochastic bandits. The Journal of Machine Learning Research.
Chapelle, O. 2014. Modeling delayed feedback in display advertising. In KDD, 1097–1105. ACM.
Chapelle, O.; Manavoglu, E.; and Rosales, R. 2014. Simple and scalable response prediction for display advertising. TIST.
Guo, L.; Yin, H.; Wang, Q.; Chen, T.; Zhou, A.; and Hung, N. Q. V. 2019. Streaming Session-based Recommendation. In KDD, 1569–1577.
Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 448–456. PMLR.
Joulani, P.; György, A.; and Szepesvári, C. 2013. Online Learning under Delayed Feedback. In ICML, 1453–1461. JMLR.org.
Jugovac, M.; Jannach, D.; and Karimi, M. 2018. Streamingrec: a framework for benchmarking stream-based news recommenders. In RecSys, 269–273.
Kalbfleisch, J. D.; and Prentice, R. L. 2002. The Statistical Analysis of Failure Time Data.
Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
Ktena, S. I.; Tejani, A.; Theis, L.; Myana, P. K.; Dilipkumar, D.; Huszár, F.; Yoo, S.; and Shi, W. 2019. Addressing delayed feedback for continuous training with neural networks in CTR prediction. In RecSys, 187–195. ACM.
Lee, K.-c.; Orten, B.; Dasdan, A.; and Li, W. 2012. Estimating conversion rate in display advertising from past performance data. In KDD, 768–776.
Ma, X.; Zhao, L.; Huang, G.; Wang, Z.; Hu, Z.; Zhu, X.; and Gai, K. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In SIGIR, 1137–1140.
Mandel, T.; Liu, Y.-E.; Brunskill, E.; and Popovic, Z. 2015. The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits. In AAAI, 2849–2856.
Natarajan, N.; Dhillon, I. S.; Ravikumar, P.; and Tewari, A. 2013. Learning with Noisy Labels. In NIPS.
Ni, Y.; Ou, D.; Liu, S.; Li, X.; Ou, W.; Zeng, A.; and Si, L. 2018. Perceive Your Users in Depth: Learning Universal User Representations from Multiple E-commerce Tasks. In KDD, 596–605.
Vernade, C.; György, A.; and Mann, T. A. 2020. Non-Stationary Delayed Bandits with Intermediate Observations. In ICML. PMLR.
Yasui, S.; Morishita, G.; Fujita, K.; and Shibata, M. 2020. A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback. In WWW '20: The Web Conference 2020, 2740–2746. ACM / IW3C2.
Yoshikawa, Y.; and Imai, Y. 2018. A Nonparametric Delayed Feedback Model for Conversion Rate Prediction.
Zhou, G.; Mou, N.; Fan, Y.; Pi, Q.; Bian, W.; Zhou, C.; Zhu, X.; and Gai, K. 2019. Deep interest evolution network for click-through rate prediction. In AAAI, volume 33, 5941–5948.
Zhou, G.; Zhu, X.; Song, C.; Fan, Y.; Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; and Gai, K. 2018. Deep interest network for click-through rate prediction. In KDD.