A Practical Incremental Method to Train Deep CTR Models
Yichao Wang, Huifeng Guo, Ruiming Tang, Zhirong Liu, Xiuqiang He
Noah’s Ark Lab, Huawei, China. {wangyichao5,huifeng.guo,tangruiming,liuzhirong,hexiuqiang1}@huawei.com
ABSTRACT
Deep learning models in recommender systems are usually trained in the batch mode, namely iteratively trained on a fixed-size window of training data. Such batch mode training of deep learning models suffers from low training efficiency, which may lead to performance degradation when the model is not produced on time. To tackle this issue, incremental learning is proposed and has received much attention recently. Incremental learning has great potential in recommender systems, as two consecutive windows of training data overlap in most of their volume. It aims to update the model incrementally with only the samples that have arrived since the model was last updated, which is much more efficient than batch mode training. However, most incremental learning methods focus on the research area of image recognition, where new tasks or classes are learned over time. In this work, we introduce a practical incremental method to train deep CTR models, which consists of three decoupled modules (namely, the data, feature and model modules). Our method achieves comparable performance to conventional batch mode training with much better training efficiency. We conduct extensive experiments on a public benchmark and a private dataset to demonstrate the effectiveness of our proposed method.
KEYWORDS
Incremental Learning, Deep Learning, Recommender System, Click-Through Rate Prediction
ACM Reference Format:
Yichao Wang, Huifeng Guo, Ruiming Tang, Zhirong Liu, Xiuqiang He. 2019. A Practical Incremental Method to Train Deep CTR Models. In ACM Conference (KDD’20), Aug 22–27, 2020, San Diego, California. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3298689.3347033
1 INTRODUCTION
Internet users can access a huge number of online products and services, and it therefore becomes difficult for users to identify what might interest them. To reduce information overload and to satisfy the diverse needs of users, personalized recommender systems play an important role in modern society. Accurate personalized recommender systems benefit both the demand side and the supply side, including publishers and platforms.
Click-Through Rate (CTR) prediction estimates the probability that a user will click on a recommended item under a specific context. It plays a crucial role in personalized recommender systems, especially in app stores and online advertising. Nowadays, deep learning approaches have attracted more and more attention due to their superior prediction performance and automated feature exploration. Therefore, many industrial companies deploy deep CTR models in their recommender systems, such as Wide & Deep [1] in Google Play, DeepFM [4] and PIN [12] in Huawei AppGallery, and DIN [19] and DIEN [18] in Taobao. However, every coin has two sides. To achieve good performance, deep CTR models with complicated architectures need to be trained on a huge volume of training data for several epochs, and therefore they all suffer from low training efficiency. Such low training efficiency (namely, long training time) may lead to performance degradation when the model is not produced on time. We observe such performance degradation when the model stops updating in app recommendation scenarios in Huawei AppGallery, as presented in Figure 1. For instance, if the model stops updating for 5 days, the model performance degrades by 0.66% in terms of AUC, which would lead to a significant loss of revenue and hurt user experience.

Figure 1: Model performance degrades when the model stops updating for different numbers of days. The x-axis is the gap between the training set and the test set.

Hence, as can be observed, how to improve the training efficiency of deep CTR models without hurting their performance is an essential problem in recommender systems. Distributed learning and incremental learning are two common paradigms that tackle this problem from different perspectives. Distributed learning requires extra computational resources, distributing the training data and the model to multiple nodes to accelerate training. On the other side, incremental learning changes the training procedure from batch mode to incremental mode, which utilizes just the most recent data to update the current model. However, most deep models in industrial recommender systems are trained in the batch mode, where a fixed-size window of training data (usually at a multi-billion scale) is used to train the model iteratively. In this work, we focus on devising an incremental method to train deep CTR models, which aims to improve the training efficiency significantly without degrading the model performance. However, to the best of our knowledge, most existing incremental learning methods concentrate on the image recognition field, where new tasks or classes are learned over time. Incremental learning for deep CTR models faces different circumstances from image recognition, such as incoming new features; therefore, there is a need to look into this topic seriously. In this paper, we propose a practical incremental method,
IncCTR, for deep CTR models. As presented in Figure 2, three decoupled modules are integrated in our framework: the Data Module, the Feature Module and the Model Module. The data module mimics the functionality of a reservoir, constructing training data from both historical data and incoming data. The feature module is designed to handle new features from incoming data and to initialize both existing features and new features wisely. The model module employs knowledge distillation to fine-tune the model parameters, balancing the knowledge learned from the previous model and from the incoming data. More specifically, we look into two different choices for the teacher model.

The main contributions of this work are listed as follows:

• We highlight the necessity of incremental learning in recommender systems through rigorous offline simulations. We propose a practical incremental method, IncCTR, to train deep CTR models.
• IncCTR consists of a data module, a feature module and a model module, which have the functionality of constructing training data, handling new features and fine-tuning model parameters, respectively.
• We conduct extensive experiments on a public benchmark and a private industrial dataset from Huawei AppGallery to demonstrate that IncCTR achieves comparable performance to batch mode training, with significant improvement in training efficiency. Moreover, ablation studies of each module in IncCTR are performed.

The rest of the paper is organized as follows. In Section 2, we introduce some preliminaries for better understanding of our method and application. We elaborate our incremental learning framework IncCTR and its three individual modules in detail in Section 3. In Section 4, results of comparison experiments and ablation studies are reported to verify the effectiveness of our proposed framework. Lastly, we draw conclusions and discuss future work in Section 5.
2 PRELIMINARIES
In this section, we introduce some notation and basic knowledge about deep CTR models. We also present and compare the two training modes (batch mode and incremental mode).
Deep CTR models. Recently, various deep CTR models have been proposed, such as DeepFM [4], Wide & Deep [1], PIN [12], DIN [19], and DIEN [18]. Generally, deep CTR models include three parts: an embedding layer, an interaction layer, and a prediction layer.
Embedding layer. In most CTR prediction tasks, data is collected in a multi-field categorical form [12, 14, 16]. Each data instance is transformed into a high-dimensional sparse (binary) vector via one-hot encoding [5]. For example, the raw data instance (Gender=Male, Height=185, Age=18, Name=Bob) is represented as the concatenation of four one-hot encoded sub-vectors, one for each of the fields Gender=Male, Height=185, Age=18 and Name=Bob, where each sub-vector contains a single 1 at the position of the active feature and 0 elsewhere. An embedding layer is applied to compress the raw features to low-dimensional vectors before feeding them into neural networks. For a univalent field (e.g., "Gender=Male"), its field embedding is the feature embedding; for a multivalent field (e.g., "Interest=Football, Basketball"), the field embedding is the average of the feature embeddings [2]. More formally, in an instance, each field $f_i$ ($1 \le i \le m$) is represented as a low-dimensional vector $e_i \in \mathbb{R}^{1 \times k}$, where $m$ is the number of fields and $k$ is the embedding size. Therefore, each instance can be represented as an embedding matrix $E = (e_1^\top, e_2^\top, \dots, e_m^\top)^\top \in \mathbb{R}^{m \times k}$. Assuming there are $n$ features in total, the embeddings of all the features form an embedding table $\mathbf{E} \in \mathbb{R}^{n \times k}$.

Interaction and prediction layers. The key challenge in CTR prediction is modelling feature interactions. Existing deep CTR models utilize product operations and multi-layer perceptrons (MLPs) to model explicit and implicit feature interactions, respectively. For example, DeepFM [4] adopts a Factorization Machine [13] to model order-2 feature interactions and an MLP to model high-order feature interactions. How to model feature interactions is beyond the scope of this work; readers interested in such techniques may refer to [4, 8, 9, 12, 15]. After the interaction layer, the prediction $\hat{y}$ is generated as the probability that the user will click on a specific item within the given context. Then, the cross-entropy loss is used as the objective function:

$$\mathcal{L}_{CE}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}), \qquad (1)$$

with $y$ as the label.
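To make the notation above concrete, the following is a minimal, illustrative sketch of the embedding lookup and the objective in Equation 1 (a sketch of ours, assuming PyTorch; the sizes, feature ids and the toy predictor are hypothetical and not from the paper):

```python
import torch
import torch.nn as nn

n_features, k, m = 10000, 8, 4      # n features in total, embedding size k, m fields per instance

# Embedding table E ∈ R^{n×k}: one row per feature id.
embedding_table = nn.Embedding(num_embeddings=n_features, embedding_dim=k)

# A batch of two instances, each with m = 4 univalent fields already mapped to feature ids.
feature_ids = torch.tensor([[3, 120, 4051, 9998],
                            [7,  98, 4051,   42]])       # shape (batch, m)

E_instance = embedding_table(feature_ids)                # shape (batch, m, k): the per-instance matrix E

# A trivial stand-in for the interaction and prediction layers (real models use
# FM/product layers and MLPs here, which are beyond the scope of this sketch).
logits = E_instance.sum(dim=(1, 2))                      # shape (batch,)
y_hat = torch.sigmoid(logits)

# Equation (1): cross-entropy between the predicted click probability and the label.
y = torch.tensor([1.0, 0.0])
loss = (-(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))).mean()
```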
Training modes. We now present and compare two different training modes, namely batch mode and incremental mode. In batch mode, the model is iteratively learned from the data in a fixed-size time window; when new data arrives, the time window slides forward. As shown in Figure 3, "Model 0" is trained on data from day 1 to day 10. Then, when new data (day 11) arrives, a new model ("Model 1") is trained on data from day 2 to day 11. Following the same procedure, "Model 2" is trained on data from day 3 to day 12.
In incremental mode, the model is trained based on the existing model and the new data. As shown in Figure 3, "Model 1" is trained based on the existing model "Model 0" (which was trained on data from day 1 to day 10) and the data from day 11. Then "Model 1" becomes the existing model.
Figure 2: Overview of IncCTR architecture, where t indicates the incremental step.
Consequently, when the data from day 12 arrives, "Model 2" is trained based on "Model 1" and the data from day 12. As can be seen, when training in batch mode, two consecutive windows of training data overlap in most of their volume. For instance, the data from day 1 to day 10 and the data from day 2 to day 11 share the portion from day 2 to day 10, i.e., 90% of the window. Under such circumstances, replacing batch mode with incremental mode improves efficiency significantly, and such a replacement is highly likely to retain the performance.
Figure 3: Training with batch mode vs. incremental mode. Each block represents one day of training data.
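The data selection behind the two modes in Figure 3 can be sketched as follows (a purely illustrative snippet of ours, assuming day-indexed data blocks; variable names are hypothetical):

```python
def batch_mode_days(t, w=10):
    """Batch mode: the model for day t is retrained on the w most recent days ending at day t."""
    return list(range(t - w + 1, t + 1))

def incremental_mode_days(t):
    """Incremental mode: the existing model is updated with day t only."""
    return [t]

w0 = batch_mode_days(10)                     # days 1..10  -> "Model 0"
w1 = batch_mode_days(11)                     # days 2..11  -> "Model 1"
overlap = len(set(w0) & set(w1)) / len(w1)   # 9/10: consecutive windows share 90% of their volume
inc1 = incremental_mode_days(11)             # incremental mode touches only day 11
```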
3 THE INCCTR FRAMEWORK
An overview of our incremental learning framework IncCTR is shown in Figure 2. Three modules are designed, from the perspective of feature, model and data respectively, to balance the trade-off between learning from historical data and learning from incoming data. Specifically, the data module serves as a reservoir, constructing training data from both historical data and incoming data. The feature module handles new features from incoming data and initializes both existing features and new features properly. The model module employs knowledge distillation to fine-tune the model parameters.

Feature module. In recommendation and information retrieval scenarios, the feature dimension is usually very high, i.e., at the million or even billion scale [17]. The occurrence frequencies of such a large number of features follow a long-tailed distribution, where only a minor proportion of the features occur frequently and the rest appear rarely. As observed in [10], half of the features in their model occur only once in the whole training data. Features that rarely occur are difficult to learn well. Therefore, when training in batch mode, features need to be categorized as "frequent" or "infrequent" by counting the number of occurrences of each feature. More formally, a feature x whose occurrence S[x] is larger than a pre-defined threshold THR (i.e., S[x] > THR) is considered "frequent" and is learned as an individual feature. The rest of the "infrequent" features are treated as a special dummy feature
Others. After such processing, each feature is mapped to a unique id by some policy such as auto-increment, hash-coding, etc. We adopt an auto-increment policy F for simplicity. In batch mode, policy F is constructed from scratch by assigning unique ids to the individual features from the training data in a fixed-size window, where the unique ids increase one by one automatically. However, training in incremental mode brings an additional issue, as new features appear when new data comes in. As displayed in Figure 4, every block of new data brings a certain proportion of new features. For instance, as observed from the Criteo dataset, the first block of new data introduces 12% new features compared to the set of existing features before this block, while even the 14th block still brings 4% new features. Therefore, the policy F needs to be updated incrementally when new data comes in. It is also possible that a feature x, which was previously treated as Others, is considered as a unique feature if its occurrence S[x] rises above the threshold THR after new data comes in.
Figure 4: Proportion of new features, compared to the set of existing features, as blocks of new data come in, observed from the Criteo dataset.
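To make the feature-assignment policy concrete, the following is a minimal sketch of how the frequencies S, the auto-increment policy F and the embedding table could be updated when a new data block arrives (our own illustrative Python, assuming dict-based bookkeeping; names are hypothetical). The full procedure, including embedding initialization, is formalized in Algorithm 1 below.

```python
import numpy as np

THR = 20          # frequency threshold; infrequent features map to the dummy "Others" id
OTHERS_ID = 0
k = 8             # embedding size

def update_feature_module(raw_features, S, F, emb_table):
    """Update frequencies S, policy F, and the embedding table for a new block of raw features."""
    # 1. Update occurrence counts with the incoming data.
    for x in raw_features:
        S[x] = S.get(x, 0) + 1
    # 2. Assign an individual id to every feature not yet in F whose occurrence exceeds THR.
    for x, cnt in S.items():
        if x not in F and cnt > THR:
            F[x] = len(F) + 1                      # auto-increment id; 0 is reserved for Others
    # 3. Existing feature embeddings are inherited; new features are randomly initialized.
    n_old, n_new = emb_table.shape[0], len(F) + 1
    if n_new > n_old:
        new_rows = np.random.normal(0.0, 0.01, size=(n_new - n_old, k))
        emb_table = np.vstack([emb_table, new_rows])
    return S, F, emb_table

def feature_to_id(x, F):
    """Map a raw feature to its id, falling back to the dummy Others feature."""
    return F.get(x, OTHERS_ID)
```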
After assigning proper ids to all the features, the feature module in IncCTR initializes both existing features and new features. When we start training in batch mode, all the feature embeddings $\mathbf{E}$ are initialized randomly. In incremental mode, in contrast, we initialize the embeddings of existing features $\mathbf{E}_{exist}$ and the embeddings of new features $\mathbf{E}_{new}$ separately. The functionality of the feature module (namely, new feature assignment and feature embedding initialization) is presented in Algorithm 1. When new data comes in, we first update the occurrence count of each feature (line 3) and inherit the existing policy of feature assignment (line 4). If a feature from the new data is new and its occurrence is larger than the threshold (line 6), it is added to the policy with an id incremented by one (line 7). Feature embeddings are initialized separately, depending on whether a feature is new or not. An existing feature inherits the embedding from the existing model as its initialization (line 11); such inheritance transfers the knowledge in historical data to the model that will be trained incrementally. A new feature has its embedding randomly initialized, as no prior knowledge is available (line 12).

Model module. In this section, we introduce the model module in IncCTR, which trains the model properly, such that the model still "remembers" the knowledge from historical data while making some "progress" from the new data.
Fine-tune. Besides the embeddings of existing features, the network parameters are also inherited from the existing model as a warm start. To fine-tune all the model parameters in incremental mode, we apply some auxiliary tricks to achieve good performance. For instance, we use a lower learning rate for $\mathbf{E}_{exist}$ than for $\mathbf{E}_{new}$. The training details of fine-tuning are presented in lines 19 to 25 of Algorithm 2. The model is optimized by minimizing the cross entropy between predictions and ground truth. We train the model for a fixed number of epochs, which we empirically set to 1 (line 25).

Algorithm 1 Feature Module: New Feature Assignment and Feature Embedding Initialization
Input: features of incoming data $X_t$; labels of incoming data $Y_t$; existing model $M_{t-1}$; existing policy of feature assignment $F_{t-1}$; frequencies of existing features $S_{t-1}$
Output: initialized feature embeddings $\mathbf{E}_t$
Initialize: feature frequency threshold THR
New Feature Assignment:
  $S_t \leftarrow$ Feature_Frequency_Update($S_{t-1}$, $X_t$)
  $F_t \leftarrow F_{t-1}$
  for all features $f$ in $X_t$ do
    if $f \notin F_t$ and $S_t[f] > $ THR then
      $F_t \leftarrow$ add($f$, $|F_t| + 1$)
    end if
  end for
Feature Embedding Initialization:
  $\mathbf{E}_{exist} \leftarrow \mathbf{E}_{t-1}$
  $\mathbf{E}_{new} \leftarrow$ Random_Initialize
  $\mathbf{E}_t \leftarrow \mathbf{E}_{exist} \cup \mathbf{E}_{new}$
Return: $\mathbf{E}_t$

Knowledge distillation. Beyond the "fine-tune" approach presented above, we introduce a knowledge distillation (KD) method to enhance the knowledge learned from the historical data (namely, to avoid catastrophic forgetting). Hinton et al. [6] use KD to transfer knowledge from an ensemble of models into a single model for efficient deployment, where a KD loss is used to preserve knowledge from the cumbersome model by encouraging the outputs of the distilled model to approximate those of the cumbersome model. Similarly, the authors of LwF [7] perform KD to learn new tasks while keeping knowledge on old tasks. Borrowing a similar idea, KD can also be used in the incremental learning scenario to learn new knowledge from incoming data while preserving memory of historical data. Several options are available for designing the teacher model when applying KD in IncCTR. We present two such options.

• KD-batch. An outdated model trained in batch mode is a natural choice of teacher model to distill the incremental model, as it preserves performance on the historical data within a fixed-size window. We refer to the KD method with such a teacher trained in batch mode as "KD-batch".

• KD-self. As training a teacher model in batch mode requires extra computational resources, it is more convenient to use the previous incremental model as the teacher. In this case, the successive incremental model is trained under the supervision of the previous incremental model. We refer to such a design as "KD-self". A similar idea is employed in
BANs [3], where a consecutive student model is initialized randomly but taught by the previous teacher model, and an ensemble of multiple student generations is used to achieve desirable performance. That work is in the image recognition field, where all the models are trained in batch mode, which is significantly different from our framework.

When performing KD, we utilize the soft targets $Y_{soft}$ generated by the teacher model on the incoming data. The objective function is formulated as follows:

$$\mathcal{L} = \mathcal{L}_{CE}(Y, \hat{Y}) + \mathcal{L}_{KD}(\hat{Y}, Y_{soft}) + \mathcal{R} \qquad (2)$$
$$\mathcal{L}_{KD}(\hat{Y}, Y_{soft}) = \mathcal{L}_{CE}\big(\sigma(Z/\tau), \sigma(Z_{soft}/\tau)\big) \qquad (3)$$
$$\mathcal{L}_{CE}(Y, \hat{Y}) = \sum_{y_i \in Y} \mathcal{L}_{CE}(y_i, \hat{y}_i) \qquad (4)$$

The new objective function combines the standard binary cross-entropy $\mathcal{L}_{CE}(\cdot)$ (where $Y$ and $\hat{Y}$ denote the ground truth and the outputs of the new model, respectively) and the KD loss $\mathcal{L}_{KD}(\cdot)$. The KD loss $\mathcal{L}_{KD}(\cdot)$ is the cross entropy between $\hat{Y}$ and $Y_{soft}$ (where $Y_{soft}$ is the prediction of the teacher model), computed from the logits $Z$ and $Z_{soft}$ of the two models. The temperature $\tau$ is applied to obtain soft targets, and $\mathcal{R}$ is the regularization term. The intuition of the loss function in Equation 2 is that the knowledge of the distilled model should be accurate on the new data (first term), while it should not deviate significantly from the knowledge of the teacher model (second term).

The training details of KD-batch and KD-self are presented in lines 3 to 5 and lines 11 to 17 of Algorithm 2. The difference between KD-batch and KD-self is how the teacher model $Teacher_t$ is trained: the teacher model in KD-batch is an outdated model trained in batch mode, while the teacher model in KD-self is the previous incremental model. We compare their performance empirically in the Experiments section. Given the features of the input data, the incremental model $M_t$ and the teacher model $Teacher_t$ make predictions, as in lines 4 and 13. Then the incremental model $M_t$ is optimized by minimizing the loss function in Equation 2, as in line 14. The training process terminates when the model has been trained for at least one epoch and the KD loss stops decreasing, as in line 17.

Data module. From the data perspective, one straightforward way to tackle the catastrophic forgetting problem is to train the incremental model not only on the new data but also on some selected historical data. We plan to implement a data reservoir to provide proper training data for incremental training: some proportion of the data in the existing reservoir and the new data are interleaved to form the new reservoir. Several questions need to be investigated in this module, such as which data, and what proportion of it, should be kept from the existing reservoir. The implementation of the data module is not finished for now and will be part of future work to complete our framework.
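For concreteness, the combined objective in Equations 2–4, as used in the Train_with_KD routine of Algorithm 2 below, can be sketched as follows (a minimal PyTorch-style illustration of ours, assuming binary logits from student and teacher; the regularization term R and the weighting λ of Algorithm 2 are omitted):

```python
import torch
import torch.nn.functional as F

def incctr_kd_loss(student_logits, teacher_logits, labels, tau=1.0):
    """Combined loss of Equation 2: hard-label cross-entropy plus a KD term that
    matches the student's tempered predictions to the teacher's soft targets."""
    # First term (Eq. 4): binary cross-entropy against the ground-truth labels.
    y_hat = torch.sigmoid(student_logits)
    ce = F.binary_cross_entropy(y_hat, labels)

    # Second term (Eq. 3): cross-entropy between tempered student and teacher outputs.
    p_student = torch.sigmoid(student_logits / tau)
    y_soft = torch.sigmoid(teacher_logits / tau)      # soft targets from the teacher
    kd = F.binary_cross_entropy(p_student, y_soft)

    return ce + kd                                    # regularization R omitted in this sketch

# Usage: the teacher is the batch-mode model (KD-batch) or the previous incremental model (KD-self).
student_logits = torch.randn(4, requires_grad=True)
teacher_logits = torch.randn(4)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = incctr_kd_loss(student_logits, teacher_logits, labels, tau=2.0)
loss.backward()
```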
Algorithm 2 IncCTR
Input: features of incoming data $X_t$; labels of incoming data $Y_t$; existing model $M_{t-1}$; teacher model $Teacher_t$; existing policy of feature assignment $F_{t-1}$; frequencies of existing features $S_{t-1}$
Output: model $M_t$
Initialize: $L$ = MaxInt; $ep$ = 0; EPOCH = 1
$\Theta_t \leftarrow$ {Algorithm 1 $\cup$ network parameters}
if Train with KD then
  $Y_{soft} \leftarrow$ Inference($Teacher_t$, $X_t$)
  $M_t$ = Train_with_KD($X_t$, $Y_t$, $Y_{soft}$, $\Theta_t$)
else
  $M_t$ = Train_with_FT($X_t$, $Y_t$, $\Theta_t$)
end if
Return: $M_t$

Train_with_KD:
while stopping criteria not satisfied do
  $\hat{Y}_t$ = Inference($M_t$, $X_t$)
  $M_t \leftarrow \arg\min_{\Theta_t} \big(\lambda \mathcal{L}_{CE}(Y_t, \hat{Y}_t) + \mathcal{L}_{KD}(\hat{Y}_t, Y_{soft}) + \mathcal{R}\big)$
  $ep$ += 1
end while
stopping criteria: $\mathcal{L}_{KD}(\hat{Y}_t, Y_{soft})$ increases and $ep \ge$ EPOCH

Train_with_FT:
while stopping criteria not satisfied do
  $\hat{Y}_t$ = Inference($M_t$, $X_t$)
  $M_t \leftarrow \arg\min_{\Theta_t} \big(\mathcal{L}_{CE}(Y_t, \hat{Y}_t) + \mathcal{R}\big)$
  $ep$ += 1
end while
stopping criteria: $ep \ge$ EPOCH

4 EXPERIMENTS
In this section, we conduct experiments on a public benchmark and a private dataset, aiming to answer the following research questions:
• RQ1: What is the performance of IncCTR compared to training with batch mode?
• RQ2: What are the contributions of the different modules in the IncCTR framework?
• RQ3: How efficient is IncCTR compared to training with batch mode?
Datasets. To evaluate the effectiveness and efficiency of the proposed IncCTR framework, we conduct extensive experiments on both a public benchmark and a private dataset.

• Criteo. This dataset is used to benchmark algorithms for click-through rate (CTR) prediction (http://labs.criteo.com/downloads/download-terabyte-click-logs/). It consists of 24 days of consecutive traffic logs from Criteo, including 26 categorical features and 13 numerical features, with the first column as the label indicating whether the ad has been clicked or not.

• HuaweiApp. In order to demonstrate the performance of the proposed method in real industrial tasks, we conduct offline experiments on a commercial dataset. HuaweiApp contains 60 consecutive days of click logs collected from Huawei AppGallery with user consent, consisting of app features, anonymized user features and context features.

For ease of reproducing our experimental results, we present the details of data processing on the Criteo data. In a nutshell, we follow the Kaggle champion solution and [11], which involves data sampling, discretization and feature filtering. We do not give details of the processing of the HuaweiApp dataset for commercial reasons, but the procedure is similar.

• Data sampling: Considering the data imbalance (only 3% of samples are positive), similar to [12], we apply negative down-sampling to keep the positive ratio close to 50%.

• Discretization: Both categorical and numerical features exist in Criteo; however, the distributions of the two kinds of features are intrinsically quite different [11]. In most recommendation models, numerical features are transformed to categorical features through bucketing or taking logarithms. Following that, we use the logarithm as the discretization method:

$$v \leftarrow \lfloor \log^2(v) \rfloor \qquad (5)$$

• Feature filtering: Infrequent features are usually not very informative and may be noisy, so it is hard for models to learn such features well. Therefore, features in a certain field appearing fewer than 20 times are set to a dummy feature
Others, following [11].
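The preprocessing steps above can be sketched roughly as follows (our own illustrative Python, assuming a pandas DataFrame with a binary label column; the column names, the shuffling, and the guard for small values in the discretization are assumptions, not details from the paper):

```python
import numpy as np
import pandas as pd

def negative_downsample(df, label_col="label", target_pos_ratio=0.5, seed=0):
    """Randomly drop negative samples so that the positive ratio approaches ~50%."""
    pos, neg = df[df[label_col] == 1], df[df[label_col] == 0]
    n_neg_keep = min(len(neg), int(len(pos) * (1 - target_pos_ratio) / target_pos_ratio))
    sampled = pd.concat([pos, neg.sample(n=n_neg_keep, random_state=seed)])
    return sampled.sample(frac=1, random_state=seed)      # shuffle

def discretize_numeric(v):
    """Equation (5): bucket a numerical value via a squared logarithm (v > 2 guard is our assumption)."""
    return int(np.floor(np.log(v) ** 2)) if v > 2 else int(v)

def filter_infrequent(series, min_count=20, dummy="Others"):
    """Replace feature values appearing fewer than 20 times with the dummy feature Others."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    return series.where(series.isin(frequent), dummy)
```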
Evaluation Metrics.
We adopt AUC (Area Under the ROC Curve) and logloss (cross-entropy) as our evaluation metrics, which are widely used for CTR prediction models. It has also been acknowledged that an improvement of 0.1% in AUC or logloss can be considered significant for a CTR prediction model [1, 4, 15].
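For reference, both metrics can be computed with standard tooling (a minimal sketch, assuming scikit-learn; the arrays are illustrative only):

```python
from sklearn.metrics import roc_auc_score, log_loss

y_true = [1, 0, 1, 1, 0]             # ground-truth clicks
y_pred = [0.9, 0.2, 0.7, 0.6, 0.4]   # predicted click probabilities

auc = roc_auc_score(y_true, y_pred)  # Area Under the ROC Curve
ll = log_loss(y_true, y_pred)        # logloss (binary cross-entropy)
```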
Baseline.
A model trained in batch mode is used as the baseline to validate the effectiveness and efficiency of IncCTR. To further evaluate the impact of delayed model updates, we consider the baseline with different numbers of delay days. More specifically, Batch_i (i = 0, 1, 2, 3, 4, 5) represents the baseline model with i days' delay.

Implementation Details.
As the focus of this work is to compare the effectiveness and efficiency of deep CTR models trained in batch mode versus trained incrementally by IncCTR, we choose a popular deep CTR model, DCN [15], for this comparison; observations on other deep CTR models are similar. To mimic the training procedure in industrial scenarios, experiments are conducted over consecutive days. When training in batch mode, all the data in the fixed-size window (i.e., data in the size-$w$ window $[s, s+w)$, where $s \in [0, T-w]$) is utilized, while in incremental mode only the data of the incoming day (i.e., data in the size-1 window $[s, s+1)$, where $s \in [w, T-1]$) is available. Warm-start is performed for the incremental models with a model trained in batch mode before the first incremental step. That is, we first train a warm-started model in batch mode on the data in $[0, w)$, and then train the first incremental model on the data of the $w$-th day. We set $w$ = , $T$ = 23 for the Criteo dataset and $w$ = , $T$ = 59 for the HuaweiApp dataset.
Overall Performance (RQ1). Table 1 presents the overall performance comparison over consecutive test days. Models trained in batch mode with different numbers of delayed days, Batch-i (i ∈ [0, 5]), are also compared here. Specifically, Batch-0 represents the model that is tuned on the latest data (serving as validation) and then fine-tuned with the latest data (serving as training data), i.e., the training process is carried out twice. Batch-0 reaches the performance upper bound of batch mode; however, it is not feasible in practice as it doubles the training time of batch mode. The relative improvement of the best incremental model over the other models in terms of AUC is reported in the "Impr." column. From this table, we have the following observations.

• On the Criteo dataset, incremental models achieve better effectiveness than the baselines trained in batch mode, while accelerating the training procedure significantly. Specifically, incremental models outperform all the baselines, with significant improvement (more than 0.1%) over Batch-1 to Batch-5. Surprisingly, IncCTR achieves comparable performance to Batch-0 (which is an ideal model with upper-bound performance).

• On the HuaweiApp dataset, consistent results are obtained. Incremental models achieve comparable performance to the baselines with a huge efficiency improvement. A negligible decrease exists when comparing the performance of IncCTR with Batch-1, which can be ignored in practice. Unsurprisingly, Batch-0, which utilizes the entire dataset, outperforms the other models. Nonetheless, as stated earlier, Batch-0 is infeasible in practice as it doubles the training time of batch mode.

• On both datasets, severe performance degradation is observed as the delay period extends, which calls for an efficient training method. In industrial scenarios, an updating delay of 1-3 days is a common phenomenon when the model is trained in batch mode on a single device, owing to the enormous data volume, tedious preprocessing, cumbersome model structures and so on. Longer updating delays are also possible in more complicated settings where multi-task learning or ensembles are needed. We can see that, with the incremental learning method IncCTR, the improvement in performance is quite significant when model updating delay occurs; for instance, AUC improves by 0.6% and 0.4% when a 5-day delay exists on the HuaweiApp and Criteo datasets, respectively (as shown in Figure 1).
Ablation Studies (RQ2). To validate the contributions of the feature module and the model module in IncCTR, we perform ablation studies on these two modules.

• Feature Module. In the feature module, if the number of occurrences of a new feature from incoming data is above the threshold, we assign an individual feature id to this feature. To verify the usefulness of this new-feature expansion strategy, we conduct experiments comparing the effectiveness of expanding versus not expanding the new features during incremental learning over the two datasets.
Table 1: Overall performance comparison between IncCTR and training with batch mode over consecutive days on the Criteo and HuaweiApp datasets. Mean AUC and Logloss over consecutive days are reported. The underlined numbers represent the performance of the best variant of IncCTR. Besides performance, the average number of epochs and the average training time (sec) for updating a new model are also presented for efficiency comparison.
model            | Criteo: AUC / Logloss / avg epochs / avg time (s) / Impr. | HuaweiApp: AUC / Logloss / avg epochs / avg time (s) / Impr.
Batch-0          | 0.7977 / 0.5438 / 18 / 32694.66 / 0.06%  | 0.8543 / 0.0859 / 17.34 / 91711.26 / -0.14%
Batch-1          | 0.7956 / 0.5464 / 9 / 16347.33 / 0.33%   | 0.8532 / 0.0861 / 8.67 / 45855.63 / -0.01%
Batch-2          | 0.7946 / 0.5476 / 9 / 15907.42 / 0.45%   | 0.8523 / 0.0863 / 7.7 / 39697.77 / 0.09%
Batch-3          | 0.7939 / 0.5485 / 8 / 14020.88 / 0.54%   | 0.8514 / 0.0866 / 7.8 / 44587.14 / 0.20%
Batch-4          | 0.7932 / 0.5492 / 10 / 17718.42 / 0.63%  | 0.8496 / 0.0870 / 7.57 / 43478.67 / 0.41%
Batch-5          | 0.7925 / 0.5501 / 9 / 15853.46 / 0.72%   | 0.8475 / 0.0874 / 7.73 / 39030.66 / 0.66%
IncCTR-Fine-tune |
Table 2: Feature module: without new features vs. with new features (AUC improvement).

model             | Criteo  | HuaweiApp
w/o new features  | -       | -
with new features | +0.08%  | +0.055%

As shown in Table 2, the consistent performance degradation demonstrates the necessity of expanding new features. Specifically, because there are more new features per day on the Criteo dataset, the strategy of expanding new features has a greater impact on this dataset. Therefore, the performance of IncCTR on the Criteo dataset drops more than that on the HuaweiApp dataset when new features are not considered. Besides, the decrease becomes more severe as the model keeps training incrementally, as shown in Figure 5, where a performance decrease of more than 0.1% is eventually observed.

• Model Module. Fine-tune and knowledge distillation are applied in the model module. Compared with fine-tune, the two KD methods (KD-batch and KD-self) achieve similar performance on Criteo and slightly better performance on the HuaweiApp dataset. This may be because too few new features emerge in the new data: on average, only about 6% new features arise each day in the Criteo dataset, and a similar phenomenon occurs in the HuaweiApp dataset. Comparing the two KD methods, their performance is very close to each other, which suggests that KD-self is the better choice in practice, as no extra computational resources are needed to train the teacher model.
Efficiency Comparison (RQ3). The average number of epochs and the training time (sec) of different models are summarized in Table 1. When training in batch mode, one epoch means going through all the data within the fixed-size window, while in incremental mode it means going through only the incoming data. A tremendous advantage in efficiency is revealed: we obtain roughly 60x and 270x improvements in average training time on the Criteo and HuaweiApp datasets respectively, which is extremely helpful in practice.
Figure 5: Performance decrease when new features are not expanded in the feature module of IncCTR, over consecutive days on the Criteo dataset.
5 CONCLUSION AND FUTURE WORK
In this paper, we propose IncCTR, a practical incremental method to train deep CTR models. IncCTR includes a data module, a feature module and a model module. Specifically, we propose new-feature expansion and initialization strategies in the feature module, and several training algorithms in the model module. Comprehensive experiments are conducted to demonstrate the effectiveness of these two modules in IncCTR. Compared with conventional batch mode training, our method achieves comparable performance in terms of AUC with much better training efficiency, which is extremely helpful in practice. There are two interesting directions for future study. One is investigating novel approaches to utilize historical data to guarantee the stability of incremental training, such as a reservoir. The other is how to update (add or remove) features incrementally and efficiently in the production environment of a recommender system.
REFERENCES
[1] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, et al. 2016. Wide & Deep Learning for Recommender Systems. In DLRS@RecSys. ACM, 7–10.
[2] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
[3] Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born Again Neural Networks. arXiv:stat.ML/1805.04770
[4] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI. 1725–1731.
[5] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, and Stuart Bowers. 2014. Practical Lessons from Predicting Clicks on Ads at Facebook. In Eighth International Workshop on Data Mining for Online Advertising. 1–9.
[6] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531. http://arxiv.org/abs/1503.02531
[7] Zhizhong Li and Derek Hoiem. 2016. Learning Without Forgetting. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV (Lecture Notes in Computer Science), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.), Vol. 9908. Springer, 614–629.
[8] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170.
[9] Bin Liu, Ruiming Tang, Yingzhi Chen, Jinkai Yu, Huifeng Guo, and Yuzhou Zhang. 2019. Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction. In The World Wide Web Conference, San Francisco, CA, USA, May 13-17. ACM, 1119–1129.
[10] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad click prediction: a view from the trenches. In ACM SIGKDD. https://doi.org/10.1145/2487575.2488200
[11] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-Based Neural Networks for User Response Prediction. In IEEE 16th International Conference on Data Mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain. IEEE, 1149–1154.
[12] Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2019. Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data. ACM Trans. Inf. Syst.
[13] Steffen Rendle. 2010. Factorization Machines. In ICDM. IEEE, 995–1000.
[14] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In ADKDD. ACM, 12.
[15] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD'17, Halifax, NS, Canada, August 13-17, 2017. ACM, 12:1–12:7.
[16] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field categorical data. In European Conference on Information Retrieval. Springer, 45–57.
[17] Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li. 2020. Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems. CoRR abs/2003.05622. https://arxiv.org/abs/2003.05622
[18] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2018. Deep Interest Evolution Network for Click-Through Rate Prediction. arXiv:stat.ML/1809.03672
[19] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In ACM SIGKDD.