Field-aware Factorization Machines in a Real-world Online Advertising System
Yuchin Juan* (Criteo Research, Palo Alto, CA) [email protected]
Damien Lefortier* (Facebook, London, UK) [email protected]
Olivier Chapelle (Google, Mountain View, CA) [email protected]
* Contributed equally to this work.
ABSTRACT
Predicting user response is one of the core machine learning tasks in computational advertising. Field-aware Factorization Machines (FFM) have recently been established as a state-of-the-art method for that problem and in particular won two Kaggle challenges. This paper presents some results from implementing this method in a production system that predicts click-through and conversion rates for display advertising, and shows that it is not only effective for winning challenges but is also valuable in a real-world prediction system. We also discuss specific challenges and solutions to reduce the training time, namely the use of an innovative seeding algorithm and a distributed learning mechanism.
1. INTRODUCTION
Online advertising is a major business for Internet companies, and one of the core problems in that field is to match the right advertisement to the right user at the right time. Accurate click-through rate prediction is essential for solving that problem and has been the topic of extensive research, both for search advertising [11, 20] and display advertising [5, 14]. Performance-based advertisers measure the performance of their campaigns not only with respect to clicks, but also to conversions (defined as a user action on the website, such as a purchase), and specific machine learning models have been developed for conversion prediction [15, 23, 3, 26].

A prominent model for these prediction problems is logistic regression with cross-features [20, 5]. When all cross-features are added, the resulting model is equivalent to a polynomial kernel of degree 2 [2]. A Kaggle challenge was hosted by Criteo in 2014 to compare CTR prediction algorithms. Logistic regression with cross-features was indeed quite successful in that competition: the 3rd place solution was based on this technique [24]. But the winning solution was a variant of factorization machines [22] called Field-aware Factorization Machines (FFM) [14]. The impressive performance of FFM prompted us to implement this method and test it as part of our production system.
FFM.
Consider the case of categorical features; most features in ad systems are either categorical or can be made categorical through discretization. Let $F$ be the number of features (or fields) and $v_1, \ldots, v_F$ be the values of these features for a given example. The FFM prediction on this example can be written as:

$$\sum_{f_1=1}^{F} \sum_{f_2=f_1+1}^{F} w_{i_1} \cdot w_{i_2}, \quad \text{where } i_1 = \Phi(v_{f_1}, f_1, f_2),\; i_2 = \Phi(v_{f_2}, f_2, f_1), \qquad (1)$$

with $w \in \mathbb{R}^{d \times k}$ the weight matrix, where $w_i \in \mathbb{R}^k$ denotes the embedding of the $i$-th entry. The mapping $\Phi(v, f_1, f_2)$ maps a value $v$ of feature $f_1$ in the context of feature $f_2$ to an index from 1 to $d$. This may be any hash function, or be based on a dictionary; in the latter case, $d$ equals $F \times \sum_{f=1}^{F} c_f$, with $c_f$ the cardinality of the $f$-th feature. In regular factorization machines, there is a unique embedding for a given feature value; in other words, the indices in (1) for FM are $i_1 = \Phi(v_{f_1}, f_1)$ and $i_2 = \Phi(v_{f_2}, f_2)$. But in field-aware FM, there is a different embedding depending on the other feature of the dot product. As argued in [14], this gives additional modeling flexibility. (The prediction here is specific to categorical features, while [14] handles the more general case of continuous features.)
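As an illustration, here is a minimal Python sketch of the prediction (1); the function name and toy interface are ours, and Φ is passed in as an arbitrary mapping (dictionary- or hash-based):

```python
import numpy as np

def ffm_predict(values, phi, w):
    """FFM score for one example, following equation (1).

    values: list of F categorical values, one per field.
    phi:    callable (v, f1, f2) -> index in [0, d): the entry for
            value v of field f1 in the context of field f2.
    w:      (d x k) weight matrix; w[i] is the embedding of entry i.
    """
    F = len(values)
    score = 0.0
    for f1 in range(F):
        for f2 in range(f1 + 1, F):   # all pairs f1 < f2, as in (1)
            i1 = phi(values[f1], f1, f2)
            i2 = phi(values[f2], f2, f1)
            score += np.dot(w[i1], w[i2])
    return score
```

In practice $w$ would typically be initialized with small random values and the score fed through a sigmoid to produce a probability.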
Related work.
A similar effort to ours has been reported by AdRoll in a blog post (http://tech.adroll.com/blog/data-science/2015/08/25/factorization-machines.html): the author reports substantial gains after deploying FMs in their CTR prediction system. Google [20] and Facebook [12] may not use FMs, but have reported some specific challenges they encountered in productionizing their large-scale CTR prediction systems, which are related to the challenges of productionizing FFM. Factorisation Machine supported Neural Network (FNN) and Sampling-based Neural Network (SNN) [28] are two learning algorithms related to FMs that have also been applied to a CTR prediction task. They are both deep neural networks but differ in their embedding layer: SNN uses a regular embedding layer, while FNN is initialized with the result of a factorization machine. The recent interest in factorization machines has led to the development of distributed solvers [18] for these techniques. Finally, a hierarchical version of factorization machines has been introduced in [21].

Even though FFM have been shown to be a state-of-the-art method for computational advertising by winning two Kaggle challenges, it is still unclear whether they are well suited to a production environment. The Netflix challenge is a reminder that a production system has a specific set of constraints and goals that differ from those of an academic competition: ultimately, Netflix decided not to use the winning solution (http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html).

This paper discusses our attempt at implementing FFM in a production system that predicts click-through and conversion rates on display advertisements. Section 2 presents offline and online (A/B test) results and provides some insights on the benefits of this method over standard logistic regression, as well as the challenges of using FFM in a production system. These positive results further led us to address one of the main bottlenecks encountered with our FFM implementation: training speed. Section 3 investigates how to train FFM in a distributed environment, and Section 4 offers an innovative model seeding procedure to further address that problem, resulting in a more accurate model with a shorter training time and fewer computation resources. Finally, Section 5 presents conclusions and future work.
2. FFM IN A PRODUCTION SYSTEM
In this section, we describe how we use FFM in our production system, present our offline and online results, and discuss the benefits and challenges of using FFM in such a setting.
As discussed in Section 1, state-of-the-art advertising systems are based on click-through rate (CTR) and conversion rate (CR) prediction models. In this paper, we consider both CTR and CR prediction models used for bidding in real-time auctions (see, e.g., [5, 26]). To predict the probability of a sale given a display, we use a multiplicative model between a model of the probability of a click given a display and a model of the probability of a sale given a click, as discussed in [3]. So, in the rest of the paper, we call these two models CTR and CR.

Our baseline system for training these models is based on previous work [1, 5, 26]. Following [1, 5], we use the hashing trick [27] to reduce the dimensionality of our data and thus the number of parameters to fit. We use logistic regression (LR) with cross-features, fitted with L-BFGS warm-started using SGD [1, 5]. Following [26], we also use cost-sensitive learning for the CR model and weight each sale depending on its value for the advertiser, as this was shown to increase the performance of the CR model both offline and online. We use Hadoop AllReduce for distributing the learning of our models [1].

Below, we investigate the usage of FFM instead of LR for training our CTR and CR prediction models. We still use the hashing trick, so the mapping Φ(v, f_1, f_2) in (1) is based on a hash with a fixed hashing space (of the order of tens of millions). We now present results comparing FFM to the state-of-the-art baseline on an offline dataset.
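To illustrate, here is one possible hash-based implementation of the mapping Φ just described (a sketch only; the production hash function and hashing-space size are not the ones used here):

```python
import hashlib

D = 1 << 25  # fixed hashing space, of the order of tens of millions

def hash_phi(value, f1, f2, d=D):
    """Hash the (value, field, context-field) triple to an index in [0, d)."""
    key = "{}|{}|{}".format(f1, f2, value).encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "little") % d
```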
Offline metrics.
We use two offline metrics. First, we use the normalized log loss (NLL). This metric shows the relative improvement in log loss (LL) of the model to be evaluated versus a baseline predictor, in our case the average empirical CTR or CR of the dataset, similar to the normalization in [12, 16, 26]. It is defined formally for any prediction $p$ as follows, where we denote by $\bar{p}$ the best constant predictor on the test set and by $N$ the number of impressions in our dataset:

$$\mathrm{LL}(p) = -\sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \qquad (2)$$

$$\mathrm{NLL}(p) = \frac{\mathrm{LL}(\bar{p}) - \mathrm{LL}(p)}{\mathrm{LL}(\bar{p})} \qquad (3)$$

We also use the Utility metric [4, 26] (called expected Utility in [4]; we refer to it simply as Utility in this paper), which allows one to model offline the potential change in profit due to a prediction model change. Since the observed profit in historical data is fixed, this metric assumes that the display costs are determined by the highest second bids coming from a second price auction, and that they are generated according to a distribution conditioned on the observed display cost. It is defined as follows, where $v_i$ is the reward of the $i$-th impression:

$$\mathrm{Utility} = \sum_{i} \int_{0}^{p(x_i)\, v_i} (y_i \cdot v_i - \tilde{c}) \Pr(\tilde{c} \mid c_i) \, d\tilde{c} \qquad (4)$$

The distribution $\Pr(\tilde{c} \mid c)$ specifies what could have been the second price instead of the observed cost $c$; [4] suggests a Gamma distribution with $\alpha = \beta c + 1$ and free parameter $\beta$. The motivation for selecting this distribution is that it interpolates nicely between two limit distributions: a Dirac distribution centered at $c$ (as $\beta \to +\infty$) and a uniform distribution (as $\beta \to 0$).
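Both metrics are straightforward to compute; below is a small sketch of each (our own code, assuming NumPy arrays, with the integral in (4) estimated by Monte Carlo; all names are ours):

```python
import numpy as np

def nll(p, y):
    """Normalized log loss, equations (2)-(3): relative improvement of
    predictions p over the best constant predictor on the test set."""
    ll = lambda q: -np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))
    p_bar = np.full_like(p, y.mean())
    return (ll(p_bar) - ll(p)) / ll(p_bar)

def utility(p, y, v, c, beta, n_samples=1000, seed=0):
    """Utility, equation (4): draw plausible second prices
    c~ ~ Gamma(alpha = beta*c + 1, scale = 1/beta) and accumulate the
    profit of the displays we would have won, i.e. where p(x)*v > c~."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for p_i, y_i, v_i, c_i in zip(p, y, v, c):
        c_tilde = rng.gamma(shape=beta * c_i + 1, scale=1.0 / beta,
                            size=n_samples)
        won = p_i * v_i > c_tilde
        total += np.mean(np.where(won, y_i * v_i - c_tilde, 0.0))
    return total
```

With scale $1/\beta$, this Gamma distribution has mean $c + 1/\beta$ and variance $(\beta c + 1)/\beta^2$, which indeed collapses onto $c$ as $\beta \to +\infty$ and flattens out as $\beta \to 0$.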
Experimental setup.
We use internal data from Criteo for our experiments. Note, however, that, as discussed in Section 1, FFM have already been shown to be better than existing methods on many public data sets [14]. Moreover, the goal of this section is to show that we can improve upon our baseline using FFM in a real-world online advertising system, which uses its own data. We need offline experiments to ensure that FFM perform well in our system (both in terms of predictive performance and scalability) and for parameter tuning, before we can run a live experiment (A/B test).

We use a variant of progressive validation, similar to [20], for our experiments. The day following the training period serves as a validation set. As shown in Figure 1 below, the process is repeated N times, shifting the learning period (indicated by "tr") by 1 day at each step. The final results are the average metrics over all the test sets (indicated by "te").

[Figure 1: Sliding training ("tr") and test ("te") periods used in our progressive validation setup.]
Latency & memory consumption.
One potential drawback of using FFM in a production system is that they require more CPU time for inference [14]. This may lead to increased latency online when responding to bid requests, and therefore to more timeouts. FFM also require more memory for storing the model as the number of latent factors and/or the number of fields increases, which may lead to a much larger memory consumption than LR. To solve the memory issue, we propose to reduce the size of the hashing space of FFM models (compared to our baseline) so that FFM models have the same size as the LR models (the exact value depends on the number of fields and on the number of latent factors). Note that if we had not reduced the size of the hashing space, but kept it constant, FFM models would be more than 100 times larger than our baseline, which would make them impractical. Therefore, in the results below, FFM and LR models have the same number of parameters (unlike in [14]).

To solve the latency issue, we propose to reduce the number of latent factors as much as possible without significantly degrading the performance of FFM. Using these two solutions, FFM and LR consume the same amount of memory, and we can limit the impact on latency enough to meet the requirements of our production system, as we will see below.
Offline results.
We compare LR and FFM on our CTR and CR prediction tasks in terms of NLL (Table 1) and Utility (Table 2). FFM achieves significantly better results with a large effect compared to LR, both in terms of NLL and of Utility for our CTR model, thus confirming the results from [14] on our data. We also observe large gains on our CR model, thus extending the results from [14] to CR models on all our offline metrics.

We also observe that the improvements are even larger on small advertisers, which represent a significant portion of our traffic, for both our CTR and CR models on all metrics. Our hypothesis to explain these results has to do with sparse data and unobserved cross-features: LR is unable to predict the value associated with a cross-feature that is not part of the training data; on the other hand, FFM are able to generalize better through their latent representation (see the detailed explanation and example in [14, Section 2]). For large advertisers, LR has enough data to learn a good model, but for small advertisers, FFM handle this data sparsity issue better than LR.

Table 1: Offline relative comparison between Logistic Regression (baseline) and FFM on our CTR and CR prediction tasks in terms of NLL (3). Statistical significance is indicated by ▲.

Prediction model with FFM | NLL on all advertisers | NLL on small advertisers
CTR                       | +3. ▲                  | +5. ▲
CTR + CR                  | +1. ▲                  | +6. ▲

Table 2: Offline relative comparison between Logistic Regression (baseline) and FFM on our CTR and CR prediction tasks in terms of the Utility metric (4). We report the Utility of our model for the expected number of sales given a display, which uses our CTR and CR models as sub-models. Statistical significance is indicated by ▲.

Prediction model with FFM | Utility β=10, all advertisers | Utility β=10, small advertisers | Utility β=1000, all advertisers | Utility β=1000, small advertisers
CTR                       | +6. ▲  | +9. ▲  | +2. ▲ | +4. ▲
CTR + CR                  | +11. ▲ | +38. ▲ | +5. ▲ | +18. ▲

During the tuning of the hyper-parameters, we observed results very similar to [14] in terms of performance w.r.t. each hyper-parameter. The most important parameter is the number of epochs, and we use early stopping to tune it automatically.

We also investigated the prediction time of FFM compared to the baseline model, which is expected to increase despite the fact that we constrained our FFM models to be of the same size as the baseline. This is because the number of operations to compute the prediction (1) is O(F²k), while LR with all cross-features requires only O(F²) operations. We observed that the slowdown of FFM is indeed proportional to the number of latent factors k. It turns out that k = 2 is a good trade-off: it hardly degrades the accuracy compared to the results above, which were obtained with 4 latent factors (0.1% in NLL), and a 2x increase in prediction time is acceptable in our system, since prediction is not the most time-consuming part of processing a request (compared to extracting raw features, pre-processing them, etc.).
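To see where this extra inference cost comes from, one can simply count the multiplications behind (1) (an illustrative count of ours, ignoring feature extraction and hashing):

```python
def mult_count_lr(F):
    """LR with all cross-features: one weight lookup/add per feature pair."""
    return F * (F - 1) // 2

def mult_count_ffm(F, k):
    """FFM: one k-dimensional inner product per pair of fields."""
    return F * (F - 1) // 2 * k

# Both models touch all F(F-1)/2 pairs, but FFM does a k-dim dot product
# per pair: with k = 2 it does twice the pair-wise work of LR, with k = 4
# four times, matching the observed slowdown proportional to k.
```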
As the offline results were quite promising, we decided to run an A/B test using FFM for both our CTR and CR prediction models. Although FFM require more time for inference (see above), we did not observe any significant impact on our timeouts while serving live traffic, so we were able to A/B test FFM on a large portion of our live traffic. This A/B test served ∼5B displays. The baseline and FFM models were refreshed synchronously, since different refresh rates might bias the results. Even with multi-threading, the learning time of FFM is indeed much higher than for our distributed optimization baseline. In Sections 3 and 4, we will see how to reduce this learning time, but for now we focus only on the performance improvements we can get online with FFM.

The results we obtained are shown in Table 3. We observed an increase in the number of displays (+4.…) ▲.

Table 3: Online relative comparison between Logistic Regression (baseline) and FFM on our CTR and CR prediction models in terms of Return On Investment (ROI), i.e. advertiser value over cost, during our A/B test. Statistical significance is indicated by ▲.

Prediction model with FFM | ROI on all advertisers | ROI on small advertisers
CTR + CR                  | +0. ▲                  | +2. ▲

Our positive online results motivated us to use FFM in production instead of LR. To do so, the code change is rather small if SGD is already available. However, there are a few challenges to keep in mind when using FFM instead of LR in a production system.

The main concern with rolling out FFM is the learning time, which is much higher than the baseline, as discussed before. This means that our models would be refreshed less often with FFM, at the cost of reducing the performance of the system. All our offline experiments to improve our models would also take much longer. This is not acceptable, and we will discuss in the next two sections how to tackle this problem at the scale of a large production system, in particular by distributing the learning over multiple machines.

There are also other challenges. Above, we discussed the memory consumption and prediction latency issues and showed how to manage them. Another potential problem is the non-convexity of the objective function of FFM, which may lead to some instability in the performance of FFM due to local minima. To investigate this, we learned multiple FFM on the same dataset, initialized with random weights as in [14]. We observed that all the models have similar performance (±0.05% of NLL) despite the different initializations. The local minimum issue is thus not a major concern.

We also saw above that the number of hyper-parameters in FFM is larger than for LR, with the addition of the learning rate (as we use L-BFGS for training our LR models) and of the number of latent factors, while we only had the regularization parameter to tune for LR. This means that tuning takes more time when improving our models. However, as discussed in [14], this is not a major problem, for multiple reasons. First, the performance is not very sensitive to the number of latent factors and to the regularization parameter, while a good value for the learning rate is easy to find. We also found the performance of FFM to be stable over time w.r.t. the hyper-parameters (no need for constant re-tuning).

As we have not been able to find a satisfying regularizer for FFM, we use early stopping to avoid over-fitting [14]; it is the only solution we have. So, some monitoring should also be added to ensure that we are not under-fitting or over-fitting despite using early stopping (e.g., if the small amount of data used for testing and deciding when to stop is not representative).

Note finally that for efficient regression testing [16], we need to fix the seed used for randomizing the initial weights [14].
3. A SIMPLE DISTRIBUTED SETTING
In the previous section, we discussed that the training time of FFM is too slow to meet our production requirements, even after applying the parallelization approach mentioned in [14] on a multi-core machine. To get more speed-up, a natural option is to train FFM on a distributed system.

Generally speaking, for sequential algorithms such as SGD or dual coordinate descent, the convergence of their parallelization depends on how often each worker can access the model. In shared-memory systems, because each thread can access the model in real time, it is possible for the convergence to remain the same, as shown in [14]. However, in distributed systems, where we need to use the network for communication, we can no longer share the model among machines in real time (due to network overhead). There are two main ways of distributing a stochastic gradient algorithm: synchronously and asynchronously. In both cases, each machine has a subset of the data and its own local model, and it updates the global model after a batch of data points has been processed. Asynchronous training is often referred to as the parameter server approach [17, 18, 7]: some machines are dedicated to storing the global model, and the workers continuously read and update that model with their local models. Synchronous training, on the other hand, is referred to as iterative parameter mixing (IPM) [19, 29, 1]: all the models are averaged after a certain amount of data has been processed (e.g., every epoch).

From an engineering point of view, simplicity is one of the most important factors we consider when choosing an algorithm. A complicated algorithm requires more time for development, is harder to maintain, and is more likely to introduce bugs. Therefore, in practice, if a simpler algorithm can solve our problem, we do not go for a more complicated one. As we will see, with IPM we are already able to speed up the training time 12x with 32 machines. This already meets our requirement, so we do not investigate the parameter server approach in this paper. IPM for the AdaGrad learning algorithm [9] is described in Algorithm 1.
Algorithm 1 Iterative Parameter Mixing (IPM) for AdaGrad
1: Split m data points across k machines
2: Initialize w
3: Initialize G_i ← I, ∀i ∈ {1, · · · , k}
4: for t ∈ {1, · · · , T} do  ▷ T: number of epochs
5:   Let w_i ← w, ∀i ∈ {1, · · · , k}
6:   for i ∈ {1, · · · , k} parallel do
7:     for each data point do
8:       Calculate the gradient g
9:       Update G_i: G_i ← G_i + diag(g gᵀ)
10:      Update w_i: w_i ← w_i − η G_i^{−1/2} g
11:  w ← (1/k) Σ_{i=1}^{k} w_i

The speed-up of a distributed algorithm can be modeled by the following equation:

speed-up = (number of machines) × (epochs needed on a single machine) / (epochs needed by the distributed algorithm).

On a single machine, the best learning rate η is 0.2, with which 8 epochs are needed to reach the best log loss; with 32 machines and the same learning rate, Algorithm 1 needs about 20 times more epochs (157). Therefore, the speed-up is only 32 / (157 / 8) ≈ 1.6.
A natural way to make the convergence faster is to increase the learning rate η. Though increasing the learning rate indeed makes the algorithm converge faster, it also makes the log loss worse. This result is shown in Table 5a.

We propose the following approach to solve this issue. Remember that, following [14], we use AdaGrad [9] to boost the performance of SGD. AdaGrad records the squared gradient sum (G) to dynamically adjust the learning rate for each dimension. In Algorithm 1, G is not synchronized among machines; since each machine only sees a fraction of the data, G on each machine may be very small, making the effective learning rate too large. Based on an idea similar to [1], we aggregate G across machines at the end of each epoch. This new algorithm is described in Algorithm 2.

Algorithm 2 Improved IPM for AdaGrad
1: Spread m data points into k machines
2: Initialize w
3: Initialize G ← I
4: for t ∈ {1, · · · , T} do  ▷ T: number of epochs
5:   Let w_i ← w, ∀i ∈ {1, · · · , k}
6:   Let G_i ← G, ∀i ∈ {1, · · · , k}
7:   for i ∈ {1, · · · , k} parallel do
8:     for each data point do
9:       Calculate the gradient g
10:      Update G_i: G_i ← G_i + diag(g gᵀ)
11:      Update w_i: w_i ← w_i − η G_i^{−1/2} g
12:  w ← (1/k) Σ_{i=1}^{k} w_i
13:  G ← Σ_{i=1}^{k} G_i

The experimental result is shown in Table 5b: the log loss is much better when a large learning rate is used.

Table 5: With 32 machines, the number of epochs required to reach the best log loss with different learning rates, for (a) Algorithm 1 and (b) Algorithm 2.

Under this setting, if we choose η = 3.0, the speed-up we can achieve is 32 × (8 / …) ≈ 12. Indeed, after applying this setting in our system, we observed a similar speed-up, which enables us to train a model as fast as our current system.
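For concreteness, here is a single-process simulation of Algorithm 2 (and of Algorithm 1 when `sync_G=False`); a real deployment runs the inner loops on separate machines and communicates via AllReduce, which this sketch of ours does not model:

```python
import numpy as np

def ipm_adagrad(shards, grad_fn, dim, eta, epochs, sync_G=True):
    """Iterative parameter mixing for AdaGrad.

    shards:  k lists of examples, one per (simulated) worker.
    grad_fn: grad_fn(w, x) -> gradient of the loss on example x.
    sync_G:  False keeps G worker-local (Algorithm 1); True re-seeds
             and aggregates G every epoch (Algorithm 2).
    """
    k = len(shards)
    w = np.zeros(dim)
    G = np.ones(dim)                       # diagonal of G, initialized to I
    G_local = [np.ones(dim) for _ in range(k)]
    for _ in range(epochs):
        w_local = [w.copy() for _ in range(k)]
        if sync_G:
            G_local = [G.copy() for _ in range(k)]
        for i in range(k):                 # run in parallel in production
            for x in shards[i]:
                g = grad_fn(w_local[i], x)
                G_local[i] += g * g        # G_i <- G_i + diag(g g^T)
                w_local[i] -= eta * g / np.sqrt(G_local[i])
        w = sum(w_local) / k               # w <- (1/k) sum_i w_i
        if sync_G:
            G = sum(G_local)               # G <- sum_i G_i (Algorithm 2, line 13)
    return w
```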
4. WARM-START
As described in Section 2, we regularly re-train models. In Figure 1, suppose each training set contains several days of data, and we move a few hours forward at each step; there will then be a large amount of overlap between the training sets of two consecutive steps.
It is therefore tempting to use warm-start (see, e.g., [6, 8, 25]), i.e., to seed each step with the model obtained at the previous step. For a convex problem, warm-start only influences the convergence speed (assuming an appropriate optimization method and a tight stopping criterion are applied). However, this is not the case for FFM. To explain why, we first review an undesired property of FFM that has been investigated in [14]: we do not have a good regularization method for FFM, and hence need to rely on early stopping to prevent over-fitting. We visualize this property in Figure 2. To obtain the best test accuracy, the number of epochs must be carefully selected; with insufficient epochs, the model can be under-fitting, while with too many epochs, the model can be over-fitting. To determine the best number of epochs, we usually use a validation set to monitor the model performance at each epoch. Once the validation loss goes up, we stop the training process. We define three phases to indicate the "maturity" of the model:

• Pre-mature: the model is trained with too few epochs
• Mature: the model is trained with enough epochs
• Post-mature: the model is trained with too many epochs

[Figure 2: An illustration of the over-fitting problem: the training loss keeps decreasing with the number of epochs, while the test loss goes through pre-mature, mature, and post-mature phases.]

The use of early stopping, however, makes warm-start difficult to apply: if we seed a mature model into the next step and keep training, the new model can become post-mature. This problem can be demonstrated in the following experiment. We again use Criteo's CTR Prediction Challenge dataset for reproducibility. We split the data set into 90 blocks; at each step, 44 blocks are used for training, 1 block for validation, and 1 block for test. The entire experiment thus starts from the 46th block (as test set), moves one block forward at each step, and ends at the 90th block (as test set). The validation set is used to determine the number of epochs. We first compare a baseline setting, which does not use any warm-start, with the naive warm-start described in Algorithm 3, which simply seeds the model obtained at the end of each step into the next step.

Algorithm 3 A naive warm-start
Require: an initial model w₀
1: w ← w₀
2: Calculate the validation loss L₀
3: for t ∈ {1, . . . , T} do
4:   Update w
5:   w_t ← w
6:   Calculate the validation loss L_t
7:   if L_t > L_{t−1} then
8:     return w_{t−1}

[Figure 3: The test log loss of FFM with different seeding approaches. The y-axis is the difference in log loss compared with the baseline (FFM without warm-start).]

The experimental result shown in Figure 3 indicates that the post-mature problem indeed occurs, and seriously so: the test accuracy gets worse and worse as the experiment moves forward. Note again that the goal of a warm-start technique is to reduce training time while keeping the same predictive performance of the model. Clearly, a naive warm-start for FFM does not achieve this goal.

In this paper, we propose a new warm-start approach named pre-mature warm-start. The idea is that instead of seeding a mature model into the next step, a pre-mature model is used as the seed. At each step, since the new model is initialized with a pre-mature model, it may be able to learn
from the new data without over-fitting to the old data. For example, if the mature model comes at the 6th epoch, then this model will be used for prediction, but the model obtained at the 5th epoch will be seeded into the next step. The procedure is described in Algorithm 4; here, w_{t−1} is used for prediction and w_{t−2} is seeded.

Algorithm 4 Our proposed "pre-mature" warm-start
Require: an initial model w₋₁
1: w ← w₋₁
2: Calculate the validation loss L₀
3: for t ∈ {1, . . . , T} do
4:   Update w
5:   w_t ← w
6:   Calculate the validation loss L_t
7:   if L_t > L_{t−1} then
8:     return (w_{t−1}, w_{t−2})

[Figure 4: Number of epochs used in each step. Both settings use 44 blocks of training data.]
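The bookkeeping of Algorithm 4 is easy to get wrong, so here is a small sketch of ours (`update` runs one training epoch and `val_loss` evaluates on the validation block; both are placeholders):

```python
import copy

def premature_warm_start(w_seed, update, val_loss, T):
    """Early-stopped training that returns both the mature model, used
    for prediction, and the pre-mature one, seeded into the next step."""
    history = [copy.deepcopy(w_seed)]   # w_0: the previous step's pre-mature model
    losses = [val_loss(history[0])]
    for t in range(1, T + 1):
        w = update(copy.deepcopy(history[-1]))   # one more training epoch
        history.append(w)
        losses.append(val_loss(w))
        if losses[t] > losses[t - 1]:            # validation loss went up: stop
            return history[t - 1], history[max(t - 2, 0)]   # (mature, pre-mature)
    return history[-1], history[max(len(history) - 2, 0)]
```

In the pipeline, the first returned model is the one served, while the second is written out as the seed for the next re-training.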
Offline results.
The experimental results in Figures 3 and 4 show that with pre-mature warm-start, the test performance is no longer worse than the baseline, and the number of epochs required is significantly reduced.

It is noteworthy that the log loss of FFM with warm-start keeps getting lower as the experiment moves forward. This suggests that FFM may have some ability to remember the information learnt in the past. Inspired by this observation, we tried reducing the size of the training set. Figure 5 shows the comparison among different training sizes with pre-mature warm-start. We see that after a sufficient number of steps, pre-mature warm-start with only 4 blocks of training data is still better than the baseline using 44 blocks. By using a smaller training set, the training becomes much faster; the comparison of training times is shown in Table 6. If we use 4 blocks for training, then training is 20 times faster than the baseline.

An extreme case is to reduce the size of the training set to only one block. In this case, because there is no overlap between two consecutive steps, we do not have to use pre-mature seeding.

[Figure 5: The log loss difference between the baseline and different warm-start approaches and training sizes. The baseline (without warm-start) and pre-mature warm-start use 44 blocks of training data at each step. Note that we change the training size from the second step on; for the first step, all settings use the 44 previous blocks as training set, so the log losses are the same.]
Discussion.
We have proposed two different ways to reduce training time. Distributed learning reduces the training time by adding more machines, but at the same time also increases the total amount of computation (in our previous experiments, when 32 machines are used, we needed roughly 3 times more epochs). On the other hand, warm-start reduces the training time by initializing the model wisely, and requires fewer training epochs, which means the amount of computation is decreased. In this sense, warm-start seems to be a better approach than distributed learning. However, we cannot completely replace distributed learning with warm-start, because sometimes a cold-start is required, meaning we need to train an entirely new model. In practice, this can happen when the code is updated or the system encounters an unexpected error. In the cold-start scenario, we still need to rely on distributed learning to make sure we can train the model on time.
5. CONCLUSION
In this paper, we showed that Field-aware Factorization Machines can be successfully deployed in a large-scale advertising system, and that they significantly improve business metrics, in particular for small advertisers. One of the strengths of FFM is indeed their ability to generalize better than logistic regression through their use of a latent representation.

Further, we proposed two ways to make training FFM faster: distributed learning and warm-start. The code for the experiments in Sections 3 and 4 is available online. As future work, we plan to try our warm-start method on other non-convex problems that are difficult to regularize, such as deep neural networks.
6. REFERENCES
[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.
[2] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11(Apr):1471–1490, 2010.
[3] O. Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1097–1105. ACM, 2014.
[4] O. Chapelle. Offline evaluation of response prediction in online advertising auctions. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 919–922. International World Wide Web Conferences Steering Committee, 2015.
[5] O. Chapelle, E. Manavoglu, and R. Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2014.
[6] B.-Y. Chu, C.-H. Ho, C.-H. Tsai, C.-Y. Lin, and C.-J. Lin. Warm start for parameter selection of linear classifiers. In KDD, 2015.
[7] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
[8] D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In KDD, 2000.
[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.
[10] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.
[11] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 13–20, 2010.
[12] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1–9. ACM, 2014.
[13] P. Hummel and R. P. McAfee. Loss functions for predicted click-through rates in auctions for online advertising. Preprint, Google Inc, 2013.
[14] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin. Field-aware factorization machines for CTR prediction. In RecSys, 2016.
[15] K.-c. Lee, B. Orten, A. Dasdan, and W. Li. Estimating conversion rate in display advertising from past performance data. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 768–776. ACM, 2012.
[16] D. Lefortier, A. Truchet, and M. de Rijke. Sources of variability in large-scale machine learning systems. In Machine Learning Systems (NIPS 2015 Workshop), 2015.
[17] M. Li, D. G. Andersen, A. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pages 19–27, Cambridge, MA, USA, 2014. MIT Press.
[18] M. Li, Z. Liu, A. J. Smola, and Y.-X. Wang. DiFacto: Distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 377–386. ACM, 2016.
[19] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 456–464, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[20] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1222–1230. ACM, 2013.
[21] R. J. Oentaryo, E.-P. Lim, J.-W. Low, D. Lo, and M. Finegold. Predicting response in mobile advertising with hierarchical importance-aware factorization machine. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 123–132. ACM, 2014.
[22] S. Rendle. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57, 2012.
[23] R. Rosales, H. Cheng, and E. Manavoglu. Post-click conversion modeling and analysis for non-guaranteed delivery display advertising. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 293–302. ACM, 2012.
[24] G. Song. Criteo display advertising challenge. Available at , 2014.
[25] C.-H. Tsai, C.-Y. Lin, and C.-J. Lin. Incremental and decremental training for linear classification. In KDD, 2014.
[26] F. Vasile, D. Lefortier, and O. Chapelle. Cost-sensitive learning for utility optimization in online advertising auctions. arXiv preprint arXiv:1603.03713, 2016.
[27] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
[28] W. Zhang, T. Du, and J. Wang. Deep learning over multi-field categorical data. In European Conference on Information Retrieval, pages 45–57. Springer, 2016.
[29] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, 2010.