Targeted display advertising: the case of preferential attachment
Saurav Manchanda (University of Minnesota, Twin Cities), [email protected]
Pranjul Yadav (Criteo AI Lab, Palo Alto), [email protected]
Khoa Doan (Virginia Tech, Arlington), [email protected]
S. Sathiya Keerthi (Criteo AI Lab, Palo Alto), [email protected]
Abstract—An average adult is exposed to hundreds of digital advertisements daily, making the digital advertisement industry a classic example of a big-data-driven platform. As such, the ad-tech industry relies on historical engagement logs (clicks or purchases) to identify potentially interested users for the advertisement campaign of a partner (a seller who wants to target users for its products). The number of advertisements that are shown for a partner, and hence the historical campaign data available for a partner, depends upon the budget constraints of the partner. Thus, enough data can be collected for the high-budget partners to make accurate predictions, while this is not the case with the low-budget partners. This skewed distribution of the data leads to preferential attachment of the targeted display advertising platforms towards the high-budget partners. In this paper, we develop domain-adaptation approaches to address the challenge of predicting interested users for the partners with insufficient data, i.e., the tail partners. Specifically, we develop simple yet effective approaches that leverage the similarity among the partners to transfer information from the partners with sufficient data to cold-start partners, i.e., partners without any campaign data. Our approaches readily adapt to the new campaign data by incremental fine-tuning, and hence work at varying points of a campaign, and not just at cold-start. We present an experimental analysis on the historical logs of a major display advertising platform. Specifically, we evaluate our approaches across partners, at varying points of their campaigns. Experimental results show that the proposed approaches outperform the other domain-adaptation approaches at different time points of the campaigns.

Index Terms—digital advertising, ad-click prediction, domain-adaptation, transfer-learning, cold-start
I. INTRODUCTION
The digital advertising industry aims to identify potentially interested users to show the product-related advertisements for a partner (a seller who wants to target users for its products). It has grown to be one of the most important forms of advertising as a consequence of the ubiquity of the internet and the increasing popularity of digital platforms. For example, nearly 170 billion U.S. dollars were spent on digital advertising in 2015, and this figure is projected to add up to more than 330 billion U.S. dollars by 2021. With such a growth rate, improving the advertisement experience for the partners and users is a valuable challenge for the digital advertising industry.

Fig. 1: Advertisement distribution of the partners, for a random sample of partners, for a day. Less than one eighth of the partners (50 partners) drive a disproportionately large share of the advertisements.

The advertising platform pays the publisher (the website on which the advertisement is displayed) for each advertisement it displays, but this investment is successful only if the user engages with the displayed advertisement. Thus, identifying the users who are most likely to engage with a targeted advertisement is fundamental to the digital advertisement industry. The industry relies upon historical engagement logs to identify the users likely to engage with a given advertisement. Thus, it is of paramount importance to have sufficient and credible data to build accurate models for user engagement prediction.

The number of advertisements that can be shown for each partner, and hence the amount of training data available for each partner, depends upon its budget. This results in a skewed distribution of the data across partners. Figure 1 shows the distribution of advertisements in one day, for a random sample of partners.
For the partners with sufficient budget, enough data can be collected to make accurate predictions, while this is not the case with the low-budget partners. As such, the digital advertisement industry suffers from preferential attachment towards the high-budget partners, and hence, unfairness towards the low-budget partners. Although the individual budget of a low-budget partner is very small compared to that of the high-budget partners, the number of low-budget partners is considerably larger. Thus, the aggregated budget of these low-budget partners forms a considerable chunk of the business opportunity. Hence, building approaches that are robust to preferential attachment is an important challenge for the ad-tech industry, from both an ethical and a business perspective.

To address this challenge, we developed domain-adaptation approaches that leverage the similarity among the partners to transfer information from the head partners to the similar tail partners. In domain adaptation, we study two different (but related) domains, i.e., source and target. The domain adaptation task then consists of the transfer of knowledge from the source domain to the target domain. Specifically, our target domain is the non-campaign data, and the features in our target domain are the categories (such as electronics, apparel, etc.) in which a partner operates. Our source domain is the campaign data, in addition to the non-campaign data. The features corresponding to the campaign data are specifically engineered for the task of targeted advertising using domain knowledge, and these task-specific features are derived from the user-advertisement engagement counts from the campaigns (for example, how many times the user has engaged with the advertisement of a partner in the last month). The prior domain adaptation approaches [1, 2] focus on learning common representations that are discriminative as well as invariant to the domains.
However, among the infinitely many such possible representations, the one that is closer to the source domain features should be preferred. This is because any machine learning algorithm depends upon the representation of the input data, and the source domain features are engineered for the task of targeted display advertising, and thus are best suited for the task. In this direction, we present two approaches that, instead of learning common representations, directly impute the source domain features using the target domain features. Our approaches assume that the partners with similar target domain representations should have similar source domain representations as well. The two proposed approaches, Interpretable Anchored Domain Adaptation (IADA) and Latent Anchored Domain Adaptation (LADA), differ in the manner that IADA directly imputes the observed features in the source domain, while LADA imputes the features in a latent space, and hence is robust to the curse of dimensionality.

We present an experimental analysis on the historical logs of a major display advertising platform. Specifically, we evaluate our approaches across partners, at different points of their campaigns, i.e., we experiment with varying amounts of available data for the partners. Experimental results show that the proposed approaches outperform the baseline approaches at all points of the campaign, with LADA performing the best. Additionally, we perform an extensive analysis of the proposed approaches on the special case of partner cold-start, i.e., when no historical data is available for a partner, and show the advantage of the proposed approaches over the competing approaches; for example, on the Mean Average Precision metric, both LADA and IADA outperform the non-domain-adaptive baseline at cold-start.

II. RELATED WORK
The prior research that is most directly related to the work presented in this paper broadly spans the areas of domain adaptation, transfer learning, engagement prediction, and cold-start approaches. In this section, we review these areas.
A. Domain Adaptation and Transfer Learning
Domain adaptation and transfer learning are concerned with accounting for changes in data distributions between the training phase (source domain) and the test phase (target domain). Both terms have been used inconsistently and interchangeably within the machine learning literature. For this paper, we follow the definitions used in [3], which defines domain adaptation as an instance of transfer learning where the prediction task across the domains is the same, but only the distribution of the data in the two domains differs. Transfer learning, on the other hand, refers to general-purpose knowledge transfer across domains and tasks. Domain adaptation methods can be broadly categorized into supervised, semi-supervised, and unsupervised, in consideration of the labeled data of the target domain [4]. Domain adaptation has been made popular through its applications in computer vision [1, 5, 6, 7, 8, 9, 10, 11] and natural language processing [1, 12]. Domain adaptation and transfer learning have been applied in the field of engagement prediction as well. One of the earliest works [13] derives a transfer learning procedure that produces resampling weights which match the pool of all examples to the target distribution of any given task. Su et al. [14] improve click prediction by transferring information from a data-rich product to a data-scarce target product. Dalessandro et al. [15] use the web browsing data of the users as the source domain to predict users' engagement; they use the prior learned from the source domain as a regularizer for the logistic regression on the target domain. Perlich et al. [16] use the source domain to obtain task-aware representations of high-dimensional web browsing behavior and use the learned representations to make predictions on the label-limited target domain. Aggarwal et al. use the re-targeting platform as the source domain and leverage the large amount of re-targeting data to cold-start the partners on the prospecting platform, which is their target domain. Although our work addresses the same challenge as [15, 16], we go a step further and use the data from the frequently advertised partners to improve the performance on the target domain.
B. Engagement Prediction
Engagement prediction, such as Click-Through Rate (CTR) prediction, is the core challenge in the targeted digital advertisement industry [18, 19]. Various linear and non-linear approaches have been proposed for CTR prediction. Popular approaches such as logistic regression [18, 20], log-linear models [21] and decision trees [22] have shown decent performance in practice. Non-linear approaches to model the feature interactions include Factorization Machines (FMs) and deep learning approaches. Examples of factorization-machine-based approaches include [23, 24, 25]. Examples of deep learning approaches include the ones that model higher-order interaction terms [26, 27, 28, 29, 30], sequential models [30, 31, 32], and multimedia-content-based models [33, 34]. The multimedia-based models exploit newer sources of structured data like images and text in addition to traditional features. Another significant area of research has been the use of keyword queries for optimization of the CTR in search and retrieval settings [35, 36, 37].

TABLE I: Notation used throughout the paper.

Symbol | Description
X_S | Input space in the source domain.
X_T | Input space in the target domain.
Y | Set of labels (Y = {Purchase, No Purchase}).
y | A sample drawn from Y (y ∈ Y).
g | Domain transfer function that maps target domain features into the discriminative new space (the target domain or a domain-invariant space, depending upon the model).
h | Domain transfer function that maps source domain features into the latent space.
f | Classification function that takes g(x) as input and predicts if the user will engage with the advertisement.
e | Classification function that takes h(x) as input and predicts if the user will engage with the advertisement.
φ | Parameters of the function g.
θ | Parameters of the function f.
ψ | Parameters of the function h.
υ | Parameters of the function e.
L^a_M | Loss function for estimating the function(s) a for the model M.
α | Hyperparameter controlling the contribution of different loss functions in a multi-objective model.
C. Cold-start
Cold-start is the problem of making predictions in the absence of data from the entity of interest. A prominent amount of work has been done in recommender systems and information retrieval to address the cold-start problem, examples of which include [38, 39, 40, 41, 42, 43, 44]. As discussed in Section II-A, some approaches [15, 16, 17] have been proposed for cold-start problems in targeted digital advertising too. However, compared to these approaches, we go a step further and use the data from the frequently advertised partners to improve the performance on the target domain.

III. PROBLEM STATEMENT AND NOTATION
We address the broad challenge of predicting whether or not a user will purchase the product that is advertised to him/her. This is essentially a binary classification task, and we seek to estimate a mapping advertisement → {Purchase, No Purchase}. Each advertisement is part of some partner's advertisement campaign. The features for each advertisement are derived from the partners' prior campaign data and are specifically engineered for the task of targeted advertising. Specifically, these engineered features are derived from the user-partner engagement counts (for example, the number of times the user has engaged with the partner in the last month). As such, there is not enough data to confidently estimate these engineered features for the tail partners. The specific case of cold-start corresponds to zero data being available to estimate these features. Hence, a classifier estimated using these engineered features will be biased towards the head partners.

We seek to improve the prediction performance for the tail partners, at different time points of their campaigns. To achieve this, we present domain-adaptation approaches that leverage the similarity among the partners to transfer information from the head partners to the similar tail partners. Formally, the domain adaptation task consists of the transfer of knowledge from the source domain to the target domain. In our particular setting, the target domain is the non-campaign data that corresponds to the categories in which the partners operate (such as electronics, apparel, etc.). The source domain consists of the campaign data, in addition to the category data. As discussed earlier, the features for the campaign data are engineered specifically for the task and hence are best suited to predict the users' engagement. We denote the input space corresponding to the source domain as X_S, and that corresponding to the target domain as X_T. For simplicity, we assume that X_S and X_T share the same feature space, but the probability distributions differ, i.e., P(X_S) ≠ P(X_T).
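This distribution mismatch can be made concrete: a target-domain sample is simply a source-domain sample with the campaign-derived features zeroed out. A minimal numpy sketch, where the specific feature layout (category features first, campaign counts last) is an assumption for illustration only:

```python
import numpy as np

# Assumed layout for illustration: the first NUM_CATEGORY columns are
# category features (available for every partner); the remaining columns
# are campaign-derived engagement counts (available only with campaign data).
NUM_CATEGORY = 4

def to_target_domain(x_source):
    """Map a source-domain sample into the target domain by zeroing
    the campaign-derived features; the input is left unmodified."""
    x_target = np.array(x_source, dtype=float, copy=True)
    x_target[..., NUM_CATEGORY:] = 0.0
    return x_target

# One advertisement: 4 category indicators followed by 3 engagement counts.
x_s = np.array([1.0, 0.0, 1.0, 0.0, 5.0, 2.0, 7.0])
x_t = to_target_domain(x_s)  # category part kept, campaign part zeroed
```

Both views live in the same feature space, which is exactly the simplifying assumption made above.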
This simply corresponds to having zeros in place of the campaign data features in the target domain (since the only difference between the source and the target domain is the lack of campaign data features in the target domain). The approaches presented in this paper estimate a transformation function g(·) with parameters φ that takes as input a representation x of an advertisement (x can be sampled from either X_S or X_T), and outputs a representation g(x). The transformation function g(·) is responsible for performing the domain adaptation, by encoding the source domain information in the target domain. The representation g(x) is then given as input to another function f(·) with parameters θ, which outputs the probability P(y | x), where y ∈ {Purchase, No Purchase}.

The approaches presented in this paper first estimate a base model (g(·) and f(·)) specifically to cold-start the partners. The base model is then incrementally updated as the campaign data comes in. Thus, how well the fine-tuned model works during the later stages of a campaign depends upon how well we estimate the base model. Specifically, the base model is a neural network, and we incrementally update it by fine-tuning with the new data. Given the amount of data generated by digital advertisement platforms, having models that can be easily fine-tuned is of utmost importance. Table I provides a reference for the notation used throughout the paper.

IV. BACKGROUND
The prior work in the area of domain adaptation assumes that the predictors trained on the source domain are also good predictors on the target domain when the underlying distributions of the source and target domains are similar. Thus, these approaches focus on learning representations g(x) that are discriminative as well as invariant to the domains. Specifically, these approaches minimize the following loss:

L_DA(θ, φ | x, y) = L^f_DA(θ, φ | x, y) + α L^g_DA(φ | x),

where L^f_DA is the discrimination loss, such as cross-entropy, and measures how well the estimated representation g(x) is able to perform the classification in the source domain (and, as follows from the assumption of these approaches, in the target domain as well), and L^g_DA is the loss with respect to the domain invariance of the learned representation g(x). Depending upon whether the target-domain labels are available or not, either unsupervised or supervised domain adaptation approaches can be used. Unsupervised domain adaptation approaches estimate domain-invariant representations irrespective of the target labels, while supervised domain adaptation approaches enforce domain invariance per class.

A. Unsupervised Domain Adaptation
Unsupervised domain adaptation approaches estimate domain-invariant representations irrespective of the target labels. The unsupervised approaches model L^g_DA as:

L^g_DA(φ | x) = r(p_{x∼X_S}(g(x)), p_{x∼X_T}(g(x))),

where p_{x∼X_S}(g(x)) is the probability distribution of g(x) when x is sampled from X_S, p_{x∼X_T}(g(x)) is the probability distribution of g(x) when x is sampled from X_T, and r is a distance metric measuring how different the two distributions are. One of the popular approaches proposed for unsupervised domain adaptation is Domain-Adversarial Neural Networks (DANN) [1]. DANN estimates L^g_DA by maximizing the discriminator loss of a binary classifier that separates the two domains using the representation g(x).

B. Supervised Domain Adaptation
When the target domain labels are available, supervised domain adaptation (SDA) approaches [2] can estimate task-aware representations. Specifically, as compared to the unsupervised approaches, these approaches try to estimate representations which are domain-invariant per class. Specifically, the domain-invariant loss for supervised domain adaptation approaches can be written as

L^g_DA(φ | x) = Σ_{i=1}^{K} r(p_{x∼X_S | y=i}(g(x)), p_{x∼X_T | y=i}(g(x))),

where p_{x∼X_S | y=i}(g(x)) is the probability distribution of g(x) when x is sampled from X_S and the corresponding label of x is i, p_{x∼X_T | y=i}(g(x)) is the probability distribution of g(x) when x is sampled from X_T and the corresponding label of x is i, and r is a distance metric measuring how different the two distributions are.
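The distance r is left abstract in both losses above. One standard concrete choice, shown here purely as an illustrative assumption (the text itself does not commit to a specific r, and DANN realizes it adversarially), is a sample-based estimate of the squared maximum mean discrepancy (MMD) between the two sets of representations:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF (Gaussian) kernel matrix between the rows of a and b.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(g_src, g_tgt, gamma=1.0):
    """Biased estimate of the squared MMD between g(x) samples drawn from
    the source and target domains; it is zero when the samples coincide."""
    return (rbf_kernel(g_src, g_src, gamma).mean()
            + rbf_kernel(g_tgt, g_tgt, gamma).mean()
            - 2.0 * rbf_kernel(g_src, g_tgt, gamma).mean())
```

In the supervised variant, the same quantity would be computed separately for each of the K classes and summed.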
Fig. 2: Workflow of data collection and the proposed approaches.

V. PROPOSED APPROACHES
The prior domain adaptation approaches [1, 2] focus on learning common representations that are discriminative as well as invariant to the domains. However, among the infinitely many such possible representations, the one that is closer to the source domain features is preferred. This is because any machine learning algorithm depends upon the representation of the input data, and the source domain features are engineered for the task of targeted display advertising, and thus are best suited for the task at hand. One way to estimate a common representation that is closer to the source domain is to directly impute the source domain features using the target domain features. In other words, we assume that the partners with similar target domain representations have similar source domain representations as well. Since we have both the source and target domain features for the head partners, we learn the transformation function g(·) using the target domain features of the head partners as input and their source domain features as output. Once g(·) is learned, it can simply be applied to the tail partners to predict their source domain representation. To this extent, we propose two approaches, Interpretable Anchored Domain Adaptation (IADA) and Latent Anchored Domain Adaptation (LADA), that model the function g(·) to impute the features in the source domain, with the target domain features as input. The term anchored refers to the manner in which we aim to estimate the representations, i.e., they are anchored to the source domain representation. The first approach, IADA, directly estimates the observed features in the source domain, and hence estimates an interpretable representation in the source domain. The second approach, LADA, estimates a latent representation in the source domain, and hence is robust to the curse of dimensionality. Figure 2 shows the workflow of the data collection process and how the proposed approaches apply to the gathered data.

A. Interpretable Anchored Domain Adaptation (IADA)
IADA is a two-step algorithm: the first step performs the actual transfer of information and predicts the source domain features from the target domain features. The second step predicts whether the user engages with the given advertisement, using the predicted source domain features. Specifically, the first step learns the mapping function g(·), which transforms the target domain features to the source domain features. The second step then takes the output of the function g(·) as input and performs our core prediction task of whether the user will engage with the shown advertisement or not; i.e., the second step models the function f(·).
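The two steps just described can be written as a single weighted objective, sketched below in numpy with linear stand-ins for the two-hidden-layer MLPs of the paper; the weights, toy shapes, and data are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def iada_loss(Wg, wf, x_t, x_s, y, alpha=0.5):
    """Joint IADA objective: alpha * classification loss of f(g(x_T))
    plus (1 - alpha) * imputation (MSE) loss between g(x_T) and x_S.
    A linear g (matrix Wg) and f (vector wf) stand in for the MLPs."""
    eps = 1e-9
    g_out = x_t @ Wg                 # step 1: impute source-domain features
    p = sigmoid(g_out @ wf)          # step 2: engagement probability f(g(x))
    bce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean()
    mse = ((g_out - x_s) ** 2).mean()
    return alpha * bce + (1 - alpha) * mse

# Toy shapes: 8 ads, 4 target-domain features imputing 6 source-domain features.
rng = np.random.default_rng(0)
x_t = rng.normal(size=(8, 4))
x_s = rng.normal(size=(8, 6))
y = rng.integers(0, 2, size=8).astype(float)
Wg, wf = rng.normal(size=(4, 6)), rng.normal(size=6)
loss = iada_loss(Wg, wf, x_t, x_s, y, alpha=0.5)
```

At fine-tuning time, once real source-domain features become available, only the classification term is kept and x_S is fed to g, matching the incremental-update recipe of the paper.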
1) Step 1:
The loss function for the first step is given by

L^g_IADA(φ | x ∼ X_T, x ∼ X_S).

Specifically, L^g_IADA is the loss with respect to predicting the source domain features x ∼ X_S using only the target domain features x ∼ X_T. L^g_IADA can be a regression-related loss, such as mean squared error (MSE).
2) Step 2:
The loss function for the second step is given by

L^f_IADA(θ, φ | g(x ∼ X_T), y),

where L^f_IADA can be any classification loss, such as cross-entropy. The functions can be jointly minimized as a linear combination of the two loss functions, i.e.,

L^{f,g}_IADA = α L^f_IADA(θ, φ | g(x ∼ X_T), y) + (1 − α) L^g_IADA(φ | x ∼ X_T, x ∼ X_S),

where α is a hyperparameter controlling the contribution of the individual loss components.

As we start getting the source domain data for fine-tuning, there is no need to model the L^g_IADA loss. Thus, to incrementally update the IADA model as we get the source domain data, only two modifications need to be made: (i) x ∼ X_S is given as an input to g(·), the output of which is fed to f(·); and (ii) we only need to minimize L^f_IADA(θ, φ | g(x ∼ X_S), y). In this paper, we implement both the g(·) and f(·) functions as multilayer perceptrons, with two hidden layers each.

B. Latent Anchored Domain Adaptation (LADA)
As compared to IADA, LADA also imputes the source domain features, but in a latent space. Thus, training LADA involves one step in addition to IADA, i.e., estimating the latent representations of the head partners, to further act as supervision for learning the transformation function g(·). In particular, LADA is a three-step algorithm: the first step obtains a latent representation in the source domain for the head partners. The second step performs the information transfer and learns the mapping g(·), which maps the target domain features to the latent source domain representation. The third step then takes the output of the function g(·) as input and performs our primary prediction task of whether the user will engage with the shown advertisement or not.
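The three steps can be sketched as three losses, again with linear stand-ins for the paper's networks; all weight matrices, toy shapes, and data below are illustrative assumptions:

```python
import numpy as np

def _bce(p, y, eps=1e-9):
    # Binary cross-entropy between predicted probabilities p and labels y.
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean()

def lada_losses(Wh, we, Wg, wf, x_s, x_t, y):
    """Sketch of the three LADA objectives.
    Step 1: h and e are fit on source features with a classification loss;
            the hidden activation h(x_S) is the latent source representation.
    Step 2: g regresses the latent code h(x_S) from the target features x_T.
    Step 3: f classifies engagement from g(x_T)."""
    h_out = np.tanh(x_s @ Wh)                          # latent code h(x_S)
    step1 = _bce(1 / (1 + np.exp(-(h_out @ we))), y)   # e(h(x_S)) vs labels
    g_out = x_t @ Wg                                   # imputed latent g(x_T)
    step2 = ((g_out - h_out) ** 2).mean()              # anchored to h(x_S)
    step3 = _bce(1 / (1 + np.exp(-(g_out @ wf))), y)   # f(g(x_T)) vs labels
    return step1, step2, step3

rng = np.random.default_rng(1)
x_s, x_t = rng.normal(size=(8, 6)), rng.normal(size=(8, 4))
y = rng.integers(0, 2, size=8).astype(float)
Wh, we = rng.normal(size=(6, 5)), rng.normal(size=5)
Wg, wf = rng.normal(size=(4, 5)), rng.normal(size=5)
s1, s2, s3 = lada_losses(Wh, we, Wg, wf, x_s, x_t, y)
```

As with IADA, the classification and anchoring losses of steps 3 and 2 are combined with a weight α once step 1 has been optimized.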
1) Step 1:
The objective here is to obtain a latent representation in the source domain for the partners for which we have data in the source domain, i.e., the head partners. To do so, we learn a mapping function h(·) which takes as input the source domain features (of the head partners) and outputs a representation h(x), where x ∼ X_S are the input source domain features. The representation could be modeled as the output of an unsupervised approach such as an auto-encoder. However, to encode the task-specific information in the latent representation h(x), we also leverage the labeled data of the head partners. In this direction, h(x) is further passed to a binary classifier e(·), which predicts whether the user will engage with the advertisement at hand or not. Specifically, the loss function corresponding to the first step is given by

L^{h,e}_LADA(ψ, υ | x ∼ X_S, y),

where ψ are the parameters of the function h, and υ are the parameters of the function e. L^{h,e}_LADA can be any classification loss, such as cross-entropy. One example of how to model the first step is a feed-forward network with at least one hidden layer, where any of the hidden layers corresponds to the latent representation we seek.
2) Step 2:
The loss function for the second step is given by

L^g_LADA(φ | x ∼ X_T, h(x ∼ X_S)).

Specifically, L^g_LADA is the loss with respect to predicting the source domain features in the latent space (h(x ∼ X_S)) using only the target domain features x ∼ X_T; thus, it constrains the learned features g(x ∼ X_T) to be anchored to the latent source distribution. L^g_LADA can be a regression-related loss, such as mean squared error (MSE).
3) Step 3:
The loss function for the third step is given by

L^f_LADA(θ, φ | g(x ∼ X_T), y),

where L^f_LADA can be any classification loss, such as cross-entropy. Step 1 is performed first to estimate the latent source domain representations of the head partners. Similar to IADA, the functions f(·) and g(·) can then be jointly minimized as a linear combination of the two loss functions, i.e.,

L^{f,g}_LADA = α L^f_LADA(θ, φ | g(x ∼ X_T), y) + (1 − α) L^g_LADA(φ | x ∼ X_T, h(x ∼ X_S)),

where α is a hyperparameter controlling the contribution of the individual loss components.

Similar to the IADA model, to incrementally update the LADA model, only two modifications need to be made: (i) x ∼ X_S is given as an input to g(·), the output of which is fed to f(·); and (ii) we only need to minimize L^f_LADA(θ, φ | g(x ∼ X_S), y). The function h(·) plays no role during fine-tuning. Like IADA, we implement both the g(·) and f(·) functions as multilayer perceptrons, with two hidden layers each. The functions h(·) and e(·) are jointly implemented as a single network: a multilayer perceptron with just one hidden layer, whose hidden layer gives the latent representation h(x) that we use for supervision in the later steps.

VI. EXPERIMENTAL METHODOLOGY
A. Dataset and Evaluation Methodology
We evaluate our methods on how well they can predict the users that are likely to engage with an advertisement of a partner, at all points during the campaign of that partner. We leverage the historical engagement logs of a major digital advertiser to estimate and evaluate our methods. As discussed before, our target domain is the non-campaign data, and the features in our target domain are the categories (such as electronics, apparel, etc.) in which a partner operates. Our source domain is the campaign data, in addition to the non-campaign data. The features corresponding to the campaign data are specifically engineered for the task of targeted advertising using domain knowledge. We generated our training, test and validation datasets from the historical engagement data as follows:

• We sampled partners and used their data to estimate and evaluate our methods. The data distribution for these partners is shown in Figure 1. We assume that the head-partner segment, i.e., the partners for which the majority of advertisements are displayed, has reached a steady state. Thus, from the sampled partners, we filtered the partners accounting for the bulk of the total advertisements displayed, and these partners constituted our head-partner segment. The remaining partners constituted our tail segment. From the tail partners, we used the data of one subset of partners as the validation set to choose the hyperparameters, and used the data of the remaining partners to evaluate our methods.

• We used a day of data from May 2019 for training (the training day) and used the following day of data for evaluation (the evaluation day). We first used all the data of the head partners from the training day to estimate our base models. Then, to incrementally update the base models, we randomly took increasing fractions of the data from each of the partners in the validation and test sets, only from the training day, and used this data to fine-tune the base models.

B. Performance Assessment Metrics
Engagement prediction is inherently a classification task. However, in the digital advertisement industry, because of the budget constraints, we are mainly interested in the relatively few users that we consider most relevant for an advertisement. Thus, it makes sense to also treat engagement prediction as a ranking task. Consequently, we consider classification as well as ranking metrics to evaluate our models. Specifically, we evaluate our approaches on the following three metrics:

• Area Under Curve - Receiver Operating Characteristics (AUC-ROC): AUC-ROC gives the probability that a randomly chosen positive example is deemed to have a higher probability of being positive than a randomly chosen negative example. It is one of the popular metrics for binary classification.

• Normalized Discounted Cumulative Gain (NDCG) [45]: NDCG is a popular measure of ranking quality that measures the gain (usefulness) of a prediction based on its position in the ranked result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks. Instead of looking at all the results in the ranked list, NDCG is usually calculated only up to a rank k, denoted NDCG@k. This makes sense as we are mainly interested in the relatively few users that we consider most relevant for an advertisement, because of the budget constraints. In this paper, we report the results for small values of k (e.g., k = 1). Specifically, NDCG is obtained by normalizing the Discounted Cumulative Gain (DCG) of the ranking produced by the model under evaluation with respect to the ideal ranking.
DCG @ k is given by DCG @ k = k (cid:88) i =1 relevance i log ( i + 1) where relevance i = 1 if the prediction at i th rank is relevant,i.e., the predicted user at the i th rank engages with theuser, and relevance i = 0 otherwise. The logarithm in thedenominator corresponds to discounting the gains at thelower ranks. • Average Precision ( AP ) [46]: AP is also a very popular per-formance measure in information retrieval. AP summarizesthe precision-recall curve as a single number, by computingthe average value of precision p ( r ) as the recall r changesfrom r = 0 to r = 1 . This corresponds to the area under theprecision-recall curve which is given by AP = (cid:82) p ( r ) dr .As done in practice, we replace this integral with a finite sumover every position in the ranked sequence of predictions,i.e., we calculate AP as AP = n (cid:88) i =1 p ( i )∆ r ( i ) , where i is the rank of a prediction in the ranked list, n isthe length of the ranked list (total number of predictions), p ( i ) is the precision at cut-off i in the list, and ∆ r ( i ) is thechange in recall from position k − to k .We report the above metrics in both the micro and macrosettings. In the macro setting, we calculate the metric inde-pendently for each partner and then takes the average (hencetreating all partners equally), whereas, in the micro setting, weaggregate the predictions for all the partners, and compute themetric on these combined predictions. Our primary setting ofinterest is the macro setting, as the micro setting can be biasedtowards the partners with a relatively larger volume of data. C. Baselines
We use two different baselines to evaluate our approaches, as described below:

• No transfer (NT): For the NT baseline, we directly make predictions using the target-domain features, i.e., there is no transfer from the source domain. This baseline resembles methods such as [15, 16], which do not use the data from the frequently-advertised partners to improve the performance on the target domain. For a fair comparison, we implement NT as a multilayer perceptron with two hidden layers, in the same manner as IADA and LADA.

• Supervised Domain Adaptation (SDA): Traditional SDA approaches assume that labeled target-domain data is scarcely available. As such, they do not explicitly model the classification loss in the target domain. This is not the case for us, because we have labeled data in the target domain as well. Thus, it would be unfair to directly compare prior SDA approaches such as SDA-CCSA to IADA and LADA, which also leverage the availability of the target-domain data. Therefore, we construct an SDA baseline that also leverages the target-domain information. Specifically, our SDA baseline minimizes the following loss:

L_{SDA}^{f,g} = \alpha \left( L_{SDA}^{f}(\theta, \phi \mid g(x \sim X_T), y) + L_{SDA}^{f}(\theta, \phi \mid g(x \sim X_S), y) \right) + (1 - \alpha) \, L_{SDA}^{g}(\phi \mid x \sim X_T, x \sim X_S),

where \alpha is a hyperparameter controlling the contribution of the individual loss components, and L_{SDA}^{g}(\phi \mid x \sim X_T, x \sim X_S) is implemented as the MSE loss between the representations g(x \sim X_T) and g(x \sim X_S) of the same advertisement.

D. Parameter selection
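The hyperparameter selection described in this section, a grid search over λ scored independently for each metric by the macro validation performance, can be sketched as follows. The helper names (`train_model`, `macro_score`) and the grid values are our placeholders, not the paper's code:

```python
def grid_search_lambda(train_model, macro_score, val_partners, metrics,
                       grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """For each metric, return the lambda with the best macro validation score.

    train_model(lam) is assumed to fit a model with loss weighting lam;
    macro_score(model, partners, metric) is assumed to average the metric
    over the validation partners (the macro setting).
    """
    best = {}
    for metric in metrics:
        scored = []
        for lam in grid:
            model = train_model(lam)  # fit with this loss weighting
            scored.append((macro_score(model, val_partners, metric), lam))
        best[metric] = max(scored)[1]  # lambda attaining the highest score
    return best
```

Scoring each metric independently reproduces the behavior described below, where the chosen λ may differ between the AUC-ROC/AP metrics and the NDCG metric.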
The train/test/validation splits were created as per the process described in Section VI-A. For all the neural networks (IADA, LADA, and the baselines), the number of nodes in the hidden dimension is set to . For regularization, we used a dropout [47] of . between all layers, except between the penultimate and output layers. For optimization, we used the ADAM [48] optimizer with the initial learning rate set to 0.01. The tunable hyperparameter for all the approaches is λ, which controls the contribution of the various losses, depending upon the model. We tune λ for all the approaches for the best performance in the macro setting. However, we independently choose λ for each metric. The λ was tuned using grid search over the set { . . . , 1.0 }. The chosen values of λ are: (i) for IADA, λ = 0. for the AUC-ROC and AP metrics, and λ = 0. for the NDCG metric; (ii) for LADA, λ = 0. for the AUC-ROC and AP metrics, and λ = 0. for the NDCG metric; (iii) for SDA, λ = 0. for all three metrics. To incrementally update the models, we use the optimizer from the same state as it was in when the training of the base model finished.

VII. RESULTS AND DISCUSSION
In this section, we illustrate the performance of IADA and LADA and compare them with the baselines. First, we discuss the performance of the various methods at different points of a campaign in Section VII-A. Then, we extensively analyze the particular case of cold-start, i.e., the start of the campaign, in Section VII-B.
A. Journey from the tail to head
Figures 3 and 4 show the performance of all the approaches on the AUC-ROC, NDCG, and AP metrics, for the macro and micro settings, respectively. The x-axis corresponds to the fraction of the data used to incrementally update the models, and thus shows the different time points of a campaign. As discussed in Section VI-A, we incrementally update the models by fine-tuning them using a fraction of the data of the validation and test partners, but only from the training day.

Fig. 3: Evaluation on the macro setting for the different approaches: (a) AUC-ROC (macro), (b) NDCG (macro), (c) AP (macro). LADA and IADA consistently outperform the other baselines, with LADA performing the best. The reported results are averaged over the runs, with different seed initializations.

Fig. 4: Evaluation on the micro setting for the different approaches: (a) AUC-ROC (micro), (b) NDCG (micro), (c) AP (micro). LADA and IADA consistently outperform the other baselines, with LADA performing the best. The reported results are averaged over the runs, with different seed initializations.

For all the metrics, and in both settings (macro and micro), the performance of all the methods generally improves as the campaign goes on, i.e., as more campaign data becomes available to fine-tune the models. For all the metrics, and in both settings, the proposed approaches LADA and IADA outperform all the baselines at all time points of a campaign. On the AUC-ROC metric, the performance improvement of LADA over the NT baseline at the beginning of the campaign (cold-start) is . and . , with respect to the macro and micro settings, respectively. As the campaign goes on, LADA keeps consistently outperforming the other baselines. When the complete data of a day is used to fine-tune the models (x = 1 in Figures 3 and 4), the performance improvement of LADA over the NT baseline on the AUC-ROC metric is . and . , for the macro and micro settings, respectively.

Similarly, the performance improvement of IADA over the NT baseline on the AUC-ROC metric at the beginning of the campaign (cold-start) is . and . , with respect to the macro and micro settings, respectively. As the campaign goes on, IADA keeps consistently outperforming all the baselines except LADA. When the complete data of a day is used to fine-tune the models, the performance improvement of IADA over the NT baseline on the AUC-ROC metric is . and . , with respect to the macro and micro settings, respectively.

On the NDCG and AP metrics, we see a similar trend as on the AUC-ROC metric, in an even more pronounced manner. On the AP metric, the performance improvement of LADA over the NT baseline at the beginning of the campaign is . and . , with respect to the macro and micro settings, respectively. When the complete data of a day is used to fine-tune the models, the performance improvement of LADA over the NT baseline on the AP metric is . and . , with respect to the macro and micro settings, respectively. Similarly, the performance improvement of IADA over the NT baseline on the AP metric at the beginning of the campaign is . and . , with respect to the macro and micro settings, respectively. When the complete data of a day is used to fine-tune the models, the performance improvement of IADA over the NT baseline on the AP metric is . and . , with respect to the macro and micro settings, respectively.

The higher gain on the AP and NDCG metrics, as compared to the AUC-ROC metric, is a result of the class imbalance. In targeted advertising, the ratio of the advertisements that are engaged with by a user to the total advertisements displayed is usually less than . Consequently, a large change in the number of false positives can lead to only a small change in the false positive rate used in ROC analysis, which explains the small gain on the AUC-ROC metric. On the other hand, precision, and hence the AP metric, is robust to the class-imbalance problem [49].

We see a unique, interesting pattern with the SDA approach. It usually performs better than the NT baseline in the macro setting for cold-starting a partner, but at later points of the campaign, although its performance increases with time, the rate of increase is not on par with that of the NT baseline; thus, NT performs better than SDA as the campaign goes ahead in time. The reason for this lies in the design of the transformation function g(·). For SDA, the function g(·) maps both the source- and target-domain data into a common representation. For cold-start, this leads to a performance improvement, as the common representation encodes information from the source domain. However, even though it is given access to the source-domain data for fine-tuning, g(·) tends to ignore this extra information, since it is trained to deal with the source-domain data by mapping it to a representation that is common to the target domain, thus ignoring some signals from the source domain. The NT baseline does not have this limitation. Although it performs worse at cold-start, owing to the lack of source-domain information, it readily adapts to the source domain, thus performing better than SDA with time. On the other hand, the proposed approaches IADA and LADA not only perform better at cold-start, but also adapt easily to the source-domain data, and hence perform best at all points of the campaign. This is because the function g(·) in IADA and LADA, unlike in SDA, is not trained to ignore the extra source-domain information, and thus is easy to fine-tune.

TABLE II: Results on the macro metrics for the particular case of cold-start (x = 0 in Figure 3).

Model   AUC-ROC          NDCG             AP
        (actual) (gain)  (actual) (gain)  (actual) (gain)
IADA    .688     0.      .361     1.      .103     2.
LADA    .689     0.      .363     2.      .105     3.
NT      .683     0.      .355     0.      .101     0.
SDA     .687     0.      .361     1.      .102     0.

The gain is reported with respect to the NT baseline. The reported results are averaged over the runs, with different seed initializations.

B. The special case of cold-start
Tables II and III show the performance of all the methods on all the metrics for the special case of the partner cold-start. As already discussed in Section VII-A, the proposed approaches LADA and IADA outperform the other baselines.

TABLE III: Results on the micro metrics for the particular case of cold-start (x = 0 in Figure 4).

Model   AUC-ROC          NDCG             AP
        (actual) (gain)  (actual) (gain)  (actual) (gain)
IADA    .779     0.      .585     4.      .197     2.
LADA    .783     0.      .592     5.      .208     8.
NT      .779     0.      .559     0.      .191     0.
SDA     .        −.      .        −.      .        −.

The gain is reported with respect to the NT baseline.

In this section, we take the discussion forward for cold-starting the partners. As outlined in Section VI-B, in the digital advertisement industry, we are mainly interested in the relatively few users that we consider most relevant for an advertisement, because of the budget constraints. Hence, we also present results for the Precision@k metric. Precision@k is the ratio of true positives among the predicted top-k positives. Table IV shows the mean Precision@k metric for the different values of k and the various methods. The metric is reported in the macro setting, i.e., as the average across all the tail partners.

TABLE IV: Precision@k for the particular case of cold-start (x = 0 in Figure 4).

Model/k   50      100     200     500     1000    1500    2000
IADA      0.154   0.137   0.117   0.092   0.074   0.064   0.
LADA      0.156   0.143   0.126   0.094   0.074   0.063   0.
NT        0.151   0.135   0.116   0.091   0.073   0.064   0.
SDA       0.151   0.136   0.118   0.092   0.074   0.064   0.

The hyperparameters are the ones that gave the best performance on the AP metric, for the results shown in Figure 4.

Fig. 5: ROC curves for the various methods evaluated on the tail partners: (a) ROC curve for the tail partners, (b) truncated ROC curve for the tail partners.

We see the same trend as with the other metrics: the proposed approaches LADA and IADA outperform the other baselines, with LADA performing the best. As k increases, we observe that Precision@k decreases for all the approaches. This can again be attributed to the class-imbalance problem: the number of true positives grows at a slower pace than the number of true negatives as k increases, so Precision@k tends to decrease as k increases. Besides, as k increases, the difference in the performance of the different methods decreases. Moreover, for the largest values of k in Table IV, Precision@k converges to the same value for all the methods. The reason for this is that, as k increases, the chances that the true positives of a partner are covered within the predicted top-k positives of any model increase.
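The Precision@k values reported in Table IV can be reproduced with a short sketch. As before, each `relevance` list is a hypothetical 0/1 vector of engagements ordered by predicted score, the macro ("mean") variant simply averages over the partners, and the function names are ours:

```python
def precision_at_k(relevance, k):
    """Ratio of true positives among the predicted top-k positives."""
    return sum(relevance[:k]) / k

def mean_precision_at_k(per_partner_relevance, k):
    """Macro setting: Precision@k averaged across the (tail) partners."""
    lists = list(per_partner_relevance)
    return sum(precision_at_k(r, k) for r in lists) / len(lists)
```

Once k exceeds the number of true positives of a partner, the numerator saturates while the denominator keeps growing, which is exactly the decreasing trend visible in Table IV.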
Thus, as k increases, partners tend to contribute equally to the Precision@k metric for all the models, leading to the same value of Precision@k for all the models when k is large.

In addition, we also perform an ROC analysis of the partner cold-start for all the approaches. Figure 5a shows the ROC plot, i.e., the plot of the true positive rate against the false positive rate. An ideal model would yield a point in the upper-left corner, i.e., at the coordinate (0, 1), of the ROC space. The plot lines of all the methods lie quite close to each other, with IADA and LADA giving marginally better curves than the baselines. As discussed earlier, the marginal advantage of IADA and LADA in the ROC space is a result of the class-imbalance problem: a large change in the number of false positives can lead to only a small change in the false positive rate. However, because of the budget constraints, it can be more useful to focus on the region of the ROC space with a high true positive rate, i.e., on the few top users who are most likely to engage with an advertisement. Thus, we look at the truncated ROC space in Figure 5b, i.e., the top-right corner of the ROC space. The truncated ROC plot shows a very clear pattern, with IADA and LADA outperforming the other baselines, illustrating their advantage over the baselines.

VIII. CONCLUSION AND FUTURE WORK
In this paper, we address the challenge of predicting interested users for the tail partners in the digital advertising industry. Towards that, we developed two domain-adaptation approaches that leverage the similarity among the partners to transfer information from the partners with sufficient data to similar partners with insufficient data. As compared to other domain-adaptation approaches, which estimate common discriminative representations between the source and target domains, our proposed approaches directly impute the source-domain features using the target-domain features. The two proposed approaches, Interpretable Anchored Domain Adaptation (IADA) and Latent Anchored Domain Adaptation (LADA), differ in that IADA directly imputes the observed features in the source domain, while LADA imputes the features in the latent domain, and is hence robust to the curse of dimensionality.

To the best of our knowledge, ours is the first attempt at using domain-adaptation approaches to transfer information from the head partners to the tail partners in the digital advertising industry. We envision that the proposed approaches will serve as a motivation for other applications that also suffer from preferential attachment.
REFERENCES

[1] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096-2030, 2016.
[2] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, "Unified deep supervised domain adaptation and generalization," in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[3] A. Arnold, R. Nallapati, and W. W. Cohen, "A comparative study of methods for transductive transfer learning," in ICDM Workshops, 2007.
[4] M. Wang and W. Deng, "Deep visual domain adaptation: A survey," Neurocomputing, vol. 312, pp. 135-153, 2018.
[5] M. Li, W. Zuo, and D. Zhang, "Deep identity-aware transfer of facial attributes," arXiv preprint arXiv:1610.05586, 2016.
[6] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[7] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," arXiv preprint arXiv:1611.02200, 2016.
[8] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from simulated and unsupervised images through adversarial training," in Proceedings of the IEEE CVPR, 2017, pp. 2107-2116.
[9] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, "Unsupervised pixel-level domain adaptation with generative adversarial networks," in Proceedings of the IEEE CVPR, 2017, pp. 3722-3731.
[10] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in NIPS, 2017, pp. 700-708.
[11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[12] S. Ruder, "Neural transfer learning for natural language processing," Ph.D. dissertation, National University of Ireland, Galway, 2019.
[13] S. Bickel, C. Sawade, and T. Scheffer, "Transfer learning by distribution matching for targeted advertising," in Advances in Neural Information Processing Systems, 2009, pp. 145-152.
[14] Y. Su, Z. Jin, Y. Chen, X. Sun, Y. Yang, F. Qiao, F. Xia, and W. Xu, "Improving click-through rate prediction accuracy in online advertising by transfer learning," in Proceedings of the International Conference on Web Intelligence. ACM, 2017, pp. 1018-1025.
[15] B. Dalessandro, D. Chen, T. Raeder, C. Perlich, M. Han Williams, and F. Provost, "Scalable hands-free transfer learning for online advertising," in Proceedings of the 20th ACM KDD. ACM, 2014, pp. 1573-1582.
[16] C. Perlich, B. Dalessandro, T. Raeder, O. Stitelman, and F. Provost, "Machine learning for targeted display advertising: Transfer learning in action," Machine Learning, vol. 95, no. 1, pp. 103-127, 2014.
[17] K. Aggarwal, P. Yadav, and S. S. Keerthi, "Domain adaptation in display advertising: an application for partner cold-start," in Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 178-186.
[18] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin et al., "Ad click prediction: a view from the trenches," in Proceedings of the 19th ACM KDD, 2013.
[19] M. Richardson, E. Dominowska, and R. Ragno, "Predicting clicks: estimating the click-through rate for new ads," in Proceedings of the 16th International Conference on World Wide Web. ACM, 2007.
[20] O. Chapelle, E. Manavoglu, and R. Rosales, "Simple and scalable response prediction for display advertising," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 4, p. 61, 2015.
[21] D. Agarwal, R. Agrawal, R. Khanna, and N. Kota, "Estimating rates of rare events with multiple hierarchies through scalable log-linear models," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 213-222.
[22] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers et al., "Practical lessons from predicting clicks on ads at facebook," in Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 2014, pp. 1-9.
[23] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin, "Field-aware factorization machines for CTR prediction," in Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016, pp. 43-50.
[24] J. Pan, J. Xu, A. L. Ruiz, W. Zhao, S. Pan, Y. Sun, and Q. Lu, "Field-weighted factorization machines for click-through rate prediction in display advertising," in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1349-1357.
[25] Z. Pan, E. Chen, Q. Liu, T. Xu, H. Ma, and H. Lin, "Sparse factorization machines for click-through rate prediction," in , 2016.
[26] W. Zhang, T. Du, and J. Wang, "Deep learning over multi-field categorical data," in European Conference on Information Retrieval, 2016.
[27] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, "DeepFM: a factorization-machine based neural network for CTR prediction," arXiv preprint arXiv:1703.04247, 2017.
[28] W. Liu, R. Tang, J. Li, J. Yu, H. Guo, X. He, and S. Zhang, "Field-aware probabilistic embedding neural network for CTR prediction," in Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 2018, pp. 412-416.
[29] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir et al., "Wide & deep learning for recommender systems," in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016, pp. 7-10.
[30] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai, "Deep interest network for click-through rate prediction," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1059-1068.
[31] Y. Ni, D. Ou, S. Liu, X. Li, W. Ou, A. Zeng, and L. Si, "Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks," in Proceedings of the 24th ACM SIGKDD. ACM, 2018, pp. 596-605.
[32] Y. Zhang, H. Dai, C. Xu, J. Feng, T. Wang, J. Bian, B. Wang, and T.-Y. Liu, "Sequential click prediction for sponsored search with recurrent neural networks," in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[33] J. Chen, B. Sun, H. Li, H. Lu, and X.-S. Hua, "Deep CTR prediction in display advertising," in Proceedings of the 24th ACM International Conference on Multimedia. ACM, 2016, pp. 811-820.
[34] S. Zhai, K.-h. Chang, R. Zhang, and Z. M. Zhang, "DeepIntent: Learning attentions for online advertising with recurrent neural networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1295-1304.
[35] B. Edizel, A. Mantrach, and X. Bai, "Deep character-level click-through rate prediction for sponsored search," in Proceedings of the 40th International ACM SIGIR. ACM, 2017, pp. 305-314.
[36] M. Regelson and D. Fain, "Predicting click-through rate using keyword clusters," in Proceedings of the Second Workshop on Sponsored Search Auctions, vol. 9623, 2006, pp. 1-6.
[37] H. Cheng and E. Cantú-Paz, "Personalized click prediction in sponsored search," in Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010, pp. 351-360.
[38] M. Sharma and G. Karypis, "Adaptive matrix completion for the users and the items in tail," in The World Wide Web Conference. ACM, 2019.
[39] X. Wang, Z. Peng, S. Wang, S. Y. Philip, W. Fu, and X. Hong, "Cross-domain recommendation for cold-start users via neighborhood based feature mapping," in International Conference on Database Systems for Advanced Applications. Springer, 2018, pp. 158-165.
[40] J. Bobadilla, F. Ortega, A. Hernando, and J. Bernal, "A collaborative filtering approach to mitigate the new user cold start problem," Knowledge-Based Systems, vol. 26, pp. 225-238, 2012.
[41] L. Safoury and A. Salah, "Exploiting user demographic attributes for solving cold-start problem in recommender system," Lecture Notes on Software Engineering, vol. 1, no. 3, pp. 303-307, 2013.
[42] S. Manchanda, M. Sharma, and G. Karypis, "Intent term weighting in e-commerce queries," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, 2019, pp. 2345-2348.
[43] Y. Song, H. Wang, W. Chen, and S. Wang, "Transfer understanding from head queries to tail queries," in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 2014, pp. 1299-1308.
[44] S. Manchanda, M. Sharma, and G. Karypis, "Intent term selection and refinement in e-commerce queries," arXiv preprint arXiv:1908.08564, 2019.
[45] Y. Wang, L. Wang, Y. Li, D. He, W. Chen, and T.-Y. Liu, "A theoretical analysis of NDCG ranking measures," in Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013), vol. 8, 2013, p. 6.
[46] M. Zhu, "Recall, precision and average precision," Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, vol. 2, p. 30, 2004.
[47] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, 2014.
[48] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[49] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in