A Hybrid Bandit Model with Visual Priors for Creative Ranking in Display Advertising
Shiyao Wang
Alibaba Group, Beijing, China
[email protected]

Qi Liu∗
University of Science and Technology of China, Hefei, China
[email protected]

Tiezheng Ge
Alibaba Group, Beijing, China
[email protected]

Defu Lian
University of Science and Technology of China, Hefei, China
[email protected]

Zhiqiang Zhang
Alibaba Group, Beijing, China
[email protected]
ABSTRACT
Creatives play an important role in e-commerce for exhibiting products. Sellers usually create multiple creatives for comprehensive demonstrations, so it is crucial to display the most appealing design to maximize the Click-Through Rate (CTR). For this purpose, modern recommender systems dynamically rank creatives when a product is proposed to a user. However, this task suffers from a more severe cold-start problem than conventional product recommendation, since user-click data is scarcer and creatives potentially change more frequently. In this paper, we propose a hybrid bandit model with visual priors, which first makes predictions with a visual evaluation and then naturally evolves to focus on the specialties through the hybrid bandit model. Our contributions are three-fold: 1) We present a visual-aware ranking model (called VAM) that incorporates a list-wise ranking loss for ordering creatives according to their visual appearance. 2) Regarding the visual evaluation as a prior, a hybrid bandit model (called HBM) is proposed to evolve consistently and make better posterior estimations by taking more observations into consideration in online scenarios. 3) A first large-scale creative dataset,
CreativeRanking, is constructed, which contains over 1.7M creatives of 500k products as well as their real impression and click data. Extensive experiments have been conducted on both our dataset and the public Mushroom dataset, demonstrating the effectiveness of the proposed method.

KEYWORDS
Hybrid Bandit Model, Visual Priors, Creative Ranking
ACM Reference Format:
Shiyao Wang, Qi Liu∗, Tiezheng Ge, Defu Lian, and Zhiqiang Zhang. 2021. A Hybrid Bandit Model with Visual Priors for Creative Ranking in Display Advertising. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia.

∗ This work was done when the author Qi Liu was an intern at Alibaba Group.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
WWW '21, April 19–23, 2021, Ljubljana, Slovenia
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.
https://doi.org/xxx
Figure 1: Some examples of ad creatives. Each row presents creatives that display the product in multiple ways. The corresponding CTRs at the bottom row indicate the large CTR gap among creatives.
ACM, New York, NY, USA, 11 pages. https://doi.org/xxx
INTRODUCTION

Online display advertising is a rapidly growing business and has become an important source of revenue for Internet service providers. Advertisements are delivered to customers through various online channels, e.g., e-commerce platforms. Image ads are the most widely used format since they are compact, intuitive and comprehensible [8]. In Figure 1, each row contains several ad images that describe the same product for comprehensive demonstrations. These images are called creatives. Although the creatives represent the same product, they may have largely different CTRs due to their visual appearance. Thus it is crucial to display the most appealing design to attract potentially interested customers and maximize the Click-Through Rate (CTR).

In order to discover the most appealing creative, all of the candidates should be displayed to customers. Meanwhile, to ensure the overall performance of advertising, we prefer to display the creative that has the highest predicted CTR so far. This procedure can be modeled as a typical multi-armed bandit (MAB) problem: it not only maximizes cumulative rewards (clicks) but also balances the exploration-exploitation (E&E) trade-off within a limited exploration budget so that the overall CTR is maintained. Epsilon-greedy [12], Thompson sampling [25] and Upper Confidence Bound (UCB) approaches [2] are widely used strategies for the bandit problem.
However, creatives potentially change more frequently than products, and most of them cannot get sufficient impression opportunities to obtain reliable CTRs throughout their lifetime.
So conventional bandit models may suffer from the cold-start problem in the initial random-exploration period, which severely hurts online performance. One potential solution to this problem is incorporating visual prior knowledge to facilitate better exploration. [3, 8, 9, 21] consider visual features extracted by deep convolutional networks and make deterministic selections for product recommendation. These deep models are computationally heavy and cannot be flexibly updated online. Besides, a deterministic, greedy strategy may result in suboptimal solutions due to the lack of exploration. Consequently, how to combine expressive visual representations with a flexible bandit model remains a challenging problem.

In this paper, we propose an elegant method that incorporates visual prior knowledge into a bandit model to facilitate better exploration. It is based on a framework called NeuralLinear [24], which applies approximate Bayesian neural networks in a Thompson sampling framework to utilize both the learning ability of neural networks and posterior sampling. Adopting this general framework, we first present a novel convolutional network with a list-wise ranking loss to select the most attractive creative. The ranking loss concentrates on capturing the visual patterns related to attractiveness, and the learned representations are treated as contextual information for the bandit model. Second, in terms of the bandit model, we make two major improvements: 1) Instead of randomly setting prior hyperparameters for candidate arms, we use the weights of the neural network to initialize the bandit parameters, which further enhances performance in the cold-start phase. 2) To fit industrial-scale data, we extend the linear regression model of NeuralLinear to a hybrid model that adopts two sets of parameters, i.e., product-wise ones and creative-specific ones. The two components are adaptively combined during the exploration period.
Last but not least, because creative ranking is a novel problem, it lacks real-world data for further study and comparison. To this end, we contribute a large-scale creative dataset from the Alibaba display advertising platform that comprises more than 500k products and 1.7M ad creatives.

In summary, the contributions of this paper include:

- We present a visual-aware ranking model (called VAM) that is capable of evaluating new creatives according to their visual appearance.
- Regarding the learned visual predictions as a prior, an improved hybrid bandit model (called HBM) is proposed to make better posterior estimations by taking more observations into consideration. The data and code are publicly available at https://github.com/alimama-creative/A_Hybrid_Bandit_Model_with_Visual_Priors_for_Creative_Ranking.git
- We construct a novel large-scale creative dataset named
CreativeRanking. Extensive experiments have been conducted on both our dataset and the public Mushroom dataset, demonstrating the effectiveness of the proposed method.

Problem Statement
Given a product, the goal is to determine which creative is the most attractive and should be displayed. Meanwhile, we need to estimate the uncertainty of the predictions so as to maximize the cumulative reward in the long run.

In the online advertising system, when an ad is shown to a user by displaying a candidate creative, this event is counted as an impression. Suppose there are $N$ products, denoted as $\{I_1, I_2, \cdots, I_n, \cdots, I_N\}$, and each product $I_n$ comprises a group of creatives, denoted as $\{C^n_1, C^n_2, \cdots, C^n_m, \cdots, C^n_M\}$. For product $I_n$, the objective is to find the creative

$$C^n_* = \arg\max_{c \in \{C^n_1, C^n_2, \cdots, C^n_M\}} CTR(c) \quad (1)$$

where $CTR(\cdot)$ denotes the CTR of a given creative. An empirical way to produce the CTR is to accumulate the current clicks and impressions and take the click ratio:
$$\widehat{CTR}(C^n_m) = \frac{click(C^n_m)}{impression(C^n_m)} \quad (2)$$

where $click(\cdot)$ and $impression(\cdot)$ denote the click and impression counts of creative $C^n_m$. But this estimate may suffer from insufficient impressions, especially for cold-start creatives. Another way is to learn a prediction function $\mathcal{N}(\cdot)$ from all the historical data by considering the contextual information (i.e., the image content):
$$\widehat{CTR}(C^n_m) = \mathcal{N}(C^n_m) \quad (3)$$

where $\mathcal{N}(\cdot)$ takes the image content of creative $C^n_m$ as input and learns from the historical data. The collected sequential data can be represented as

$$\mathcal{D} = \{(C_1, y_1), \cdots, (C_t, y_t), \cdots, (C_{|\mathcal{D}|}, y_{|\mathcal{D}|})\} \quad (4)$$

where $y_t \in \{0, 1\}$ is the label denoting whether a click is received. We take both the statistical data and the content information into consideration. Subsection 2.2 reviews product recommendation methods that take visual content as auxiliary information, and Subsection 2.3 introduces typical bandit models for estimating uncertainty. Both serve as strong baselines.

CTR prediction for image ads is a core task of online display advertising systems. Due to recent advances in computer vision, visual features have been employed to further enhance recommendation models [3, 6, 8, 9, 14, 20, 21, 31, 33]. [3, 9] quantitatively study the relationship between handcrafted visual features and creative online performance. Different from fixed handcrafted features, [6, 14, 33] apply "off-the-shelf" visual features extracted by deep convolutional neural networks [29]. [8, 21, 31] extend these methods by training the CNNs in an end-to-end manner. [20] integrates category information on top of the CNN embedding to help
Properties              Statistics
Number of products      500,827
Max candidates (arms)   11
Min candidates (arms)   3
Mean candidates (arms)  3.4
Figure 2: Statistical analysis of the CreativeRanking dataset. (a) summarizes some basic information, while (b) shows the number of products per product category. (c) conducts CTR analysis by comparing poor and good creatives.

visual modeling. The above works focus on improving product ranking by considering visual information while neglecting the great potential of creative ranking. There is little work so far addressing this topic. idealo.de (a portal of the German e-commerce market) adopts an aesthetic model [11] to select the most attractive image for each recommended hotel, believing that photos can be just as important for bookings as reviews. PEAC [34] resembles our method the most: it aims to rank ad creatives based on visual content. But it is an offline evaluation model that cannot flexibly update the ranking strategy when receiving online observations. Besides, none of the above methods models uncertainty, which may lead to a lack of exploration ability.
The multi-armed bandit (MAB) problem is a typical sequential decision-making process, also treated as an online decision-making problem [32]. A wide range of real-world applications can be modeled as MAB problems, such as online recommender systems [16], online advertising [27] and information retrieval [15]. Epsilon-greedy [12], Thompson sampling [25] and UCB [2] are classic context-free algorithms. They use rewards/costs from the environment to update their E&E policy without contextual information. It is difficult for such models to quickly adjust to new creatives (arms) since web content undergoes frequent changes. [1, 19, 24] extend these context-free methods by considering side information such as user/content representations. They assume that the expected payoff of an arm is linear in its features. The main problem linear algorithms face is their lack of representational power, which they compensate for with accurate uncertainty estimates. A natural attempt at getting the best of both representation learning and accurate uncertainty estimation consists in modeling a linear payoff on top of a neural network. NeuralLinear [24] presents a Thompson-sampling-based framework that simultaneously learns a data representation through neural networks and quantifies uncertainty via Bayesian linear regression. Inspired by this framework, we further improve both the neural network and the bandit method to benefit our creative ranking problem.
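To make the context-free baselines concrete, a minimal epsilon-greedy arm selector (illustrative, not the paper's implementation; names are ours) can be sketched as:

```python
import random

def epsilon_greedy_choice(clicks, pulls, epsilon=0.1):
    """Context-free epsilon-greedy: with probability epsilon pick a random arm
    (exploration); otherwise pick the arm with the highest empirical mean
    reward (exploitation). Arms that were never pulled are tried first."""
    if random.random() < epsilon:
        return random.randrange(len(clicks))
    means = [c / p if p > 0 else float("inf") for c, p in zip(clicks, pulls)]
    return max(range(len(means)), key=means.__getitem__)
```

With epsilon set to 0 the rule is purely greedy, which is exactly the deterministic behaviour that the contextual methods discussed above try to improve upon.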
In order to promote further research and comparison on creative ranking, we contribute a large-scale creative dataset to the research community. It comprises creative images and sequential impression data, which can be used for evaluating both visual predictions and E&E strategies. In this section, we first describe how the creatives and online feedback are collected in Subsection 3.1. Then we provide a statistical analysis of the dataset in Subsection 3.2.
We collected a large and diverse set of creatives from the Alibaba display advertising platform from July 1, 2020 to August 1, 2020. The total number of impressions is approximately 215 million. There are 500,827 products with 1,707,733 ad creatives. We make this dataset publicly available for further research and comparison. The creative and online feedback collection is subject to the following constraints:
Randomized logging policy.
The online system adopts a randomized logging policy so that creatives are drawn at random to collect an unbiased dataset. Bandit algorithms learn policies through interaction data; training or evaluation on offline data may suffer from exposure bias, known as the "off-policy evaluation problem" [23]. [19] demonstrates that if the logging policy chooses each arm uniformly at random, the estimation of bandit algorithms is unbiased. Thus, for each impression of product $I_n$, the policy randomly chooses a candidate creative, and we gather the resulting clicks.

Aligned creative lifetime.
Due to the complexity of the online environment, CTRs vary across time periods, even for the same creative. Creatives are newly designed or deleted over time, which results in inconsistent exposure periods (Figure 3(a)). In order to avoid the noise brought by different time intervals, we only collect the overlapping period among the candidate creatives (see Figure 3(b)). Besides, the overlap should be within 5 to 14 days, which covers the creative lifetime from the cold-start to a relatively stable stage. All the filtered creatives are gathered to build the sequential data.
Figure 3: Aligned creative lifetime. (a) Exposure period; (b) Aligned period.

Train/Validation/Test split.
We randomly split the 500,827 products into 300,242 training, 100,240 validation and 100,345 test samples, with 1,026,378/340,449/340,906 creatives respectively. We
Figure 4 panels: (a) Visual-aware Ranking Model (VAM); (b) Hybrid Bandit Model (HBM). Notations:
$I_1, I_2, \ldots, I_N$: products
$C^n_1, C^n_2, \ldots, C^n_M$: candidate creatives of $I_n$
$C^n_*$: best creative for $I_n$
$\boldsymbol{f}^n_m$: visual representation of $C^n_m$
$s^n_m$: visual prediction of $C^n_m$
$y^n_m$: bandit prediction of $C^n_m$
$(\mu^n_m, \sigma^{n2}_m, \Sigma^n_m, a^n_m, b^n_m), \boldsymbol{w}^n_m$: creative-specific parameters and sampled weights
$(\mu^n, \sigma^{n2}, \Sigma^n, a^n, b^n), \boldsymbol{w}^n$: product-wise parameters and sampled weights
$\lambda$: fusion weight
Figure 4: (Better viewed in color) The overall framework of the proposed hybrid bandit model with visual priors. It receives several candidate creatives (shown in one column on the left) and tries to find the most attractive one through both the Visual-aware Ranking Model (VAM) and the Hybrid Bandit Model (HBM). (a) VAM is a CNN model that evaluates creatives based on their visual content. (b) According to the visual priors, HBM estimates the posterior and corrects the ranking strategy.

treat each product as a sample, and aim to select the best creative among its candidates. The proposed VAM is learned from the training set, while the bandit model HBM is deployed on the validation/test data. This setting is used to prove the effectiveness of visual predictions on unseen products/creatives, and whether the policy can make better posterior estimations by using online observations.
The proposed dataset is collected from ad interaction logs across 32 days. Figure 2(a) gives a summary of our CreativeRanking dataset. It consists of 500,827 products, covering 124 categories. The min and max numbers of candidate creatives for a product are 3 and 11, while the average is 3.4. In fact, the number of candidates in real-world scenarios far exceeds 3.4, but the offline dataset is constrained by the conditions introduced in Subsection 3.1. Figure 2(b) shows the number of products for the top 20 categories, namely Women's tops, Women's bottoms, Men's, Women's shoes, Furniture, and so on. In Figure 2(c), we make further analysis of the creatives in these categories. Supposing we know the CTR of each creative, we select the poorest and best creatives for each product and accumulate their overall performance, visualized as the grey and (grey+blue+orange) bins. We find that the CTR of a product can be lifted substantially by selecting a good creative; specifically, a good creative is capable of lifting the CTR by at least 148%. With the CreativeRanking dataset, we would like to draw more attention to this topic, which benefits both the research community and websites' user experience.
We briefly overview the entire pipeline. The main notations used in this paper are summarized in the right panel of Figure 4. First, as shown in Figure 4(a), the feature extraction network $\mathcal{N}_{feat}$ simultaneously receives the creatives of the $n$-th product as input and produces the $d$-dimensional intermediate features $\{\boldsymbol{f}^n_1, \boldsymbol{f}^n_2, \cdots, \boldsymbol{f}^n_m, \cdots, \boldsymbol{f}^n_M\}$. Then a fully connected layer is employed to calculate their scores, indicated as $\{s^n_1, s^n_2, \cdots, s^n_m, \cdots, s^n_M\}$.

Second, the list-wise ranking loss and an auxiliary regression loss are introduced to guide the learning procedure. Such multi-objective optimization helps the model not only focus on creative ranking but also account for the numerical range of the CTR, which benefits the subsequent bandit model. In addition, since data noise is a common problem in real-world applications, we provide several practical solutions to mitigate casual and malicious noise. Details are described in Subsection 4.2.

After the above steps, the model can evaluate creative quality directly from visual content, even for a newly uploaded creative without any historical information. We then propose a hybrid bandit model that incorporates the learned $\boldsymbol{f}^n_m$ as contextual information and updates the policy by interacting with online observations. As shown in Figure 4(b), the hybrid model combines both product-wise and creative-specific predictions, which is more flexible for complex industrial data. The elaborated formulations are in Subsection 4.3.

Given a product $I_n$, we use the feature extraction network $\mathcal{N}_{feat}$ to extract high-level visual representations of its creatives, and a linear layer produces the attractiveness score for the $m$-th creative of the $n$-th product:

$$\boldsymbol{f}^n_m = \mathcal{N}_{feat}(C^n_m) \quad (5)$$
$$s^n_m = \boldsymbol{f}^{nT}_m \boldsymbol{w} \quad (6)$$

where $\boldsymbol{w}$ are the learnable parameters of the linear layer.

List-wise Ranking Loss.
To learn the relative order of creatives, we map the list of predicted scores and the list of ground-truth CTRs to permutation probability distributions, respectively, and then
take a metric between these distributions as the loss function. The mapping strategy and evaluation metric should guarantee that candidates with higher scores are ranked higher. [5] proposed the permutation probability and top-$k$ probability definitions. Inspired by this work, we simplify the probability of a creative being ranked at the top-1 position as

$$p^n_m = \frac{\exp(s^n_m)}{\sum_{i=1}^{M} \exp(s^n_i)} \quad (7)$$

where $\exp(\cdot)$ is the exponential function. The exponential-function-based top-1 probability is both scale invariant and translation invariant. The corresponding labels are

$$y_{rank}(C^n_m) = \frac{\exp(CTR(C^n_m)/T)}{\sum_{i=1}^{M} \exp(CTR(C^n_i)/T)} \quad (8)$$

where $\exp(\cdot/T)$ is the exponential function with temperature $T$. Since $CTR(C^n_m)$ is only a few percent, we use $T$ to adjust the scale of the values so that the probability of the top-1 sample is close to 1. With cross entropy as the metric, the loss for product $I_n$ becomes

$$\mathcal{L}^n_{rank} = -\sum_m y_{rank}(C^n_m) \log(p^n_m) \quad (9)$$

Through this objective, the model focuses on comparing creatives within the same product. We concentrate on the top-1 probability since it is consistent with real scenarios, which display only one creative per impression. Besides, the end-to-end training greatly utilizes the learning ability of deep CNNs and boosts visual prior knowledge extraction.

Point-wise Auxiliary Regression Loss. In addition to the list-wise ranking loss, we expect the point-wise regression to enforce more accurate predictions. The ranking loss only constrains the order of the outputs, leaving their numerical scale unconstrained.
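A pure-Python sketch of the top-1 list-wise loss in Equations 7–9 (the function names and example temperature are ours, not the paper's):

```python
import math

def top1_probs(values, temperature=1.0):
    """Softmax over a list of values: probability of each item being ranked
    first (Equations 7 and 8; the temperature T sharpens CTR-derived labels)."""
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_rank_loss(scores, ctrs, temperature=0.01):
    """Cross entropy between label and predicted top-1 distributions (Eq. 9)."""
    p = top1_probs(scores)              # model's top-1 distribution, Eq. 7
    y = top1_probs(ctrs, temperature)   # sharpened CTR labels, Eq. 8
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))
```

Because CTRs are only a few percent, a small temperature (e.g. 0.01) pushes the label distribution toward one-hot, as the text describes; a score list that ranks creatives in the same order as their CTRs yields a smaller loss than a reversed one.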
Since the learned representations will be adopted as prior knowledge for the bandit model in Subsection 4.3, making the outputs close to the real CTRs significantly stabilizes the bandit learning procedure. Thus we add the point-wise regression as a regularizer:

$$\mathcal{L}^n_{reg} = \sum_m \| CTR(C^n_m) - s^n_m \|_2 \quad (10)$$

where $\|\cdot\|_2$ denotes the $L_2$ norm. Finally, we add the ranking loss and the auxiliary loss to form the final loss:

$$\mathcal{L}^n = \mathcal{L}^n_{rank} + \gamma \mathcal{L}^n_{reg} \quad (11)$$

where $\gamma$ is 0.5 in our experiments.

Noise Mitigation. In both the list-wise ranking and the point-wise regression (Equations 8 and 10),
$CTR(C^n_m)$ can be estimated by Equation 2. But in a real-world dataset, some creatives do not have sufficient impression opportunities, and the estimate may suffer from huge variance. For example, if a creative gets only one impression and a click is accidentally recorded, its $\widehat{CTR}$ will be 1, which is unreliable. To mitigate this problem, we provide two practical solutions, namely label smoothing and weighted sampling.
Label smoothing is an empirical Bayes method utilized to smooth the CTR estimate [30]. Suppose the clicks come from a binomial distribution and the CTR follows a prior distribution:

$$clicks(C^n_m) \sim Binomial(impression(C^n_m), CTR(C^n_m))$$
$$CTR(C^n_m) \sim Beta(\alpha, \beta) \quad (12)$$

where $Beta(\alpha, \beta)$ is regarded as the prior distribution of CTRs. After observing more clicks, the conjugacy between the Binomial and Beta distributions allows us to obtain the posterior distribution and the smoothed estimate
$$\widehat{CTR}(C^n_m) = \frac{click(C^n_m) + \alpha}{impression(C^n_m) + \alpha + \beta} \quad (13)$$

where $\alpha$ and $\beta$ are obtained by maximum likelihood estimation over all the historical data [30]. Compared to the original estimator, the smoothed $\widehat{CTR}$ has smaller variance and benefits training.
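The smoothed estimator of Equation 13 is a one-liner; the α and β values below are illustrative, not fitted ones:

```python
def smoothed_ctr(clicks, impressions, alpha, beta):
    """Posterior-mean CTR under a Beta(alpha, beta) prior (Equation 13).
    Shrinks unreliable ratios from low-impression creatives toward the
    prior mean alpha / (alpha + beta)."""
    return (clicks + alpha) / (impressions + alpha + beta)
```

A creative with one accidental click in one impression no longer gets a CTR of 1: with alpha = 1, beta = 32 the estimate shrinks to 2/34 ≈ 0.059.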
Weighted sampling is a sampling strategy for the training process. Instead of treating each product equally, we pay more attention to products whose impressions are adequate and whose CTRs are more reliable. The sampling weights are produced by

$$p_n = g(impression(I_n)) \quad (14)$$

where $g(\cdot)$ is set to the logarithm of the impressions and $p_n$ denotes the sampling weight of product $I_n$.

All the above modules are integrated in a unified framework, and the visual-aware ranking model focuses on learning general visual patterns of display performance. The informative representations are then applied as prior knowledge for the bandit algorithm.

In this section, we provide an elegant and efficient strategy that tackles the E&E dilemma by utilizing the visual priors and updating the posterior via the hybrid bandit model. Based on the NeuralLinear framework [24], we build a Bayesian linear regression on the extracted visual representations. Assume the online feedback data is generated as follows:

$$\boldsymbol{y} = \boldsymbol{f}^T \tilde{\boldsymbol{w}} + \epsilon \quad (15)$$

where $\boldsymbol{y}$ represents the clicked/non-clicked data and $\boldsymbol{f}$ is the visual representation extracted by VAM. Different from the deterministic weights $\boldsymbol{w}$ in Equation 6, we need to learn a weight distribution $\tilde{\boldsymbol{w}}$ whose uncertainty benefits the E&E decision making. The noise terms $\epsilon$ are independent and identically normally distributed random variables:

$$\epsilon \sim \mathcal{N}(0, \sigma^2) \quad (16)$$

According to Bayes' theorem, if the prior distribution of $\tilde{\boldsymbol{w}}$ and $\sigma^2$ is conjugate to the data likelihood, the posterior distributions can be derived analytically. Thompson sampling, also known as posterior sampling, then tackles the E&E dilemma by maintaining the posterior over models and selecting creatives in proportion to the probability that they are optimal.
We model the prior joint distribution of $\tilde{\boldsymbol{w}}$ and $\sigma^2$ as

$$\pi(\tilde{\boldsymbol{w}}, \sigma^2) = \pi(\tilde{\boldsymbol{w}} \mid \sigma^2)\, \pi(\sigma^2), \quad \sigma^2 \sim IG(a_0, b_0) \ \text{and} \ \tilde{\boldsymbol{w}} \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 \Sigma_0^{-1}) \quad (17)$$

where $IG(\cdot)$ is an Inverse Gamma distribution whose prior hyperparameters are set to $a_0 = b_0 = \eta > 0$, and $\mathcal{N}(\cdot)$ is a Gaussian distribution with the initial parameter $\Sigma_0 = \lambda I_d$. Note that $\mu_0$ is initialized with the learned weights $\boldsymbol{w}$ of VAM in Equation 6. This provides better prior hyperparameters that further enhance performance in the cold-start phase. We call this VAM-Warmup, and the results are shown in Figure 5(b).

Because we have chosen a conjugate prior, the posterior at time $t$ can be derived as

$$\Sigma(t) = \boldsymbol{f}^T \boldsymbol{f} + \Sigma_0$$
$$\mu(t) = \Sigma(t)^{-1} (\Sigma_0 \mu_0 + \boldsymbol{f}^T \boldsymbol{y})$$
$$a(t) = a_0 + t/2$$
$$b(t) = b_0 + \frac{1}{2}\left(\boldsymbol{y}^T \boldsymbol{y} + \mu_0^T \Sigma_0 \mu_0 - \mu(t)^T \Sigma(t) \mu(t)\right) \quad (18)$$

where $\boldsymbol{f} \in \mathbb{R}^{t \times d}$ is the matrix of content features for previous impressions and $\boldsymbol{y} \in \mathbb{R}^{t \times 1}$ holds the feedback rewards. After updating the above parameters at the $t$-th impression, we obtain the weight distribution with uncertainty estimation. We draw the weights $\boldsymbol{w}(t)$ from the learned distribution $\mathcal{N}(\mu(t), \sigma^2(t) \Sigma(t)^{-1})$ and select the best creative for product $I_n$ as

$$C^n_* = \arg\max_{c \in \{C^n_1, C^n_2, \cdots, C^n_M\}} (\mathcal{N}_{feat}(c))^T \boldsymbol{w}(t) \quad (19)$$

The above model shares the weight distribution across all products.
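A sketch of the conjugate update in Equation 18 and the posterior draw used for selection (Equation 19), specialized to a one-dimensional feature for brevity; the function names and toy numbers are ours:

```python
import math
import random

def nig_posterior(f, y, mu0, prec0, a0, b0):
    """Normal-Inverse-Gamma posterior for scalar Bayesian linear regression
    (Equation 18). f and y are the features and 0/1 rewards observed so far."""
    t = len(y)
    prec_t = sum(fi * fi for fi in f) + prec0                       # Sigma(t)
    mu_t = (prec0 * mu0 + sum(fi * yi for fi, yi in zip(f, y))) / prec_t
    a_t = a0 + t / 2.0
    b_t = b0 + 0.5 * (sum(yi * yi for yi in y)
                      + prec0 * mu0 ** 2 - prec_t * mu_t ** 2)
    return mu_t, prec_t, a_t, b_t

def thompson_draw(mu_t, prec_t, a_t, b_t):
    """Sample sigma^2 ~ IG(a, b) via a Gamma reciprocal, then
    w ~ N(mu, sigma^2 / precision), matching the text's posterior draw."""
    sigma2 = 1.0 / random.gammavariate(a_t, 1.0 / b_t)
    return random.gauss(mu_t, math.sqrt(sigma2 / prec_t))
```

Selecting a creative is then an argmax of `f_c * thompson_draw(...)` over the candidates, mirroring Equation 19.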
This simple linear assumption works well for small datasets but becomes inferior on industrial data. For example, bright and vivid colors are more attractive for women's tops, while concise colors are more proper for 3C digital accessories. In addition to this product-wise characteristic, a creative may contain a uniquely designed attribute that is not expressed by the shared weights. Hence, it is helpful to have weights with both shared and non-shared components.

We extend Equation 15 to the following hybrid model by combining product-wise and creative-specific linear terms. For creative $C^n_m$, it is formulated as

$$y^n_m = \boldsymbol{f}^{nT}_m \boldsymbol{w}^n + \boldsymbol{f}^{nT}_m \boldsymbol{w}^n_m \quad (20)$$

where $\boldsymbol{w}^n$ and $\boldsymbol{w}^n_m$ are the product-wise and creative-specific parameters, disjointly optimized by Equation 18. Furthermore, we propose a fusion strategy to adaptively combine the two terms instead of simple addition:

$$y^n_m = (1 - \lambda)\, \boldsymbol{f}^{nT}_m \boldsymbol{w}^n + \lambda\, \boldsymbol{f}^{nT}_m \boldsymbol{w}^n_m \quad (21)$$

where $\lambda = \left(1 + e^{-\frac{impression(I_n) + \theta_2}{\theta_1}}\right)^{-1}$ is a sigmoid function with rescale $\theta_1$ and offset $\theta_2$. We find that if the impressions are inadequate, the product-wise parameters are learned better because they make use of the knowledge among all candidate creatives; otherwise, the creative-specific term outperforms the shared one thanks to sufficient feedback observations. The above procedure is shown in Algorithm 1. Because our hybrid model updates the parameters of each product independently, we take $I_n$ as an example and adopt $(a^n(\cdot), b^n(\cdot), \mu^n(\cdot), \Sigma^n(\cdot))$ and $(a^n_m(\cdot), b^n_m(\cdot), \mu^n_m(\cdot), \Sigma^n_m(\cdot))$ to represent the shared and specific parameters. The distributions describe the uncertainty in the weights, which is related to the number of impressions: if there is less data, the model relies more on the visual evaluation results; otherwise, the likelihood reduces the prior's effect so as to converge to the observed data. In order to fit the complex industrial data, we extend the shared linear model to the hybrid version, which considers both product-level knowledge and creative-specific information, fused by empirical attention weights.

Algorithm 1: Hybrid Bandit Model

Input: $T > 0$, product $I_n$, visual representations of the candidate creatives $\boldsymbol{f}^n_1, \boldsymbol{f}^n_2, \cdots, \boldsymbol{f}^n_m, \cdots, \boldsymbol{f}^n_M$
Initialize $a_0$, $b_0$, $\mu_0$ and $\Sigma_0$;
$a^n(0) \leftarrow a_0$, $b^n(0) \leftarrow b_0$, $\mu^n(0) \leftarrow \mu_0$, $\Sigma^n(0) \leftarrow \Sigma_0$;
$a^n_m(0) \leftarrow a_0$, $b^n_m(0) \leftarrow b_0$, $\mu^n_m(0) \leftarrow \mu_0$, $\Sigma^n_m(0) \leftarrow \Sigma_0$;
for $t = 1, 2, 3, \ldots, T$ do
  $\lambda = \left(1 + e^{-\frac{impression(I_n) + \theta_2}{\theta_1}}\right)^{-1}$;
  for $m = 1, 2, 3, \ldots, M$ do
    Sample $\sigma^{n2}$ from $IG(a^n(t-1), b^n(t-1))$;
    Sample $\boldsymbol{w}^n$ from $\mathcal{N}(\mu^n(t-1), \sigma^{n2}\, \Sigma^n(t-1)^{-1})$;
    Sample $\sigma^{n2}_m$ from $IG(a^n_m(t-1), b^n_m(t-1))$;
    Sample $\boldsymbol{w}^n_m$ from $\mathcal{N}(\mu^n_m(t-1), \sigma^{n2}_m\, \Sigma^n_m(t-1)^{-1})$;
    $y^n_m = (1-\lambda)\, \boldsymbol{f}^{nT}_m \boldsymbol{w}^n + \lambda\, \boldsymbol{f}^{nT}_m \boldsymbol{w}^n_m$;
  end
  $k = \arg\max(y^n_1, \ldots, y^n_m, \ldots, y^n_M)$;
  Display the creative $C^n_k$, and get the reward;
  Update $a^n(t)$, $b^n(t)$, $\mu^n(t)$, $\Sigma^n(t)$ using the historical data of product $I_n$ and Equation 18;
  Update $a^n_k(t)$, $b^n_k(t)$, $\mu^n_k(t)$, $\Sigma^n_k(t)$ using the historical data of creative $C^n_k$ and Equation 18;
  Keep the other parameters at time $t$ the same as at time $t-1$;
  $impression(I_n) \leftarrow impression(I_n) + 1$
end

Dataset Preparation.
The description of the CreativeRanking data is presented in Section 3.1. The original images and rewards for each creative are provided in display order. For VAM, we aggregate the numbers of impressions and clicks to produce
$\widehat{CTR}$ by Equation 13 on the training set, and train VAM using the loss function in Equation 11. For HBM, we update the policy by providing the visual representations extracted by VAM and the impression data as in Equation 4. Note that the interaction and policy-updating procedure (see Algorithm 1) of HBM is conducted on the test set to simulate online situations. We record the sequential interactions and rewards to measure performance (see Algorithm 2 and Equation 21). The validation set is used for hyperparameter tuning.

In addition to the CreativeRanking data, we also evaluate the methods on a public dataset, called Mushroom. Since there is no public dataset for creative ranking yet, we test the proposed hybrid bandit model on this standard dataset. The Mushroom dataset [26] contains 22 attributes for each mushroom, and two categories: poisonous and safe. Eating a safe mushroom receives reward +5; eating a poisonous one receives reward +5 with probability 1/2 and −
35 otherwise. Not eating will provide no reward.We follow the protocols in [24], and interact for 50000 rounds.
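The environment side of this simulation is tiny; a minimal sketch of the reward draw, assuming the standard protocol of [24] (+5 for eating a safe mushroom; +5 with probability 0.5 and −35 otherwise for a poisonous one; 0 for abstaining):

```python
import numpy as np

def mushroom_reward(is_poisonous, eat, rng):
    """Bandit reward for the Mushroom data, following the protocol of [24]:
    eating safe -> +5; eating poisonous -> +5 w.p. 0.5, else -35; not eating -> 0."""
    if not eat:
        return 0.0
    if not is_poisonous:
        return 5.0
    return 5.0 if rng.random() < 0.5 else -35.0

rng = np.random.default_rng(0)
print(mushroom_reward(False, True, rng))  # eating a safe mushroom: 5.0
```

Under this scheme the expected reward of eating a poisonous mushroom is −15, so abstaining is optimal there.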
Hybrid Bandit Model with Visual Priors for Creative Ranking in Display Advertising WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
Evaluation Metrics.
For the CreativeRanking data, we present two evaluation metrics, named simulated CTR (sCTR) and cumulative regret (Regret), respectively.
Simulated CTR (sCTR) is a practical metric that closely tracks online performance. The details are shown in Algorithm 2. It replays the recorded impression data for all products. For each product, the policy plays T_n rounds by receiving the recorded data (C, y), and selects the best creative according to the predicted scores. If the selected creative is the same as the logged C, the impression number, click number and the policy itself are updated (see lines 3 to 14 in Algorithm 2). Taking HBM as an example, Algorithm 1 shows the online update process. To test HBM with offline data, we change the action "display and update" (lines 14 to 18 in Algorithm 1) to the conditioned version in lines 8 to 12 of Algorithm 2.

Cumulative regret (Regret) is commonly used for evaluating bandit models. It is defined as

    Regret = E[r* − r]    (22)

where r* is the cumulative reward of the optimal policy, i.e., the policy that always selects the action with the highest expected reward given the context [24]. Specifically, we select the optimal creative C*_n for our dataset, and calculate the Regret as

    Regret = (∑_{n=1}^{N} click(C*_n)) / (∑_{n=1}^{N} impression(C*_n)) − sCTR    (23)

where sCTR is first produced by Algorithm 2, and C*_n is selected by calculating the empirical CTR in Equation 2 on the test set. For Mushroom, we follow the definition of cumulative regret in [24] to evaluate the models.
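The replay evaluation of Algorithm 2 can be sketched as follows. `RandomPolicy` is a hypothetical stand-in for any policy exposing a `predict`/`update` interface (HBM would slot in here):

```python
import random

class RandomPolicy:
    """Toy stand-in with the predict/update interface assumed by the replay loop."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def predict(self, M):
        return [self.rng.random() for _ in range(M)]
    def update(self, creative, reward):
        pass  # a learning policy would refit its posterior here

def simulated_ctr(stream, policy, M):
    """Offline replay (Algorithm 2): an impression counts only when the
    policy's argmax choice matches the logged creative, and the policy is
    updated on exactly those matched impressions."""
    impressions, clicks = 0, 0
    for logged_creative, click in stream:  # click in {0, 1}
        scores = policy.predict(M)
        k = max(range(M), key=lambda m: scores[m])
        if k == logged_creative:
            impressions += 1
            clicks += click
            policy.update(logged_creative, click)
    return clicks / max(impressions, 1)

# Replay a tiny synthetic log of (shown creative, click) pairs.
log = [(0, 1), (1, 0), (2, 0), (0, 1), (1, 1)] * 20
sctr = simulated_ctr(log, RandomPolicy(), M=3)
```

The returned sCTR is what Equation 23 subtracts from the optimal creatives' CTR to obtain the Regret.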
The model was implemented with PyTorch [22]. We adopt a deep residual network (ResNet-18) [17] pretrained on ImageNet classification [10] as the backbone, and the model is finetuned on the CreativeRanking task. For VAM, we use stochastic gradient descent (SGD) with a mini-batch size of 64 per GPU. The learning rate is initially set to 0.01 and then gradually decreased to 1e-4. The training process lasts 30 epochs. For HBM, we extract the feature representations f_nm from VAM, and update the weight distributions w_nm and w_n by Bayesian regression.

Algorithm 2: Evaluation Metrics - sCTR
Input: impression data, policy π
Output: sCTR
impressions ← 0
clicks ← 0
for n = 1, 2, 3, ..., N do
    for t = 1, 2, 3, ..., T_n do
        Get the next impression (C, y);
        Get predicted scores (y_n1, ..., y_nM) by policy π;
        k = arg max(y_n1, ..., y_nM);
        if C_nk = C then
            impressions ← impressions + 1;
            clicks ← clicks + y;
            update policy π with data (C, y);
        end
    end
end
sCTR = clicks / impressions;
return sCTR

In this subsection, we show the performance of the related methods in Table 1 and Figure 5. The methods are divided into several groups: a uniform strategy, context-free bandit models, linear bandit models, neural bandit models and our proposed methods. Table 1 presents the
Regret and sCTR of all the above models on both the Mushroom and CreativeRanking datasets; our methods (NN/VAM-HBM) exhibit state-of-the-art results compared with the related models. We also conduct further analysis by showing the reward tendency over 15 consecutive days in Figure 5. Daily sCTR evaluates the model for each day independently, showing the flexibility of the policy when interacting with the feedback, while cumulative sCTR presents the cumulative rewards up to a specific day, which measures the overall performance.
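The per-product and per-creative posterior updates that Algorithm 1 delegates to Equation 18 follow, we assume, the standard conjugate Normal-Inverse-Gamma form of Bayesian linear regression (as in NeuralLinear [24]); a sketch under that assumption:

```python
import numpy as np

def nig_update(X, y, mu0, Sigma0, a0, b0):
    """Conjugate Normal-Inverse-Gamma update for Bayesian linear regression
    (the standard NeuralLinear-style form; assumed to match Equation 18)."""
    n = X.shape[0]
    Sigma_t = X.T @ X + Sigma0                       # posterior precision
    mu_t = np.linalg.solve(Sigma_t, Sigma0 @ mu0 + X.T @ y)
    a_t = a0 + n / 2.0
    b_t = b0 + 0.5 * (y @ y + mu0 @ Sigma0 @ mu0 - mu_t @ Sigma_t @ mu_t)
    return mu_t, Sigma_t, a_t, b_t

# With enough data, the posterior mean recovers the true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=500)
mu_t, Sigma_t, a_t, b_t = nig_update(X, y, np.zeros(3), np.eye(3), 6.0, 6.0)
```

Sampling σ from IG(a_t, b_t) and w from N(μ_t, σ Σ_t^(−1)), as in Algorithm 1, then implements Thompson sampling over this posterior.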
Uniform: The baseline strategy that randomly selects an action (eat/not eat for Mushroom, and one creative for CreativeRanking). Because this strategy has neither prior knowledge nor the ability to learn from the data, it performs poorly on the test sets.
Context-free Bandit Models: Epsilon-greedy [12], Thompson sampling [25] and Upper Confidence Bound (UCB) approaches [2] are simple yet effective strategies for the bandit problem. They rely on historical impression data (click/non-click) and keep updating their strategies. However, in the cold-start stage they may choose a creative at random like the "Uniform" strategy (orange lines in Figure 5(c) in the first few days). We find that their curves rise gradually, but without prior information their overall performance is inferior to the other models.
Linear Bandit Models: The linear bandit model extends the context-free methods by incorporating contextual information. For Mushroom, we adopt the 22 attributes that describe a mushroom, such as shape and color. The Regret is reduced when this side information is combined. For CreativeRanking, we use a color distribution [3] to represent a creative, and update the linear payoff functions. From the results in Table 1, the linear models achieve better results than the context-free methods, but they still lack representational power.
Neural Bandit Models: The neural bandit models add a linear regression on top of a neural network. In Table 1, "NN" denotes the fully connected layers used for extracting mushroom representations. For CreativeRanking, all these neural models use our VAM as the feature extractor and adopt different E&E policies. Figure 5(a) reveals some interesting observations: (1) The orange and blue lines represent E-Greedy and VAM-Greedy, respectively. With the visual priors, VAM-Greedy achieves much better performance at the beginning (about a 5% CTR lift), which demonstrates the effectiveness of the visual evaluation. (2) Because VAM-Greedy is a greedy strategy that lacks exploration, it becomes mediocre in the long run. When incorporating the E&E model HBM, our VAM-HBM outperforms the other baselines by a significant margin. Besides,
Table 1: Performance comparison with state-of-the-art systems on both the Mushroom and CreativeRanking test sets. Regret is normalized with respect to the performance of Uniform.

                                           Mushroom      CreativeRanking
Evaluation Metrics                         Regret (%)    Regret (%)   sCTR (%)
Uniform                                    100           100          2.950
Context-free Bandit Models (orange lines)
  E-Greedy [12]                            52.99         87.22        3.166
  Thompson Sampling [25]                   52.49         87.69        3.158
  UCB [2]                                  52.42         87.04        3.169
Linear Bandit Models (green lines)
  LinGreedy [24]                           14.28         91.72        3.090
  LinThompson [24]                         2.37          85.68        3.192
  LinUCB [19]                              10.27         85.50        3.195
Neural Bandit Models (blue lines)
  NN/VAM-Greedy                            6.68          84.11        3.219
  NN/VAM-Thompson [24]                     2.22          83.02        3.237
  NN/VAM-UCB                               7.51          83.91        3.222
  NN/VAM-Dropout [13]                      5.57          84.32        3.215
Our Methods (red lines)
  VAM-Warmup                               -             79.70        3.293
  NN/VAM-HBM                               1.93          78.11        3.320

Figure 5 panels: (a) Ours vs (E-Greedy & VAM-Greedy); (b) Ours vs VAM-Thompson; (c) Daily sCTR for all methods; (d) Accumulated sCTR for all methods.
Figure 5: Reward tendency over 15 consecutive days on CreativeRanking.

we also use Dropout as a Bayesian approximation [13], but it is not able to estimate the uncertainty as accurately as the other policies.

Our Methods: We propose VAM-Warmup, which initializes μ in the bandit model with the weights learned by VAM. Comparing the red and blue dashed lines in Figure 5(b), we find that parameters with prior distributions improve the overall CTR by 1.7%. In addition, we extend the model with creative-specific parameters, named VAM-HBM, which further enhances the model capacity and achieves the state-of-the-art result, especially once the impressions for creatives become adequate (see the solid red line in Figure 5(b)(c)(d)). For the Mushroom dataset, in order to demonstrate the idea, we cluster the data into 2 groups by the attribute "bruises", each maintaining individual parameters. When combining the individual and shared parameters via the fusion weights in Equation 21, the model reduces the Regret to 1.93. Note that we use the default hyperparameters provided by NeuralLinear without careful tuning.
In this subsection, we conduct an ablation study on the CreativeRanking dataset to validate the effectiveness of each component of the VAM, including the list-wise ranking loss, the point-wise auxiliary regression loss and noise mitigation. Besides, we also compare our VAM with "learning-to-rank" visual models (including aesthetic models). We show the results in Table 2 and Table 3 to demonstrate the consistent improvements.

Table 2: Ablation study for each component of the VAM. sCTR is measured on the CreativeRanking test set and the ↑sCTR lift is calculated as (sCTR(*) − sCTR(base)) / sCTR(base).

Methods              Base     (a)      (b)      (c)      (d)
Point-wise Loss?              √                 √        √
List-wise Loss?                        √        √        √
Noise Mitigation?                                        √
sCTR (%)             2.950    -        3.167    3.194    3.219
↑sCTR                -        -        7.4%     8.3%     9.1%

Base in Table 2 stands for the baseline result. We adopt the "uniform" strategy that randomly chooses a creative among the candidates. The baseline sCTR is 2.950%.

Method (a) and (b): Methods (a) and (b) utilize the point-wise loss (Equation 10) and the list-wise loss (Equation 9) as the objective function, respectively. Although the model has never seen the products/creatives of the test set before, it has learned general patterns for identifying more attractive creatives. Moreover, the ranking loss concentrates on top-1 probability learning, which is more suitable than the point-wise objective for our scenario. The simple version (b) improves the sCTR by 7.4%.
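A numpy sketch of the objective discussed above: a ListNet-style top-1 cross-entropy plus a point-wise regression term weighted by γ. Equations 9-11 are not reproduced in this section, so treat the exact functional forms here as assumptions:

```python
import numpy as np

def listwise_top1_loss(scores, ctr, gamma=0.5):
    """List-wise top-1 ranking loss with a point-wise auxiliary term
    (a sketch of the objective around Equations 9-11; gamma is the
    auxiliary weight tuned in Table 4).
    scores: (M,) predicted scores for one product's creatives.
    ctr:    (M,) empirical CTR targets."""
    target = ctr / max(ctr.sum(), 1e-8)             # top-1 probability targets
    shifted = scores - scores.max()                  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    rank_loss = -(target * log_probs).sum()          # list-wise cross-entropy
    pred = 1.0 / (1.0 + np.exp(-scores))             # absolute CTR prediction
    point_loss = ((pred - ctr) ** 2).mean()          # point-wise regression
    return rank_loss + gamma * point_loss

ctr = np.array([0.02, 0.05, 0.01])
loss_aligned = listwise_top1_loss(np.array([-1.0, 1.0, -2.0]), ctr)
loss_reversed = listwise_top1_loss(np.array([1.0, -1.0, 2.0]), ctr)
```

Scores that rank the creatives in the same order as their CTRs incur a smaller loss, which is exactly what the list-wise objective rewards.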
Figure 6 panels (creatives C-1 to C-6 over Day-1 to Day-5): (a) Example-1, proper priors - VAM+HBM (sCTR 5.91%), E-greedy (sCTR 3.48%), Thompson Sampling (sCTR 2.76%); (b) Example-2, incorrect priors - VAM+HBM (sCTR 3.87%), E-greedy (sCTR 1.72%), Thompson Sampling (sCTR 3.38%).
Figure 6: Two typical cases that present the changing of strategies. The horizontal axis shows different creatives while the vertical axis is the probability of each creative being displayed. "Proper priors" indicates that VAM provides correct predictions, and "Incorrect priors" otherwise.
Table 3: Comparison with other "learning-to-rank" visual models. All models adopt ResNet-18 as the backbone.

Ranking Loss                     sCTR (%)
Pairwise Hinge Loss [7]          3.170
Aesthetics Ranking Loss [18]     3.167
Triplet Loss [28]                3.115
Pairwise [34]                    3.188
VAM (Ours)                       3.219
Method (c): Method (c) combines the point-wise auxiliary regression loss with the ranking objective. It learns not only the relative order of creative quality, but also the absolute CTRs. We find that it is good at fitting the real CTR distributions and achieves a better sCTR of 3.194% (an 8.3% lift).

Method (d): Method (d) adds label smoothing and a weighted sampler, both of which are designed to mitigate label noise. The weighted sampler makes the model pay more attention to samples whose impression numbers are sufficient, while label smoothing improves label reliability. These two practical methods further improve the sCTR to 3.219%, a 9.1% lift in total.
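The two noise-mitigation techniques of method (d) might look as follows; both the additive-smoothing form and the square-root sampler weighting are illustrative assumptions rather than the paper's exact recipe:

```python
import numpy as np

def smoothed_ctr_label(clicks, impressions, alpha=1.0, beta=100.0):
    """Label smoothing for noisy CTR targets: shrink each empirical CTR
    toward a prior. The pseudo-counts alpha/beta are illustrative."""
    return (clicks + alpha) / (impressions + alpha + beta)

def sampler_weights(impressions):
    """Weighted sampler: favor samples with many impressions, i.e. whose
    CTR labels are statistically reliable (sqrt weighting is illustrative)."""
    w = np.sqrt(np.asarray(impressions, dtype=float))
    return w / w.sum()
```

A creative with 0 clicks over 10 impressions then gets a small but non-zero target instead of an exact (and unreliable) 0, and products with ample traffic dominate each training epoch.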
Related Loss Functions: Pair-wise and triplet losses are typical loss functions for learning-to-rank problems. [7, 18, 28] adopt a hinge loss for "maximum-margin" classification between the better candidate and the other one. It only requires the better creative to score higher than the other by a pre-defined margin, without considering the exact difference. Our loss functions in Equations 9 and 10 estimate the CTR gaps and produce more accurate differences. [34] employs [4] as its pair-wise framework. Compared to pair-wise learning, we treat one product as a training sample and use its list of creatives as instances. This is more efficient and better matches real scenarios, which display the best creative for each impression. Thus, our method obtains the leading performance in sCTR.

In summary, the proposed list-wise method lets the model focus on learning creative quality and obtain better generalizability. Incorporating the point-wise regression and noise-mitigation techniques further enhances the model's capacity to fit real-world data.

γ in Equation 11. We tune the hyperparameters on the validation set. γ in Equation 11 controls the weight of the point-wise auxiliary loss. According to the validation results (see Table 4), we take γ = 0.5. This is consistent with our hypothesis that the ranking loss should play the more important role in the creative ranking task.

Table 4: Val/Test sCTR with different γ in Equation 11.

γ in Equation 11       0.0     0.1     0.5     1.0     2.0
Validation sCTR (%)    3.15    3.15    3.17    3.16    3.13
Test sCTR (%)          3.17    3.19    3.22    3.18    3.15

θ1/θ2 of λ in Equation 21. θ1 and θ2 control the rescale and offset of λ in Equation 21. The optimal hyperparameters vary across real-world platforms (e.g., the offset is set to 150, around the mean impression number of each creative). We find that the final performance is not sensitive to these hyperparameters (see Table 5). We choose θ1 = 50 and θ2 = 150 in our experiments.

Table 5: x%(x%) denotes val(test) sCTR for different θ1/θ2 in λ.

θ1 \ θ2    125              150              175
30         3.27% (3.32%)    3.28% (3.33%)    3.28% (3.31%)
50         3.27% (3.31%)    3.28% (3.32%)    3.27% (3.32%)
100        3.27% (3.31%)    3.27% (3.31%)    3.27% (3.31%)
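With the chosen values, the gate λ of Equation 21 can be sketched as below, assuming the impression count enters the sigmoid offset by θ2 with a minus sign so that λ grows from near 0 toward 1 as impressions accumulate:

```python
import math

def fusion_weight(impressions, theta1=50.0, theta2=150.0):
    """Sigmoid gate lambda of Equation 21 (sign convention assumed): with
    few impressions the shared visual-prior weights dominate; past the
    offset theta2 the creative-specific weights take over."""
    return 1.0 / (1.0 + math.exp(-(impressions - theta2) / theta1))
```

At 0 impressions λ ≈ 0.047, so the fused score (1 − λ) f^T w_n + λ f^T w_nm is driven almost entirely by the shared weights; at the offset of 150 impressions, λ = 0.5.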
Strategy Visualization.
We show two typical cases that exhibit the changing of strategies. Figure 6(a) shows a proper prior of HBM. We expect the best creative to have the largest display probability among the candidates; when this expectation is satisfied a blue bar is shown, and otherwise orange bars are shown. HBM grants most impression opportunities to creative C-5 from the first day, while the other two methods spend 2 days finding the best creative. For the other case, which receives an incorrect prior in Figure 6(b), HBM gradually corrects itself through the observed feedback and still achieves the highest sCTR among the three methods.
Figure 7: Visualization of the learned VAM. The model pays attention to different regions adaptively, including products, models and the text on the creative.
CNN Visualization.
Besides ranking performance, we would like to gain further insight into the learned VAM. To this end, we visualize the response of the VAM according to the activations of the high-level feature maps; the resulting visualization is shown in Figure 7. By learning from creative ranking, the CNN attends to different regions adaptively, including products, models and the text on the creative. As shown in the second row of Figure 7, the VAM pays more attention to the models than to the products, perhaps because products endorsed by celebrities are more attractive than plainly displayed products. Besides, some textual information, such as descriptions and discount information, can also attract customers.
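Response maps of this kind can be reproduced in spirit by averaging the high-level feature maps; the paper does not specify its exact visualization recipe, so the following channel-mean, nearest-neighbour sketch is only an assumption:

```python
import numpy as np

def activation_heatmap(feature_maps, image_hw):
    """Coarse response map: average (C, h, w) feature maps over channels,
    normalize to [0, 1], and upsample to the image size with
    nearest-neighbour indexing (bilinear is the more common choice;
    nearest keeps the sketch dependency-free)."""
    heat = feature_maps.mean(axis=0)                  # (h, w) channel mean
    span = heat.max() - heat.min()
    heat = (heat - heat.min()) / (span + 1e-8)        # normalize to [0, 1]
    H, W = image_hw
    rows = np.arange(H) * heat.shape[0] // H          # nearest source row
    cols = np.arange(W) * heat.shape[1] // W          # nearest source col
    return heat[np.ix_(rows, cols)]                   # (H, W) heatmap

fm = np.arange(16, dtype=float).reshape(4, 2, 2)      # toy 4-channel maps
hm = activation_heatmap(fm, (8, 8))
```

Overlaying such a map on the creative highlights which regions (product, model, text) drive the score.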
In this paper, we propose a hybrid bandit model with visual priors. To the best of our knowledge, this is the first work to formulate creative ranking as an E&E problem with visual priors. The VAM adopts a list-wise ranking loss function for ordering creative quality by content alone. In addition to this visual evaluation ability, we extend the model, called HBM, to be updated with feedback from online scenarios. Last but not least, we construct and release a novel large-scale creative dataset named CreativeRanking. We hope to draw more attention to this topic, which benefits both the research community and websites' user experience. We carried out extensive experiments, including performance comparisons, an ablation study and case studies, demonstrating the solid improvements of the proposed model.
REFERENCES [1] Shipra Agrawal and Navin Goyal. 2013. Thompson Sampling for ContextualBandits with Linear Payoffs. In
Proceedings of the 30th International Conference onMachine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013 (JMLR Workshopand Conference Proceedings, Vol. 28) . JMLR.org, 127–135. http://proceedings.mlr.press/v28/agrawal13.html[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time Analysisof the Multiarmed Bandit Problem.
Mach. Learn.
47, 2-3 (2002), 235–256. https://doi.org/10.1023/A:1013689704352[3] Javad Azimi, Ruofei Zhang, Yang Zhou, Vidhya Navalpakkam, Jianchang Mao,and Xiaoli Z. Fern. 2012. The impact of visual appearance on user response inonline display advertising. In
Proceedings of the 21st World Wide Web Conference,WWW 2012, Lyon, France, April 16-20, 2012 (Companion Volume) , Alain Mille,Fabien L. Gandon, Jacques Misselis, Michael Rabinovich, and Steffen Staab (Eds.).ACM, 457–458. https://doi.org/10.1145/2187980.2188075[4] Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, NicoleHamilton, and Gregory N. Hullender. 2005. Learning to rank using gradientdescent. In
Machine Learning, Proceedings of the Twenty-Second InternationalConference (ICML 2005), Bonn, Germany, August 7-11, 2005 (ACM InternationalConference Proceeding Series, Vol. 119) , Luc De Raedt and Stefan Wrobel (Eds.).ACM, 89–96.[5] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learningto rank: from pairwise approach to listwise approach. In
Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005 (ACM International Conference Proceeding Series, Vol. 119), Luc De Raedt and Stefan Wrobel (Eds.). ACM, 89–96.[5] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007 (ACM International Conference Proceeding Series, Vol. 227), Zoubin Ghahramani (Ed.). ACM, 129–136. https://doi.org/10.1145/1273496.1273513[6] Mark Capelo, Karan Aggarwal, and Pranjul Yadav. 2019. Combining Text and Image data for Product Recommendability Modeling. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 5992–5994. https://doi.org/10.1109/BigData47090.2019.9006197[7] Parag S. Chandakkar, Vijetha Gattupalli, and Baoxin Li. 2017. A Computational Approach to Relative Aesthetics.
CoRR abs/1704.01248 (2017). arXiv:1704.01248http://arxiv.org/abs/1704.01248[8] Junxuan Chen, Baigui Sun, Hao Li, Hongtao Lu, and Xian-Sheng Hua. 2016. DeepCTR Prediction in Display Advertising. In
Proceedings of the 2016 ACM Conferenceon Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19,2016 , Alan Hanjalic, Cees Snoek, Marcel Worring, Dick C. A. Bulterman, BenoitHuet, Aisling Kelliher, Yiannis Kompatsiaris, and Jin Li (Eds.). ACM, 811–820.https://doi.org/10.1145/2964284.2964325[9] Haibin Cheng, Roelof van Zwol, Javad Azimi, Eren Manavoglu, Ruofei Zhang,Yang Zhou, and Vidhya Navalpakkam. 2012. Multimedia features for click pre-diction of new ads in display advertising. In
The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Qiang Yang, Deepak Agarwal, and Jian Pei (Eds.). ACM, 777–785.[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 248–255. https://doi.org/10.1109/CVPR.2009.5206848[11] Hossein Talebi Esfandarani and Peyman Milanfar. 2018. NIMA: Neural Image Assessment.
IEEE Trans. Image Process.
27, 8 (2018), 3998–4011. https://doi.org/10.1109/TIP.2018.2831899[12] Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, andJoelle Pineau. 2018. An Introduction to Deep Reinforcement Learning.
Found.Trends Mach. Learn.
11, 3-4 (2018), 219–354. https://doi.org/10.1561/2200000071[13] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian Approximation:Representing Model Uncertainty in Deep Learning. In
Proceedings of the 33ndInternational Conference on Machine Learning, ICML 2016, New York City, NY,USA, June 19-24, 2016 (JMLR Workshop and Conference Proceedings, Vol. 48) ,Maria-Florina Balcan and Kilian Q. Weinberger (Eds.). JMLR.org, 1050–1059.http://proceedings.mlr.press/v48/gal16.html[14] Tiezheng Ge, Liqin Zhao, Guorui Zhou, Keyu Chen, Shuying Liu, Huimin Yi,Zelin Hu, Bochao Liu, Peng Sun, Haoyu Liu, et al. 2018. Image matters: Visuallymodeling user behaviors using advanced model server. In
Proceedings of the27th ACM International Conference on Information and Knowledge Management .2087–2095.[15] Dorota Glowacka. 2017. Bandit Algorithms in Interactive Information Retrieval.In
Proceedings of the ACM SIGIR International Conference on Theory of InformationRetrieval, ICTIR 2017, Amsterdam, The Netherlands, October 1-4, 2017 , Jaap Kamps,Evangelos Kanoulas, Maarten de Rijke, Hui Fang, and Emine Yilmaz (Eds.). ACM,327–328. https://doi.org/10.1145/3121050.3121108[16] Dorota Glowacka. 2019. Bandit algorithms in recommender systems. In
Proceed-ings of the 13th ACM Conference on Recommender Systems . 574–575.[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residuallearning for image recognition. In
Proceedings of the IEEE conference on computervision and pattern recognition . 770–778.
[18] Shu Kong, Xiaohui Shen, Zhe L. Lin, Radomír Mech, and Charless C. Fowlkes. 2016. Photo Aesthetics Ranking Network with Attributes and Content Adaptation. In
Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, TheNetherlands, October 11-14, 2016, Proceedings, Part I (Lecture Notes in ComputerScience, Vol. 9905) , Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.).Springer, 662–679. https://doi.org/10.1007/978-3-319-46448-0_40[19] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In
Proceedingsof the 19th International Conference on World Wide Web, WWW 2010, Raleigh,North Carolina, USA, April 26-30, 2010 , Michael Rappa, Paul Jones, Juliana Freire,and Soumen Chakrabarti (Eds.). ACM, 661–670. https://doi.org/10.1145/1772690.1772758[20] Hu Liu, Jing Lu, Hao Yang, Xiwei Zhao, Sulong Xu, Hao Peng, Zehua Zhang,Wenjie Niu, Xiaokun Zhu, Yongjun Bao, et al. 2020. Category-Specific CNN forVisual-aware CTR Prediction at JD. com. In
Proceedings of the 26th ACM SIGKDDInternational Conference on Knowledge Discovery & Data Mining . 2686–2696.[21] Kaixiang Mo, Bo Liu, Lei Xiao, Yong Li, and Jie Jiang. 2015. Image FeatureLearning for Cold Start Problem in Display Advertising. In
Proceedings of theTwenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015,Buenos Aires, Argentina, July 25-31, 2015 , Qiang Yang and Michael J. Wooldridge(Eds.). AAAI Press, 3728–3734. http://ijcai.org/Abstract/15/524[22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.2017. Automatic differentiation in pytorch. (2017).[23] Doina Precup. 2000. Eligibility traces for off-policy policy evaluation.
Computer Science Department Faculty Publication Series (2000), 80.[24] Carlos Riquelme, George Tucker, and Jasper Snoek. 2018. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In 6th International Conference on Learning Representations, ICLR 2018. OpenReview.net. https://openreview.net/forum?id=SyYe6k-CW[25] Daniel Russo and Benjamin Van Roy. 2014. Learning to Optimize via Posterior Sampling.
Math. Oper. Res.
39, 4 (2014), 1221–1243. https://doi.org/10.1287/moor.2014.0650 [26] Jeff Schlimmer. 1981. Mushroom records drawn from the audubon society fieldguide to north american mushrooms.
GH Lincoff (Pres), New York (1981).[27] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. 2017. Customer acquisitionvia display advertising using multi-armed bandit experiments.
Marketing Science
36, 4 (2017), 500–522.[28] Katharina Schwarz, Patrick Wieschollek, and Hendrik P. A. Lensch. 2018. Will people like your image? Learning the aesthetic space. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2048–2057.[29] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1409.1556[30] Xuerui Wang, Wei Li, Ying Cui, Ruofei Zhang, and Jianchang Mao. 2011. Click-through rate estimation for rare events in online advertising. In
Online multimediaadvertising: Techniques and technologies . IGI Global, 1–12.[31] Yu Wang, Jixing Xu, Aohan Wu, Mantian Li, Yang He, Jinghe Hu, and Weipeng PYan. 2018. Telepath: Understanding users from a human vision perspective inlarge-scale recommender systems. In
Thirty-Second AAAI Conference on ArtificialIntelligence .[32] Mengyue Yang, Qingyang Li, Zhiwei Qin, and Jieping Ye. 2020. HierarchicalAdaptive Contextual Bandits for Resource Constraint based Recommendation. In
Proceedings of The Web Conference 2020 . 292–302.[33] Wenhui Yu, Huidi Zhang, Xiangnan He, Xu Chen, Li Xiong, and Zheng Qin. 2018.Aesthetic-based Clothing Recommendation. In
Proceedings of the 2018 World WideWeb Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018 ,Pierre-Antoine Champin, Fabien L. Gandon, Mounia Lalmas, and Panagiotis G.Ipeirotis (Eds.). ACM, 649–658. https://doi.org/10.1145/3178876.3186146[34] Zhichen Zhao, Lei Li, Bowen Zhang, Meng Wang, Yuning Jiang, Li Xu, FengkunWang, and Wei-Ying Ma. 2019. What You Look Matters?: Offline Evaluation ofAdvertising Creatives for Cold-start Problem. In