MaskNet: Introducing Feature-Wise Multiplication to CTR Ranking Models by Instance-Guided Mask
Zhiqiang Wang, Qingyun She, Junlin Zhang
Sina Weibo Corp, Beijing, China
{zhiqiang36,qingyun,junlin6}@staff.weibo.com
ABSTRACT
Click-Through Rate (CTR) estimation has become one of the most fundamental tasks in many real-world applications, and it is important for ranking models to effectively capture complex high-order features. Shallow feed-forward networks are widely used in many state-of-the-art DNN models such as FNN, DeepFM and xDeepFM to implicitly capture high-order feature interactions. However, some research has shown that additive feature interaction, in particular the feed-forward neural network, is inefficient at capturing common feature interactions. To resolve this problem, we introduce a specific multiplicative operation into the DNN ranking system by proposing the instance-guided mask, which performs element-wise product on both the feature embedding and feed-forward layers, guided by the input instance. We also turn the feed-forward layer in the DNN model into a mixture of additive and multiplicative feature interactions by proposing MaskBlock in this paper. MaskBlock combines layer normalization, the instance-guided mask, and a feed-forward layer, and it is a basic building block that can be used to design new ranking models under various configurations. A model consisting of MaskBlocks is called MaskNet in this paper, and two new MaskNet models are proposed to show the effectiveness of MaskBlock as a basic building block for composing high-performance ranking systems. The experimental results on three real-world datasets demonstrate that our proposed MaskNet models significantly outperform state-of-the-art models such as DeepFM and xDeepFM, which implies that MaskBlock is an effective basic building unit for composing new high-performance ranking systems.
ACM Reference Format:
Zhiqiang Wang, Qingyun She, Junlin Zhang. 2021. MaskNet: Introducing Feature-Wise Multiplication to CTR Ranking Models by Instance-Guided Mask. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION

Click-through rate (CTR) prediction is the task of predicting the probability that a user will click on a recommended item. It plays an important role in personalized advertising and recommender systems. Many models
have been proposed to resolve this problem, such as Logistic Regression (LR) [15], Polynomial-2 (Poly2) [16], tree-based models [7], tensor-based models [11], Bayesian models [5], and Field-aware Factorization Machines (FFMs) [10]. In recent years, employing DNNs for CTR estimation has also become a research trend in this field, and some deep learning based models have been introduced, such as Factorization-Machine Supported Neural Networks (FNN) [23], Attentional Factorization Machines (AFM) [21], Wide & Deep (W&D) [3], DeepFM [6], xDeepFM [12], etc.

Feature interaction is critical for CTR tasks, and it is important for ranking models to effectively capture these complex features. Most DNN ranking models such as FNN, W&D, DeepFM and xDeepFM use shallow MLP layers to model high-order interactions in an implicit way, and this is an important component in current state-of-the-art ranking systems.

However, Alex Beutel et al. [2] have shown that additive feature interaction, in particular the feed-forward neural network, is inefficient at capturing common feature crosses. They proposed a simple but effective approach named "latent cross", a kind of multiplicative interaction between the context embedding and the neural network hidden states in an RNN model. Recently, Rendle et al.'s work [17] also shows that a carefully configured dot product baseline largely outperforms the MLP layer in collaborative filtering. While an MLP can in theory approximate any function, they show that it is non-trivial to learn a dot product with an MLP, and that learning a dot product with high accuracy for a decently large embedding dimension requires a large model capacity as well as much training data. Their work also demonstrates the inefficiency of the MLP layer in modeling complex feature interactions.

Inspired by "latent cross" [2] and Rendle's work [17], we consider the following question: can we improve DNN ranking systems by introducing a specific multiplicative operation that makes them capture complex feature interactions efficiently?

In order to overcome the inefficiency of the feed-forward layer in capturing complex feature crosses, we introduce a special kind of multiplicative operation into the DNN ranking system in this paper. First, we propose an instance-guided mask performing element-wise product on the feature embedding and feed-forward layers. The instance-guided mask utilizes the global information collected from the input instance to dynamically highlight the informative elements in the feature embedding and hidden layers in a unified manner. There are two main advantages to adopting the instance-guided mask: firstly, the element-wise product between the mask and the hidden layer or feature embedding layer brings the multiplicative operation into the DNN ranking system in a unified way to capture complex feature interactions more efficiently; secondly, it is a kind of fine-grained bit-wise attention guided by the input instance, which can both
weaken the influence of noise in the feature embedding and MLP layers and highlight the informative signals in DNN ranking systems.

By combining the instance-guided mask, a following feed-forward layer and layer normalization, we propose MaskBlock, which turns the commonly used feed-forward layer into a mixture of additive and multiplicative feature interactions. The instance-guided mask introduces multiplicative interactions, the following feed-forward hidden layer aggregates the masked information in order to better capture the important feature interactions, and the layer normalization eases optimization of the network.

MaskBlock can be regarded as a basic building block for designing new ranking models under various configurations. A model consisting of MaskBlocks is called MaskNet in this paper, and two new MaskNet models are proposed to show the effectiveness of MaskBlock as a basic building block for composing high-performance ranking systems.

The contributions of our work are summarized as follows:
(1) We propose an instance-guided mask performing element-wise product on both the feature embedding and feed-forward layers in DNN models. The global context information contained in the instance-guided mask is dynamically incorporated into the feature embedding and feed-forward layers to highlight the important elements.
(2) We propose a basic building block named MaskBlock which consists of three key components: an instance-guided mask, a following feed-forward hidden layer and a layer normalization module. In this way, we turn the widely used feed-forward layer of a standard DNN model into a mixture of additive and multiplicative feature interactions.
(3) We propose a new ranking framework named MaskNet to compose new ranking systems by utilizing MaskBlock as the basic building unit. To be more specific, a serial MaskNet model and a parallel MaskNet model are designed based on MaskBlock in this paper. The serial model stacks MaskBlocks block by block, while the parallel model places many MaskBlocks in parallel on a shared feature embedding layer.
(4) Extensive experiments are conducted on three real-world datasets, and the results demonstrate that our proposed two MaskNet models outperform state-of-the-art models significantly. The results imply that MaskBlock indeed enhances a DNN model's ability to capture complex feature interactions by introducing the multiplicative operation into DNN models via the instance-guided mask.

The rest of this paper is organized as follows. Section 2 introduces related work relevant to our proposed model. We introduce our proposed models in detail in Section 3. The experimental results on three real-world datasets are presented and discussed in Section 4. Section 5 concludes our work.

2 RELATED WORK

Many deep learning based CTR models have been proposed in recent years, and effectively modeling the feature interactions is the key factor for most of these neural network based models. Factorization-Machine Supported Neural Networks (FNN) [23] is a feed-forward neural network using FM to pre-train the embedding layer. Wide & Deep Learning [3] jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems. However, expert feature engineering is still needed on the input to the wide part of the Wide & Deep model.
To alleviate the manual effort in feature engineering, DeepFM [6] replaces the wide part of the Wide & Deep model with FM and shares the feature embedding between the FM and deep components.

While most DNN ranking models process high-order feature interactions with MLP layers in an implicit way, some works explicitly introduce high-order feature interactions through sub-networks. Deep & Cross Network (DCN) [20] efficiently captures feature interactions of bounded degrees in an explicit fashion. Similarly, eXtreme Deep Factorization Machine (xDeepFM) [12] also models the low-order and high-order feature interactions in an explicit way by proposing a novel Compressed Interaction Network (CIN) module. AutoInt [18] uses a multi-head self-attentive neural network to explicitly model the feature interactions in a low-dimensional space.
Feature-wise masking or gating has been explored widely in vision [8, 19], natural language processing [4] and recommendation systems [13, 14]. For example, Highway Networks [19] utilize feature gating to ease gradient-based training of very deep networks. Squeeze-and-Excitation Networks [8] recalibrate feature responses by explicitly multiplying each channel with learned sigmoidal mask values. Dauphin et al. [4] proposed the gated linear unit (GLU) to control what information should be propagated for predicting the next word in the language modeling task. Gating or mask mechanisms are also adopted in recommendation systems. Ma et al. [14] propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. Ma et al. [13] propose a hierarchical gating network (HGN) to capture both the long-term and short-term user interests. The feature gating and instance gating modules in HGN select what item features can be passed to the downstream layers at the feature and instance levels, respectively.
Normalization techniques have been recognized as very effective components in deep learning. Many normalization approaches have been proposed, with the two most popular ones being BatchNorm [9] and LayerNorm [1]. Batch Normalization (BatchNorm or BN) [9] normalizes the features by the mean and variance computed within a mini-batch. Layer normalization (LayerNorm or LN) [1] was proposed to ease optimization of recurrent neural networks; its statistics are not computed across the N samples in a mini-batch but are estimated in a layer-wise manner for each sample independently. Normalization methods have shown success in accelerating the training of deep networks.

3 OUR PROPOSED MODEL

In this section, we first describe the feature embedding layer. Then the details of the instance-guided mask, MaskBlock and the MaskNet structures we propose are introduced. Finally, the log loss used as the loss function is presented.
Embedding Layer:

The input data of CTR tasks usually consists of sparse and dense features, and the sparse features are mostly categorical. Such features are encoded as one-hot vectors, which often leads to excessively high-dimensional feature spaces for large vocabularies. The common solution to this problem is to introduce an embedding layer. Generally, the sparse input can be formulated as:

$x = [x_1, x_2, ..., x_f] \quad (1)$

where $f$ denotes the number of fields, $x_i \in \mathbb{R}^n$ denotes a one-hot vector for a categorical field with $n$ features, and $x_i \in \mathbb{R}^n$ is a vector with only one non-zero value for a numerical field. We can obtain the feature embedding $e_i$ for the one-hot vector $x_i$ via:

$e_i = W_e x_i \quad (2)$

where $W_e \in \mathbb{R}^{k \times n}$ is the embedding matrix of $n$ features and $k$ is the dimension of the field embedding. A numerical feature $x_j$ can also be converted into the same low-dimensional space by:

$e_j = V_j x_j \quad (3)$

where $V_j \in \mathbb{R}^k$ is the corresponding field embedding with size $k$.

Through the aforementioned method, an embedding layer is applied to the raw feature input to compress it to a low-dimensional, dense real-valued vector. The result of the embedding layer is a wide concatenated vector:

$V_{emb} = concat(e_1, e_2, ..., e_i, ..., e_f) \quad (4)$

where $f$ denotes the number of fields and $e_i \in \mathbb{R}^k$ denotes the embedding of one field. Although the feature lengths of input instances may vary, their embeddings are of the same length $f \times k$, where $k$ is the dimension of the field embedding.
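As a concrete illustration of Eq. (1)-(4), here is a minimal NumPy sketch of the embedding lookup; the field count, vocabulary size, embedding dimension and all weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical sizes: f fields, vocabulary of n features per field,
# embedding dimension k (names follow Eq. (1)-(4)).
f, n, k = 4, 100, 8

rng = np.random.default_rng(0)
W_e = rng.normal(size=(k, n))          # embedding matrix of one categorical field

x_i = np.zeros(n)                      # one-hot vector for a categorical field
x_i[17] = 1.0
e_i = W_e @ x_i                        # Eq. (2): e_i = W_e x_i (a column lookup)

V_j = rng.normal(size=k)               # field embedding for a numerical field
x_j = 0.73                             # scalar numerical feature value
e_j = V_j * x_j                        # Eq. (3): e_j = V_j x_j

# Eq. (4): concatenate all field embeddings into one vector of length f * k
V_emb = np.concatenate([e_i, e_j] + [rng.normal(size=k) for _ in range(f - 2)])
assert V_emb.shape == (f * k,)
```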
Instance-Guided Mask:

We use the instance-guided mask to introduce the multiplicative operation into the DNN ranking system; here the so-called "instance" means the feature embedding layer of the current input instance in the remainder of this paper.

We utilize the global information collected from the input instance by the instance-guided mask to dynamically highlight the informative elements in the feature embedding and feed-forward layers. For the feature embedding, the mask lays stress on the key elements that carry more information to effectively represent the feature. For the neurons in a hidden layer, the mask helps the important feature interactions stand out by considering the contextual information of the input instance. In addition to this advantage, the instance-guided mask also introduces the multiplicative operation into the DNN ranking system to capture complex feature crosses more efficiently.

Figure 1: Neural Structure of Instance-Guided Mask

As depicted in Figure 1, two fully connected (FC) layers with identity activation are used in the instance-guided mask. Notice that the input of the instance-guided mask is always the input instance, that is to say, the feature embedding layer.

The first FC layer is called the "aggregation layer"; it is relatively wider than the second FC layer in order to better collect the global contextual information of the input instance. The aggregation layer has parameters $W_{d1}$, where $d$ denotes the $d$-th mask. For the feature embedding and the different MLP layers, we adopt different instance-guided masks, each with its own parameters, to learn to capture the relevant information for each layer from the input instance.

The second FC layer, named the "projection layer", reduces the dimensionality to the same size as the feature embedding layer $V_{emb}$ or the hidden layer $V_{hidden}$ with parameters $W_{d2}$. Formally,

$V_{mask} = W_{d2}(W_{d1} V_{emb} + \beta_{d1}) + \beta_{d2} \quad (5)$

where $V_{emb} \in \mathbb{R}^{m}$ (with $m = f \times k$) refers to the embedding layer of the input instance, $W_{d1} \in \mathbb{R}^{t \times m}$ and $W_{d2} \in \mathbb{R}^{z \times t}$ are the parameters of the instance-guided mask, $t$ and $z$ respectively denote the number of neurons of the aggregation layer and the projection layer, $f$ denotes the number of fields and $k$ is the dimension of the field embedding. $\beta_{d1} \in \mathbb{R}^{t}$ and $\beta_{d2} \in \mathbb{R}^{z}$ are the learned biases of the two FC layers. Notice that the aggregation layer is usually wider than the projection layer because the size of the projection layer is required to be equal to the size of the feature embedding layer or MLP layer. So we define the reduction ratio $r = t/z$ as a hyper-parameter to control the ratio between the neuron numbers of the two layers.

Element-wise product is used in this work to incorporate the global contextual information aggregated by the instance-guided mask into the feature embedding or hidden layer as follows:

$V_{maskedEMB} = V_{mask} \odot V_{emb}, \quad V_{maskedHID} = V_{mask} \odot V_{hidden} \quad (6)$

where $V_{emb}$ denotes the embedding layer, $V_{hidden}$ denotes the feed-forward layer in the DNN model, and $\odot$ denotes the element-wise product of two vectors:

$V_i \odot V_j = [V_{i1} \cdot V_{j1}, V_{i2} \cdot V_{j2}, ..., V_{iu} \cdot V_{ju}] \quad (7)$

where $u$ is the size of the vectors $V_i$ and $V_j$.

The instance-guided mask can be regarded as a special kind of bit-wise attention or gating mechanism which uses the global context information contained in the input instance to guide parameter optimization during training. A bigger value in $V_{mask}$ implies that the model dynamically identifies an important element in the feature embedding or hidden layer and boosts that element in $V_{emb}$ or $V_{hidden}$. On the contrary, a small value in $V_{mask}$ suppresses the uninformative elements or even noise by decreasing the corresponding values in $V_{emb}$ or $V_{hidden}$.

The two main advantages of adopting the instance-guided mask are: firstly, the element-wise product between the mask and the hidden layer or feature embedding layer brings the multiplicative operation into the DNN ranking system in a unified way to capture complex feature interactions more efficiently; secondly, this kind of fine-grained bit-wise attention guided by the input instance can both weaken the influence of noise in the feature embedding and MLP layers and highlight the informative signals in DNN ranking systems.

MaskBlock:

To overcome the inefficiency of the feed-forward layer in capturing complex feature interactions in DNN models, we propose a basic building block named MaskBlock for DNN ranking systems in this work, as shown in Figure 2 and Figure 3.
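The following sketch illustrates Eq. (5)-(6) with the same illustrative sizes as the embedding sketch above; the weight initialization and the choice of reduction ratio are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)
f, k = 4, 8
m = f * k            # size of the embedding layer V_emb
r = 2                # reduction ratio r = t / z (illustrative choice)
z = m                # projection size must match the layer being masked
t = r * z            # the aggregation layer is wider than the projection layer

V_emb = rng.normal(size=m)

# Parameters of the two FC layers of Eq. (5); both use identity activation.
W_d1, b_d1 = rng.normal(size=(t, m)) * 0.1, np.zeros(t)   # aggregation layer
W_d2, b_d2 = rng.normal(size=(z, t)) * 0.1, np.zeros(z)   # projection layer

V_mask = W_d2 @ (W_d1 @ V_emb + b_d1) + b_d2              # Eq. (5)

# Eq. (6): the element-wise product injects the mask into the embedding
V_maskedEMB = V_mask * V_emb
```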
The proposed MaskBlock consists of three key components: a layer normalization module, an instance-guided mask, and a feed-forward hidden layer. The layer normalization eases optimization of the network. The instance-guided mask introduces multiplicative interactions into the feed-forward layer of a standard DNN model, and the feed-forward hidden layer aggregates the masked information in order to better capture the important feature interactions. In this way, we turn the widely used feed-forward layer of a standard DNN model into a mixture of additive and multiplicative feature interactions. First, we briefly review the formulation of LayerNorm.
Layer Normalization:
In general, normalization aims to ensure that signals have zero mean and unit variance as they propagate through a network, in order to reduce "covariate shift" [9]. As an example, layer normalization (LayerNorm or LN) [1] was proposed to ease optimization of recurrent neural networks. Specifically, let $x = (x_1, x_2, ..., x_H)$ denote the vector representation of an input of size $H$ to a normalization layer. LayerNorm re-centers and re-scales the input $x$ as

$h = g \odot N(x) + b, \quad N(x) = \frac{x - \mu}{\delta}, \quad \mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \quad \delta = \sqrt{\frac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2} \quad (8)$

where $h$ is the output of the LayerNorm layer, $\odot$ is the element-wise product operation, $\mu$ and $\delta$ are the mean and standard deviation of the input, and the bias $b$ and gain $g$ are parameters with the same dimension $H$.

As one of the key components of MaskBlock, layer normalization can be applied to both the feature embedding and the feed-forward layer. For the feature embedding layer, we regard each feature's embedding as a layer to compute the mean, standard deviation, bias and gain of LN as follows:

$LN\_EMB(V_{emb}) = concat(LN(e_1), LN(e_2), ..., LN(e_i), ..., LN(e_f)) \quad (9)$

As for the feed-forward layer in the DNN model, the statistics of LN are estimated among the neurons contained in the corresponding hidden layer as follows:

$LN\_HID(V_{hidden}) = ReLU(LN(W_i X)) \quad (10)$

where $X \in \mathbb{R}^t$ refers to the input of the feed-forward layer, $W_i \in \mathbb{R}^{m \times t}$ are the parameters of the layer, and $t$ and $m$ respectively denote the size of the input layer and the number of neurons of the feed-forward layer. Notice that there are two places to put the normalization operation in the MLP: before the non-linear operation or after it. We find that the performance of normalization before the non-linearity consistently outperforms that of normalization after the non-linearity, so all the normalization used in the MLP part of this paper is placed before the non-linear operation, as formula (10) shows.

Figure 2: MaskBlock on Feature Embedding
Figure 3: MaskBlock on MaskBlock
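A minimal NumPy version of LayerNorm (Eq. (8)) and the per-field LN_EMB of Eq. (9) may make the computation concrete; the gain and bias are initialized to identity values here purely for illustration:

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-6):
    """Eq. (8): re-center and re-scale x with learned gain g and bias b."""
    mu = x.mean()
    delta = np.sqrt(((x - mu) ** 2).mean())
    return g * (x - mu) / (delta + eps) + b

rng = np.random.default_rng(0)
f, k = 4, 8
V_emb = rng.normal(size=(f, k))        # one row per field embedding

# Eq. (9): LN_EMB applies LayerNorm to each field embedding separately
g, b = np.ones(k), np.zeros(k)         # identity initialization (illustrative)
LN_EMB = np.concatenate([layer_norm(e_i, g, b) for e_i in V_emb])
```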
MaskBlock on Feature Embedding:

We propose MaskBlock by combining three key elements: layer normalization, the instance-guided mask and a following feed-forward layer. MaskBlocks can be stacked to form deeper networks. According to the different inputs of each MaskBlock, we have two kinds of MaskBlocks: MaskBlock on feature embedding and MaskBlock on MaskBlock. We first introduce MaskBlock on feature embedding, as depicted in Figure 2.

The feature embedding $V_{emb}$ is the only input of MaskBlock on feature embedding. After the layer normalization operation on the embedding $V_{emb}$, MaskBlock utilizes the instance-guided mask to highlight the informative elements in $V_{emb}$ by element-wise product. Formally,

$V_{maskedEMB} = V_{mask} \odot LN\_EMB(V_{emb}) \quad (11)$

where $\odot$ denotes the element-wise product between the instance-guided mask and the normalized vector $LN\_EMB(V_{emb})$, and $V_{maskedEMB}$ denotes the masked feature embedding. Notice that the input of the instance-guided mask $V_{mask}$ is also the feature embedding $V_{emb}$.
We introduce a feed-forward hidden layer and a following layer normalization operation in MaskBlock to better aggregate the masked information through a normalized non-linear transformation. The output of MaskBlock can be calculated as follows:

$V_{output} = LN\_HID(W_i V_{maskedEMB}) = ReLU(LN(W_i(V_{mask} \odot LN\_EMB(V_{emb})))) \quad (12)$

where $W_i \in \mathbb{R}^{q \times n}$ are the parameters of the feed-forward layer in the $i$-th MaskBlock, $n$ denotes the size of $V_{maskedEMB}$ and $q$ denotes the number of neurons of the feed-forward layer.

The instance-guided mask introduces the element-wise product into the feature embedding as a fine-grained attention, while normalization on both the feature embedding and the hidden layer eases the network optimization. These key components in MaskBlock help the feed-forward layer capture complex feature crosses more efficiently.
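Putting the pieces together, the following sketch implements a MaskBlock forward pass covering both variants in this paper: the block on feature embedding (Eq. (11)-(12)) and the block on a previous MaskBlock's output introduced in the next subsection (Eq. (13)-(14)). All sizes, the weight initialization and the simplified whole-vector LN (standing in for the per-field LN_EMB) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    # Simplified LN (Eq. (8)) without the learned gain/bias, for brevity.
    return (x - x.mean()) / (x.std() + eps)

def instance_guided_mask(V_emb, W_d1, b_d1, W_d2, b_d2):
    return W_d2 @ (W_d1 @ V_emb + b_d1) + b_d2            # Eq. (5)

def mask_block(V_input, V_emb, p, on_embedding):
    """One MaskBlock: LN -> instance-guided mask -> FFN -> LN -> ReLU."""
    if on_embedding:
        # MaskBlock on feature embedding: whole-vector LN stands in for
        # the per-field LN_EMB of Eq. (9).
        V_input = layer_norm(V_input)
    V_masked = instance_guided_mask(V_emb, *p["mask"]) * V_input  # Eq. (11)/(13)
    return np.maximum(layer_norm(p["W"] @ V_masked), 0.0)        # Eq. (12)/(14)

def init_block(m, n, q, r=2):
    """Random parameters: mask over an n-sized input, FFN with q neurons."""
    t = r * n
    return {"mask": (rng.normal(size=(t, m)) * 0.1, np.zeros(t),
                     rng.normal(size=(n, t)) * 0.1, np.zeros(n)),
            "W": rng.normal(size=(q, n)) * 0.1}

m, q = 32, 16                       # embedding size f*k and FFN width (toy)
V_emb = rng.normal(size=m)
p1 = init_block(m, m, q)            # MaskBlock on feature embedding
out = mask_block(V_emb, V_emb, p1, on_embedding=True)
```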
MaskBlock on MaskBlock:

In this subsection, we introduce MaskBlock on MaskBlock, as depicted in Figure 3. There are two different inputs to this kind of MaskBlock: the feature embedding $V_{emb}$ and the output $V^p_{output}$ of the previous MaskBlock. The input of the instance-guided mask for this kind of MaskBlock is always the feature embedding $V_{emb}$. MaskBlock utilizes the instance-guided mask to highlight the important feature interactions in the previous MaskBlock's output $V^p_{output}$ by element-wise product. Formally,

$V_{maskedHID} = V_{mask} \odot V^p_{output} \quad (13)$

where $\odot$ denotes the element-wise product between the instance-guided mask $V_{mask}$ and the previous MaskBlock's output $V^p_{output}$, and $V_{maskedHID}$ denotes the masked hidden layer.

In order to better capture the important feature interactions, another feed-forward hidden layer and a following layer normalization are introduced in MaskBlock. In this way, we turn the widely used feed-forward layer of a standard DNN model into a mixture of additive and multiplicative feature interactions, avoiding the ineffectiveness of purely additive feature cross models. The output of MaskBlock can be calculated as follows:

$V_{output} = LN\_HID(W_i V_{maskedHID}) = ReLU(LN(W_i(V_{mask} \odot V^p_{output}))) \quad (14)$

where $W_i \in \mathbb{R}^{q \times n}$ are the parameters of the feed-forward layer in the $i$-th MaskBlock, $n$ denotes the size of $V_{maskedHID}$ and $q$ denotes the number of neurons of the feed-forward layer.

Based on MaskBlock, various new ranking models can be designed under different configurations. A ranking model consisting of MaskBlocks is called MaskNet in this work. We propose two MaskNet models by utilizing MaskBlock as the basic building block.

Figure 4: Structure of Serial Model and Parallel Model
Serial MaskNet:
We can stack one MaskBlock after another to build a ranking system, as shown by the left model in Figure 4. The first block is a MaskBlock on feature embedding, and all other blocks are MaskBlocks on MaskBlock, forming a deeper network. The prediction layer is put on the final MaskBlock's output vector. We call MaskNet under this serial configuration SerMaskNet in this paper. The inputs of the instance-guided masks in every MaskBlock all come from the feature embedding layer $V_{emb}$, which makes the serial MaskNet model look like an RNN model with a shared input at each time step.
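A minimal sketch of the serial configuration, reusing mask_block() and init_block() from the MaskBlock sketch above; the block count and widths are illustrative:

```python
def ser_mask_net(V_emb, blocks):
    # First block runs on the feature embedding (Eq. (11)-(12)); each later
    # block runs on the previous block's output (Eq. (13)-(14)), while every
    # instance-guided mask is fed the shared V_emb.
    V = mask_block(V_emb, V_emb, blocks[0], on_embedding=True)
    for p in blocks[1:]:
        V = mask_block(V, V_emb, p, on_embedding=False)
    return V            # fed into the prediction layer, Eq. (17)

blocks = [init_block(m, m, q)] + [init_block(m, q, q) for _ in range(2)]
print(ser_mask_net(V_emb, blocks).shape)   # (q,): output of the last block
```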
Parallel MaskNet:

We propose another MaskNet by placing several MaskBlocks on feature embedding in parallel on a shared feature embedding layer, as depicted by the right model in Figure 4. Under this configuration, the input of each block is only the shared feature embedding $V_{emb}$. We can regard this ranking model as a mixture of multiple experts, just as MMoE [14] does. Each MaskBlock pays attention to a specific kind of important features or feature interactions. We collect the information of every expert by concatenating the outputs of all MaskBlocks:

$V_{merge} = concat(V^1_{output}, V^2_{output}, ..., V^i_{output}, ..., V^u_{output}) \quad (15)$

where $V^i_{output} \in \mathbb{R}^q$ is the output of the $i$-th MaskBlock, $q$ denotes the number of neurons of the feed-forward layer in a MaskBlock, and $u$ is the number of MaskBlocks.

To further merge the feature interactions captured by each expert, multiple feed-forward layers are stacked on the concatenated information $V_{merge}$. Let $H_0 = V_{merge}$ denote the output of the concatenation layer; $H_0$ is then fed into the deep neural network, and the feed-forward process is:

$H_l = ReLU(W_l H_{l-1} + \beta_l) \quad (16)$

where $l$ is the depth and ReLU is the activation function. $W_l$, $\beta_l$ and $H_l$ are the model weights, bias and output of the $l$-th layer. The prediction layer is put on the last layer of the stacked feed-forward network. We call this version of MaskNet "ParaMaskNet" in the remainder of this paper.

To summarize, we give the overall formulation of our proposed models' output as:

$\hat{y} = \delta(w_0 + \sum_{i=1}^{n} w_i x_i) \quad (17)$

where $\hat{y} \in (0, 1)$ is the predicted CTR value, $\delta$ is the sigmoid function, $n$ is the size of the last MaskBlock's output (SerMaskNet) or of the last feed-forward layer (ParaMaskNet), $x_i$ is the bit value of that layer and $w_i$ is the learned weight for each bit value.

For binary classification, the loss function is the log loss:

$L = -\frac{1}{N}\sum_{i=1}^{N} \left( y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right) \quad (18)$

where $N$ is the total number of training instances, $y_i$ is the ground truth of the $i$-th instance and $\hat{y}_i$ is the predicted CTR. The optimization process minimizes the following objective function:

$\mathfrak{L} = L + \lambda\|\Theta\| \quad (19)$

where $\lambda$ denotes the regularization weight and $\Theta$ denotes the set of parameters, including those in the feature embedding matrices, the instance-guided mask matrices, the feed-forward layers in MaskBlocks, and the prediction part.
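Continuing the same illustrative sketch, ParaMaskNet's forward pass (Eq. (15)-(17)) and the log loss (Eq. (18)) can be written as follows; it reuses mask_block(), init_block() and the toy sizes defined earlier, and all weights remain toy assumptions:

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def para_mask_net(V_emb, blocks, mlp, w_out, b_out):
    # Eq. (15): every MaskBlock sees the shared embedding; concatenate outputs.
    H = np.concatenate(
        [mask_block(V_emb, V_emb, p, on_embedding=True) for p in blocks])
    for W_l, b_l in mlp:                       # Eq. (16): stacked FFN layers
        H = np.maximum(W_l @ H + b_l, 0.0)
    return sigmoid(w_out @ H + b_out)          # Eq. (17): sigmoid prediction

def log_loss(y, y_hat, eps=1e-7):
    """Eq. (18): averaged binary cross-entropy."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

u = 3                                          # number of parallel MaskBlocks
blocks = [init_block(m, m, q) for _ in range(u)]
mlp = [(rng.normal(size=(8, u * q)) * 0.1, np.zeros(8))]
y_hat = para_mask_net(V_emb, blocks, mlp, rng.normal(size=8) * 0.1, 0.0)
print(log_loss(np.array([1.0]), np.array([y_hat])))
```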
4 EXPERIMENTS

In this section, we evaluate the proposed approaches on three real-world datasets and conduct detailed ablation studies to answer the following research questions:
• RQ1: Does the proposed MaskNet model based on MaskBlock perform better than existing state-of-the-art deep learning based CTR models?
• RQ2: What are the influences of the various components in the MaskBlock architecture? Is each component necessary to build an effective ranking system?
• RQ3: How do the hyper-parameters of the networks influence the performance of our proposed two MaskNet models?
• RQ4: Does the instance-guided mask highlight the important elements in the feature embedding and feed-forward layers according to the input instance?
In the following, we first describe the experimental settings, then answer the above research questions.
The following three datasets are used in our experiments:
(1) Criteo Dataset:
As a very famous public real-world display-ad dataset with each ad's display information and corresponding user click feedback, the Criteo dataset is widely used in many CTR model evaluations. There are 26 anonymous categorical fields and 13 continuous feature fields in the Criteo dataset.
(2) Malware Dataset:
Malware is a dataset from the Kaggle competition Microsoft Malware Prediction. The goal of this competition is to predict a Windows machine's probability of getting infected. The malware prediction task can be formulated as a binary classification problem, like a typical CTR estimation task.
(3) Avazu Dataset:
The Avazu dataset consists of several days of ad click-through data, ordered chronologically. Each click record contains fields that indicate the elements of a single ad impression.

We randomly split the instances into training, validation and test sets, and Table 1 lists the statistics of the evaluation datasets.

Table 1: Statistics of the evaluation datasets
AUC (Area Under the ROC Curve) is used as the evaluation metric in our experiments. AUC's upper bound is 1, and a larger value indicates better performance.

RelaImp, as in work [22], is also used to measure the relative AUC improvement over the corresponding baseline model as another evaluation metric. Since a random strategy yields an AUC of 0.5, we can remove this constant part of the AUC score and formalize RelaImp as:

$RelaImp = \frac{AUC(Measured\ Model) - 0.5}{AUC(Base\ Model) - 0.5} - 1 \quad (20)$

We compare the performance of the following CTR estimation models with our proposed approaches: FM, DNN, DeepFM, Deep & Cross Network (DCN), xDeepFM and AutoInt, all of which are discussed in Section 2. FM is used as the base model in the evaluation.
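For concreteness, Eq. (20) amounts to the following small helper (a sketch; the example numbers are made up):

```python
def rela_imp(auc_model, auc_base):
    """Eq. (20): relative AUC improvement after removing the 0.5
    score that a random predictor would achieve."""
    return (auc_model - 0.5) / (auc_base - 0.5) - 1.0

# e.g. rela_imp(0.81, 0.79) -> ~0.069, i.e. a +6.9% relative improvement
```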
We implement all the models with TensorFlow in our experiments. We use Adam as the optimization method with a fixed mini-batch size and learning rate. Focusing on the neural network structures in this paper, we set the dimension of the field embedding for all models to the fixed value of 10. For models with a DNN part, all models share the same depth of hidden layers and number of neurons per layer, and all activation functions are ReLU. For the default setting of MaskBlock, a fixed reduction ratio is used for the instance-guided mask. We conduct our experiments with NVIDIA Tesla GPUs.

(Criteo: http://labs.criteo.com/downloads/download-terabyte-click-logs/)

Table 2: Overall performance (AUC) of different models on three datasets (feature embedding size = 10; our proposed two models both have 3 MaskBlocks with the same default settings.)
Model       | Criteo AUC | RelaImp | Malware AUC | RelaImp | Avazu AUC | RelaImp
FM          | 0.7895     | 0.00%   | 0.7166      | 0.00%   | 0.7785    | 0.00%
DNN         | 0.8054     | +5.35%  | 0.7246      | +3.70%  | 0.7820    | +1.26%
DeepFM      | 0.8057     | +5.46%  | 0.7293      | +5.86%  | 0.7833    | +1.72%
DCN         | 0.8058     | +5.49%  | 0.7300      | +6.19%  | 0.7830    | +1.62%
xDeepFM     | 0.8064     | +5.70%  | 0.7310      | +6.65%  | 0.7841    | +2.01%
AutoInt     | 0.8051     | +5.39%  | 0.7282      | +5.36%  | 0.7824    | +1.40%
SerMaskNet  | 0.8119     | +7.74%  | —           | —       | —         | —
ParaMaskNet | 0.8124     | +7.91%  | —           | —       | —         | —
The overall performance of the different models on the three evaluation datasets is shown in Table 2. From the experimental results, we can see that:
(1) Both the serial model and the parallel model achieve better performance on all three datasets and obtain significant improvements over the state-of-the-art methods, boosting accuracy over the FM, DeepFM and xDeepFM baselines alike. We also conduct significance tests to verify that our proposed models outperform the baselines with statistical significance. Though the MaskNet models lack a module such as the CIN in xDeepFM to explicitly capture high-order feature interactions, they still achieve better performance because of the existence of MaskBlock. The experimental results imply that MaskBlock indeed enhances a DNN model's ability to capture complex feature interactions by introducing the multiplicative operation into DNN models via the instance-guided mask on the normalized feature embedding and feed-forward layers.
(2) As for the comparison of the serial and parallel models, the experimental results show comparable performance on the three evaluation datasets. This demonstrates that MaskBlock is an effective basic building unit for composing various high-performance ranking systems.

In order to better understand the impact of each component in MaskBlock, we perform ablation experiments over the key components of MaskBlock, removing one of them at a time and observing the performance change: the mask module, layer normalization (LN) and the feed-forward network (FFN). Table 3 shows the results of our two full MaskNet models and their variants, each removing one component.

From the results in Table 3, we can see that removing either the instance-guided mask or layer normalization decreases the model's performance, which implies that both are necessary components in MaskBlock. As for the feed-forward layer in MaskBlock, its effect differs between the serial and parallel models: the serial model's performance dramatically degrades when the feed-forward layer is removed, while removal seems to do no harm to the parallel model. We deem that this implies the feed-forward layer in MaskBlock is important for merging the feature interaction information after the instance-guided mask. For the parallel model, the multiple feed-forward layers above the parallel MaskBlocks serve a similar function to the feed-forward layer inside MaskBlock, which may explain the performance difference between the two models when this component is removed.
Table 3: Overall performance (AUC) of MaskNet models removing different components in MaskBlock on the Criteo dataset (feature embedding size = 10; each MaskNet model has 3 MaskBlocks.)
Model Name | SerMaskNet | ParaMaskNet
Full       | 0.8119     | 0.8124
-w/o Mask  | 0.8090     | 0.8093
-w/o LN    | 0.8106     | 0.8103
-w/o FFN   | 0.8085     | 0.8122
In the following part of the paper, we study the impact of hyper-parameters on the two MaskNet models, including 1) the feature embedding size; 2) the number of MaskBlocks; and 3) the reduction ratio in the instance-guided mask module. The experiments are conducted on the Criteo dataset by changing one hyper-parameter while holding the others fixed. The hyper-parameter experiments show similar trends on the other two datasets.

Feature Embedding Size.
The results in Table 4 show the impact of the feature embedding size on model performance. It can be observed that the performance of both models increases as the embedding size grows at the beginning. However, model performance degrades when the embedding size is set greater than 50 for the SerMaskNet model and 30 for the ParaMaskNet model. The experimental results indicate that the models benefit from a larger feature embedding size, up to a point.

Table 4: Overall performance (AUC) of different feature embedding sizes of MaskNet models on the Criteo dataset (the number of MaskBlocks is 3)

Embedding Size | 10     | 20     | 30     | 50     | 80
SerMaskNet     | 0.8119 | 0.8123 | 0.8121 | 0.8125 | 0.8121
ParaMaskNet    | 0.8124 | 0.8128 | 0.8131 | 0.8129 | 0.8129
Table 5: Overall performance (AUC) of different numbers of MaskBlocks in MaskNet models on the Criteo dataset (embedding size = 10)

Block Number | 1      | 3      | 5      | 7      | 9
SerMaskNet   | 0.8110 | 0.8119 | 0.8126 | 0.8117 | 0.8115
ParaMaskNet  | 0.8113 | 0.8124 | 0.8127 | 0.8128 | 0.8132
Table 6: Overall performance (AUC) of different sizes of the hidden layer in the mask module of MaskBlock on the Criteo dataset (embedding size = 10, number of MaskBlocks is 3)

Reduction Ratio | 1      | 2      | 3      | 4      | 5
SerMaskNet      | 0.8118 | 0.8119 | 0.8120 | 0.8117 | 0.8119
ParaMaskNet     | 0.8124 | 0.8124 | 0.8122 | 0.8122 | 0.8124
Number of MaskBlocks.
To understand the influence of the number of MaskBlocks on model performance, we conduct experiments stacking from 1 to 9 MaskBlocks for both MaskNet models. The experimental results are listed in Table 5. For the SerMaskNet model, the performance increases with more blocks at the beginning, until the number is set greater than 5. The performance of the ParaMaskNet model, in contrast, keeps slowly increasing as we add more MaskBlocks. This may indicate that more experts boost the ParaMaskNet model's performance, though at a higher computational cost.

Reduction Ratio in Instance-Guided Mask.
To explore the influence of the reduction ratio in the instance-guided mask, we conduct experiments adjusting the reduction ratio from 1 to 5 by changing the size of the aggregation layer. The experimental results are shown in Table 6, and we can observe that the various reduction ratios have little influence on model performance. This indicates that we can adopt a small reduction ratio in the aggregation layer in real-life applications to save computational resources.

As discussed in Section 3.2, the instance-guided mask can be regarded as a special kind of bit-wise attention mechanism that highlights important information based on the current input instance. We can utilize the instance-guided mask to boost the informative elements and suppress the uninformative elements or even noise in the feature embedding and feed-forward layers.

To verify this, we design the following experiment: after training the SerMaskNet with 3 blocks, we input different instances into the model and observe the outputs of the corresponding instance-guided masks.

Figure 5: Distribution of Mask Values
Figure 6: Mask Values of Two Examples
Firstly, we randomly sample different instances from the Criteo dataset and observe the distributions of the values produced by the instance-guided masks of different blocks. Figure 5 shows the result. We can see that the distribution of mask values roughly follows a normal distribution. Over 50% of the mask values are small numbers near zero, and only a small fraction of the mask values are relatively large. This implies that a large fraction of the signals in the feature embedding and feed-forward layers is uninformative or even noise, and is suppressed by the small mask values, while some informative signals are boosted by the larger mask values through the instance-guided mask.

Secondly, we randomly sample two instances and compare the differences in the values produced by the instance-guided mask. The results are shown in Figure 6. We can see that, for the mask values on the feature embedding, different input instances lead the mask to pay attention to different areas: the mask outputs of instance A pay more attention to the first few features, while the mask values of instance B focus on some bits of other features. We can observe a similar trend in the mask values of the feed-forward layer. This indicates that the input instance indeed guides the mask to pay attention to different parts of the feature embedding and feed-forward layers.
5 CONCLUSION

In this paper, we introduce the multiplicative operation into DNN ranking systems by proposing the instance-guided mask, which performs element-wise product on both the feature embedding and feed-forward layers. We also turn the feed-forward layer in the DNN model into a mixture of additive and multiplicative feature interactions by proposing MaskBlock, which combines layer normalization, the instance-guided mask, and a feed-forward layer. MaskBlock is a basic building block for designing new ranking models. We also propose two specific MaskNet models based on MaskBlock. The experimental results on three real-world datasets demonstrate that our proposed models significantly outperform state-of-the-art models such as DeepFM and xDeepFM.
REFERENCES
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[2] Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H Chi. 2018. Latent cross: Making use of context in recurrent recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 46–54.
[3] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
[4] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 933–941.
[5] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. Omnipress.
[6] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[7] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 1–9.
[8] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7132–7141.
[9] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
[10] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 43–50.
[11] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[12] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1754–1763.
[13] Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical gating networks for sequential recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 825–833.
[14] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1930–1939.
[15] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13). Association for Computing Machinery, New York, NY, USA, 1222–1230. https://doi.org/10.1145/2487575.2488200
[16] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
[17] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural collaborative filtering vs. matrix factorization revisited. arXiv preprint arXiv:2005.09683 (2020).
[18] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1161–1170.
[19] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387 (2015).
[20] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17. ACM, 12.
[21] Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617 (2017).
[22] Ruobing Xie, Cheng Ling, Yalong Wang, Rui Wang, Feng Xia, and Leyu Lin. 2020. Deep feedback network for recommendation. In Proceedings of IJCAI 2020. 2491–2497. https://doi.org/10.24963/ijcai.2020/345
[23] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field categorical data. In European Conference on Information Retrieval. Springer, 45–57.