xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, Guangzhong Sun
Jianxun Lian
University of Science and Technology of China
[email protected]
Xiaohuan Zhou
Beijing University of Posts and Telecommunications
[email protected]
Fuzheng Zhang
Microsoft Research
[email protected]
Zhongxia Chen
University of Science and Technology of China
[email protected]
Xing Xie
Microsoft Research
[email protected]
Guangzhong Sun
University of Science and Technology of China
[email protected]
ABSTRACT
Combinatorial features are essential for the success of many commercial models. Manually crafting these features usually comes with a high cost due to the variety, volume and velocity of raw data in web-scale systems. Factorization-based models, which measure interactions in terms of vector products, can learn patterns of combinatorial features automatically and generalize to unseen features as well. With the great success of deep neural networks (DNNs) in various fields, researchers have recently proposed several DNN-based factorization models to learn both low- and high-order feature interactions. Despite their powerful ability to learn an arbitrary function from data, plain DNNs generate feature interactions implicitly and at the bit-wise level. In this paper, we propose a novel Compressed Interaction Network (CIN), which aims to generate feature interactions in an explicit fashion and at the vector-wise level. We show that the CIN shares some functionalities with convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We further combine a CIN and a classical DNN into one unified model, and name this new model eXtreme Deep Factorization Machine (xDeepFM). On one hand, xDeepFM is able to learn certain bounded-degree feature interactions explicitly; on the other hand, it can learn arbitrary low- and high-order feature interactions implicitly. We conduct comprehensive experiments on three real-world datasets. Our results demonstrate that xDeepFM outperforms state-of-the-art models. We have released the source code of xDeepFM at https://github.com/Leavingseason/xDeepFM.
CCS CONCEPTS
• Information systems → Personalization; • Computing methodologies → Neural networks; Factorization methods;
KEYWORDS
Factorization machines, neural network, recommender systems, deep learning, feature interactions
ACM Reference Format:
Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. In KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3219819.3220023
1 INTRODUCTION
Features play a central role in the success of many predictive systems. Because using raw features rarely leads to optimal results, data scientists usually spend a lot of work on the transformation of raw features in order to generate the best predictive systems [14, 24] or to win data mining competitions [21, 22, 26]. One major type of feature transformation is the cross-product transformation over categorical features [5]. These features are called cross features or multi-way features; they measure the interactions of multiple raw features. For instance, a 3-way feature AND(user_organization=msra, item_category=deeplearning, time=monday) has value 1 if the user works at Microsoft Research Asia and is shown a technical article about deep learning on a Monday.

There are three major downsides to traditional cross-feature engineering. First, obtaining high-quality features comes with a high cost. Because the right features are usually task-specific, data scientists need to spend a lot of time exploring the potential patterns in the product data before they become domain experts and can extract meaningful cross features. Second, in large-scale predictive systems such as web-scale recommender systems, the huge number of raw features makes it infeasible to extract all cross features manually. Third, hand-crafted cross features do not generalize to interactions unseen in the training data. Therefore, learning to interact features without manual engineering is a meaningful task.

Factorization Machines (FM) [32] embed each feature i into a latent factor vector v_i = [v_{i1}, v_{i2}, ..., v_{iD}], and pairwise feature interactions are modeled as the inner product of latent vectors: f^{(2)}(i, j) = ⟨v_i, v_j⟩ x_i x_j. In this paper we use the term bit to denote an element (such as v_{i1}) of a latent vector. The classical FM can be extended to arbitrary higher-order feature interactions [2], but one major downside is that [2] proposes to model all feature interactions, including both useful and useless combinations. As revealed in [43], interactions with useless features may introduce noise and degrade the performance.
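For concreteness, the second-order part of FM can be written as the following small sketch (NumPy, with made-up sizes and values; purely illustrative):

```python
import numpy as np

# Minimal sketch of FM second-order interactions (illustrative only).
# m features, each with a D-dimensional latent vector; x holds feature values.
m, D = 4, 3
rng = np.random.default_rng(0)
V = rng.normal(size=(m, D))          # latent vectors v_i
x = np.array([1.0, 0.0, 1.0, 1.0])   # raw feature values (e.g., one-hot or numeric)

# f(i, j) = <v_i, v_j> * x_i * x_j, summed over all pairs i < j
second_order = sum(
    V[i] @ V[j] * x[i] * x[j]
    for i in range(m) for j in range(i + 1, m)
)
print(second_order)
```

Note that the interaction between features i and j is governed by the whole vectors v_i and v_j, i.e., it happens at the vector-wise level.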
In recent years, deep neural networks (DNNs) have become successful in computer vision, speech recognition, and natural language processing thanks to their great power of feature representation learning. It is therefore promising to exploit DNNs to learn sophisticated and selective feature interactions. [46] proposes a Factorisation-machine supported Neural Network (FNN) to learn high-order feature interactions. It uses pre-trained factorization machines for field embedding before applying a DNN. [31] further proposes a Product-based Neural Network (PNN), which introduces a product layer between the embedding layer and the DNN layer, and does not rely on a pre-trained FM. The major downside of FNN and PNN is that they focus more on high-order feature interactions while capturing few low-order interactions. The Wide&Deep [5] and DeepFM [9] models overcome this problem by introducing hybrid architectures, which contain a shallow component and a deep component with the purpose of learning both memorization and generalization. Therefore they can jointly learn low-order and high-order feature interactions.

All the abovementioned models leverage DNNs for learning high-order feature interactions. However, DNNs model high-order feature interactions in an implicit fashion. The final function learned by a DNN can be arbitrary, and there is no theoretical conclusion on what the maximum degree of feature interactions is. In addition, DNNs model feature interactions at the bit-wise level, which is different from the traditional FM framework, which models feature interactions at the vector-wise level. Thus, in the field of recommender systems, whether DNNs are indeed the most effective model for representing high-order feature interactions remains an open question. In this paper, we propose a neural network-based model to learn feature interactions in an explicit, vector-wise fashion. Our approach is based on the Deep & Cross Network (DCN) [40], which aims to efficiently capture feature interactions of bounded degrees. However, we will argue in Section 2.3 that DCN leads to a special format of interactions. We thus design a novel compressed interaction network (CIN) to replace the cross network in the DCN. CIN learns feature interactions explicitly, and the degree of interactions grows with the depth of the network. Following the spirit of the Wide&Deep and DeepFM models, we combine the explicit high-order interaction module with an implicit interaction module and a traditional FM module, and name the joint model eXtreme Deep Factorization Machine (xDeepFM). The new model requires no manual feature engineering and releases data scientists from tedious feature-searching work. To summarize, we make the following contributions:

• We propose a novel model, named eXtreme Deep Factorization Machine (xDeepFM), that jointly learns explicit and implicit high-order feature interactions effectively and requires no manual feature engineering.
• We design a compressed interaction network (CIN) in xDeepFM that learns high-order feature interactions explicitly. We show that the degree of feature interactions increases at each layer, and that features interact at the vector-wise level rather than at the bit-wise level.
• We conduct extensive experiments on three real-world datasets, and the results demonstrate that our xDeepFM outperforms several state-of-the-art models significantly.

The rest of this paper is organized as follows. Section 2 provides some preliminary knowledge which is necessary for understanding deep learning-based recommender systems. Section 3 introduces our proposed CIN and xDeepFM model in detail. We present experimental explorations on multiple datasets in Section 4. Related works are discussed in Section 5. Section 6 concludes this paper.
2 PRELIMINARIES
2.1 Embedding Layer
In computer vision or natural language understanding, the input data are usually images or textual signals, which are known to be spatially and/or temporally correlated, so DNNs can be applied directly on raw features with dense structures. However, in web-scale recommender systems, the input features are sparse, of huge dimension, and present no clear spatial or temporal correlation. Therefore, a multi-field categorical form is widely used by related works [9, 31, 37, 40, 46]. For example, one input instance

[user_id=s02, gender=male, organization=msra, interests=comedy&rock]

is normally transformed into high-dimensional sparse features via field-aware one-hot encoding:

[0, 1, 0, 0, ..., 0]   [1, 0]   [0, 1, 0, 0, ..., 0]   [0, 1, 0, 1, ..., 0]
      user_id          gender       organization            interests

An embedding layer is applied upon the raw feature input to compress it to a low-dimensional, dense real-valued vector. If the field is univalent, the feature embedding is used as the field embedding. Taking the above instance as an example, the embedding of feature male is taken as the embedding of field gender. If the field is multivalent, the sum of the feature embeddings is used as the field embedding. The embedding layer is illustrated in Figure 1. The result of the embedding layer is a wide concatenated vector:

e = [e_1, e_2, ..., e_m]

where m denotes the number of fields and e_i ∈ R^D denotes the embedding of one field. Although the feature lengths of instances can vary, their embeddings are of the same length m × D, where D is the dimension of the field embedding.

Figure 1: The field embedding layer. The dimension of embedding in this example is 4.
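To make the embedding step concrete, the following minimal NumPy sketch (field sizes and the instance are invented for illustration; this is not a production pipeline) maps a multi-field categorical instance to its field embedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # embedding dimension, as in Figure 1

# Invented vocabulary sizes per field.
fields = {"user_id": 1000, "gender": 2, "organization": 500, "interests": 30}
# One embedding table per field: shape (vocabulary size, D).
tables = {f: rng.normal(size=(n, D)) for f, n in fields.items()}

# An instance in multi-field categorical form (feature indices per field).
instance = {"user_id": [7], "gender": [1], "organization": [42], "interests": [3, 15]}

# Univalent field -> its single feature embedding; multivalent -> sum of embeddings.
field_embeddings = [tables[f][idx].sum(axis=0) for f, idx in instance.items()]
X0 = np.stack(field_embeddings)   # shape (m, D): one row per field
e = X0.reshape(-1)                # concatenated vector of length m * D
print(X0.shape, e.shape)          # (4, 4) (16,)
```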
2.2 Implicit High-order Interactions
FNN [46], Deep Crossing [37], and the deep part of Wide&Deep [5] exploit a feed-forward neural network on the field embedding vector e to learn high-order feature interactions. The forward process is:

x^1 = σ(W^{(1)} e + b^1)    (1)
x^k = σ(W^{(k)} x^{(k-1)} + b^k)    (2)

where k is the layer depth, σ is an activation function, and x^k is the output of the k-th layer. The visual structure is very similar to what is shown in Figure 2, except that these models do not include the FM or Product layer. This architecture models the interactions in a bit-wise fashion; that is to say, even the elements within the same field embedding vector influence each other.

PNN [31] and DeepFM [9] modify the above architecture slightly. Besides applying DNNs on the embedding vector e, they add a two-way interaction layer to the architecture. Therefore, both bit-wise and vector-wise interactions are included in their models. The major difference between PNN and DeepFM is that PNN connects the outputs of the product layer to the DNN, whereas DeepFM connects the FM layer directly to the output unit (refer to Figure 2).

Figure 2: The architecture of DeepFM (with linear part omitted) and PNN. We re-use the symbols in [9], where red edges represent weight-1 connections (no parameters) and gray edges represent normal connections (network parameters).

2.3 Explicit High-order Interactions
[40] proposes the Cross Network (CrossNet), whose architecture is shown in Figure 3. It aims to explicitly model high-order feature interactions. Unlike the classical fully-connected feed-forward network, its hidden layers are calculated by the following cross operation:

x_k = x_0 x_{k-1}^T w_k + b_k + x_{k-1}    (3)

where w_k, b_k, x_k ∈ R^{mD} are the weights, bias, and output of the k-th layer, respectively. We argue that the CrossNet learns a special type of high-order feature interactions, where each hidden layer of the CrossNet is a scalar multiple of x_0.

Theorem 2.1. Consider a k-layer cross network with the (i+1)-th layer defined as x_{i+1} = x_0 x_i^T w_{i+1} + x_i. Then, the output of the cross network x_k is a scalar multiple of x_0.

Figure 3: The architecture of the Cross Network.
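Before the formal proof, the claim can also be checked numerically. The following is a small sketch (random weights, biases omitted as in the theorem statement, all sizes invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                       # n = m * D, the length of x_0 (made up)
x = rng.normal(size=n)      # x_0
xk = x.copy()
for _ in range(4):          # four cross layers, Eq. (3) without the bias term
    w = rng.normal(size=n)
    xk = x * (xk @ w) + xk  # x_k = x_0 (x_{k-1}^T w_k) + x_{k-1}

# Every layer output is a scalar multiple of x_0:
# the element-wise ratio is constant across coordinates.
print(xk / x)
```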
Proof. When k = 1, according to the associative and distributive laws of matrix multiplication, we have:

x_1 = x_0 (x_0^T w_1) + x_0 = x_0 (x_0^T w_1 + 1) = α^1 x_0    (4)

where the scalar α^1 = x_0^T w_1 + 1. Thus, x_1 is a scalar multiple of x_0. Suppose the scalar-multiple statement holds for k = i.
For k = i + 1, we have:

x_{i+1} = x_0 x_i^T w_{i+1} + x_i = x_0 ((α^i x_0)^T w_{i+1}) + α^i x_0 = α^{i+1} x_0    (5)

where α^{i+1} = α^i (x_0^T w_{i+1} + 1) is a scalar. Thus x_{i+1} is still a scalar multiple of x_0. By induction, the output of the cross network x_k is a scalar multiple of x_0. □

Note that being a scalar multiple does not mean x_k is linear in x_0: the coefficient α^{i+1} is sensitive to x_0. The CrossNet can learn feature interactions very efficiently (its complexity is negligible compared with a DNN model); however, its downsides are: (1) the output of CrossNet is limited to a special form, with each hidden layer being a scalar multiple of x_0; (2) interactions come in a bit-wise fashion.

3 OUR PROPOSED MODEL
3.1 Compressed Interaction Network
We design a new cross network, named Compressed Interaction Network (CIN), with the following considerations: (1) interactions are applied at the vector-wise level, not at the bit-wise level; (2) high-order feature interactions are measured explicitly; (3) the complexity of the network does not grow exponentially with the degree of interactions.

Since an embedding vector is regarded as a unit for vector-wise interactions, hereafter we formulate the output of the field embedding layer as a matrix X^0 ∈ R^{m × D}, where the i-th row of X^0 is the embedding vector of the i-th field: X^0_{i,*} = e_i, and D is the dimension of the field embedding. The output of the k-th layer in CIN is also a matrix X^k ∈ R^{H_k × D}, where H_k denotes the number of (embedding) feature vectors in the k-th layer and we let H_0 = m. For each layer, X^k is calculated via:

X^k_{h,*} = Σ_{i=1}^{H_{k-1}} Σ_{j=1}^{m} W^{k,h}_{ij} (X^{k-1}_{i,*} ∘ X^0_{j,*})    (6)

where 1 ≤ h ≤ H_k, W^{k,h} ∈ R^{H_{k-1} × m} is the parameter matrix for the h-th feature vector, and ∘ denotes the Hadamard product, for example, ⟨a_1, a_2, a_3⟩ ∘ ⟨b_1, b_2, b_3⟩ = ⟨a_1 b_1, a_2 b_2, a_3 b_3⟩. Note that X^k is derived via the interactions between X^{k-1} and X^0, thus feature interactions are measured explicitly and the degree of interactions increases with the layer depth. The structure of CIN is very similar to that of the Recurrent Neural Network (RNN), where the outputs of the next hidden layer depend on the last hidden layer and an additional input. We hold the structure of embedding vectors at all layers; thus the interactions are applied at the vector-wise level.

It is interesting to point out that Equation 6 has strong connections with the well-known Convolutional Neural Networks (CNNs) in computer vision. As shown in Figure 4a, we introduce an intermediate tensor Z^{k+1}, which is the outer product (along each embedding dimension) of the hidden layer X^k and the original feature matrix X^0. Then Z^{k+1} can be regarded as a special type of image and W^{k+1,h} as a filter. We slide the filter across Z^{k+1} along the embedding dimension D, as shown in Figure 4b, and get a hidden vector X^{k+1}_{i,*}, which is usually called a feature map in computer vision. Therefore, X^k is a collection of H_k different feature maps.

Figure 4: Components and architecture of the Compressed Interaction Network (CIN). (a) Outer products along each dimension for feature interactions; the tensor Z^{k+1} is an intermediate result for further learning. (b) The k-th layer of CIN, which compresses the intermediate tensor Z^{k+1} to H_{k+1} embedding vectors (also known as feature maps). (c) An overview of the CIN architecture.
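For illustration only, the following NumPy sketch (made-up sizes and random parameters; not our released TensorFlow implementation) computes one CIN layer and checks numerically that the direct sum of Equation 6 coincides with the "intermediate tensor plus filter" view of Figure 4:

```python
import numpy as np

rng = np.random.default_rng(0)
m, D = 4, 4                              # number of fields and embedding dimension (made up)
H_prev, H_k = 3, 5                       # feature maps in layer k-1 and layer k (made up)

X0 = rng.normal(size=(m, D))             # field embedding matrix X^0
Xk_prev = rng.normal(size=(H_prev, D))   # previous CIN layer X^{k-1}
W = rng.normal(size=(H_k, H_prev, m))    # one (H_{k-1} x m) filter per feature map

# Direct form of Eq. (6): X^k_{h,*} = sum_{i,j} W^{k,h}_{ij} (X^{k-1}_{i,*} o X^0_{j,*})
Xk = np.zeros((H_k, D))
for h in range(H_k):
    for i in range(H_prev):
        for j in range(m):
            Xk[h] += W[h, i, j] * (Xk_prev[i] * X0[j])   # Hadamard product

# Equivalent "CNN" view: build the intermediate tensor of outer products
# along each embedding dimension, then apply each W^{k,h} as a filter.
Z = np.einsum('id,jd->dij', Xk_prev, X0)   # shape (D, H_{k-1}, m)
Xk_cnn = np.einsum('hij,dij->hd', W, Z)    # shape (H_k, D)

assert np.allclose(Xk, Xk_cnn)
print(Xk.shape)   # (5, 4): H_k feature maps, each a D-dimensional vector
```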
The term "compressed" in the name of CIN indicates that the k-th hidden layer compresses the potential space of H_{k-1} × m vectors down to H_k vectors.

Figure 4c provides an overview of the architecture of CIN. Let T denote the depth of the network. Every hidden layer X^k, k ∈ [1, T], has a connection with the output units. We first apply sum pooling on each feature map of the hidden layer:

p^k_i = Σ_{j=1}^{D} X^k_{i,j}    (7)

for i ∈ [1, H_k]. Thus, we have a pooling vector p^k = [p^k_1, p^k_2, ..., p^k_{H_k}] of length H_k for the k-th hidden layer. All pooling vectors from the hidden layers are concatenated before being connected to the output units: p^+ = [p^1, p^2, ..., p^T] ∈ R^{Σ_{i=1}^{T} H_i}. If we use CIN directly for binary classification, the output unit is a sigmoid node on p^+:

y = 1 / (1 + exp((p^+)^T w^o))    (8)

where w^o are the regression parameters.
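Putting Equations 6–8 together, a compact sketch of the full CIN forward pass for binary classification looks as follows (again with invented sizes and random parameters, for illustration only):

```python
import numpy as np

def cin_forward(X0, Ws, wo):
    """CIN forward pass. X0: (m, D) field embeddings; Ws[k]: (H_k, H_{k-1}, m)
    parameter tensor of layer k; wo: regression weights of length sum_k H_k."""
    pools, Xk = [], X0
    for W in Ws:                                   # Eq. (6) for each layer
        Xk = np.einsum('hij,id,jd->hd', W, Xk, X0)
        pools.append(Xk.sum(axis=1))               # Eq. (7): sum pooling over D
    p_plus = np.concatenate(pools)                 # concatenate pooling vectors
    return 1.0 / (1.0 + np.exp(p_plus @ wo))       # Eq. (8): sigmoid output unit

rng = np.random.default_rng(0)
m, D, H, T = 4, 4, 5, 3                            # invented sizes, H_k = H for all layers
sizes = [m] + [H] * T
Ws = [rng.normal(size=(sizes[k + 1], sizes[k], m)) for k in range(T)]
wo = rng.normal(size=T * H)
y = cin_forward(rng.normal(size=(m, D)), Ws, wo)
print(y)
```

In practice the layer parameters are learned jointly with the rest of the model; the sketch only illustrates the data flow.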
3.2 CIN Analysis
We analyze the proposed CIN to study its model complexity and potential effectiveness.

3.2.1 Space Complexity. The h-th feature map at the k-th layer contains H_{k-1} × m parameters, which is exactly the size of W^{k,h}. Thus, there are H_k × H_{k-1} × m parameters at the k-th layer. Considering the last regression layer for the output unit, which has Σ_{k=1}^{T} H_k parameters, the total number of parameters for CIN is Σ_{k=1}^{T} H_k × (1 + H_{k-1} × m). Note that CIN is independent of the embedding dimension D. In contrast, a plain T-layer DNN contains m × D × H_1 + H_T + Σ_{k=2}^{T} H_k × H_{k-1} parameters, and the number of parameters increases with the embedding dimension D.

Usually m and H_k will not be very large, so the scale of W^{k,h} is acceptable. When necessary, we can exploit an L-order decomposition and replace W^{k,h} with two smaller matrices U^{k,h} ∈ R^{H_{k-1} × L} and V^{k,h} ∈ R^{m × L}:

W^{k,h} = U^{k,h} (V^{k,h})^T    (9)

where L ≪ H and L ≪ m. Hereafter we assume that each hidden layer has the same number of feature maps (denoted H) for simplicity. Through the L-order decomposition, the space complexity of CIN is reduced from O(mTH^2) to O(mTHL + TH^2L). In contrast, the space complexity of the plain DNN is O(mDH + TH^2), which is sensitive to the dimension D of the field embedding.

3.2.2 Time Complexity. The cost of computing the tensor Z^{k+1} (as shown in Figure 4a) is O(mHD) time. Because we have H feature maps in one hidden layer, computing a T-layer CIN takes O(mH^2DT) time. A T-layer plain DNN, by contrast, takes O(mHD + TH^2) time. Therefore, the major downside of CIN lies in its time complexity.

3.2.3 Polynomial Approximation. Next we examine the high-order interaction properties of CIN. For simplicity, we assume that the numbers of feature maps at the hidden layers are all equal to the number of fields m. Let [m] denote the set of positive integers that are less than or equal to m. The h-th feature map at the first layer, denoted as x^1_h ∈ R^D, is calculated via:

x^1_h = Σ_{i ∈ [m], j ∈ [m]} W^{1,h}_{i,j} (x^0_i ∘ x^0_j)    (10)

Therefore, each feature map at the first layer models pair-wise interactions with O(m^2) coefficients. Similarly, the h-th feature map at the second layer is:

x^2_h = Σ_{i ∈ [m], j ∈ [m]} W^{2,h}_{i,j} (x^1_i ∘ x^0_j)
      = Σ_{i ∈ [m], j ∈ [m]} Σ_{l ∈ [m], k ∈ [m]} W^{2,h}_{i,j} W^{1,i}_{l,k} (x^0_j ∘ x^0_k ∘ x^0_l)    (11)

Note that all calculations related to the subscripts l and k were already finished at the previous hidden layer; we expand the factors in Equation 11 only for clarity. We can observe that each feature map at the second layer models 3-way interactions with O(m^2) new parameters.

A classical k-order polynomial has O(m^k) coefficients. We show that CIN approximates this class of polynomials with only O(km^3) parameters in terms of a chain of feature maps. By induction, the h-th feature map at the k-th layer is:

x^k_h = Σ_{i ∈ [m], j ∈ [m]} W^{k,h}_{i,j} (x^{k-1}_i ∘ x^0_j)
      = Σ_{i ∈ [m], j ∈ [m]} ... Σ_{r ∈ [m], t ∈ [m]} Σ_{l ∈ [m], s ∈ [m]} W^{k,h}_{i,j} ... W^{1,r}_{l,s} (x^0_j ∘ ... ∘ x^0_s ∘ x^0_l)    (12)

where the Hadamard product on the right-hand side involves k vectors.
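The expansion in Equation 11 can also be verified numerically. The following sketch (random parameters, invented sizes) checks that stacking two CIN layers is equivalent to a direct sum over 3-way Hadamard products with composed coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
m, D = 3, 4                        # feature maps per layer equal to m, as assumed above
X0 = rng.normal(size=(m, D))
W1 = rng.normal(size=(m, m, m))    # W^{1,h}_{l,k}
W2 = rng.normal(size=(m, m, m))    # W^{2,h}_{i,j}

# Layer-by-layer computation (Eq. 10 then Eq. 6).
X1 = np.einsum('ilk,ld,kd->id', W1, X0, X0)
X2_layered = np.einsum('hij,id,jd->hd', W2, X1, X0)

# Direct expansion into 3-way Hadamard products (second line of Eq. 11).
X2_expanded = np.einsum('hij,ilk,jd,kd,ld->hd', W2, W1, X0, X0, X0)

assert np.allclose(X2_layered, X2_expanded)
print(X2_layered.shape)   # (m, D): each row models 3-way interactions
```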
For better illustration, we borrow the notation from [40]. Let α = [α_1, ..., α_m] ∈ N^m denote a multi-index, and let |α| = Σ_{i=1}^{m} α_i. We omit the superscript 0 from x^0_i and write x_i, since only the feature maps from the 0-th layer (which are exactly the field embeddings) appear in the final expanded expression (refer to Eq. 12). A superscript now denotes a repeated vector operation, such as x_i^3 = x_i ∘ x_i ∘ x_i. Let VP_k(X) denote the family of multi-vector polynomials of degree k:

VP_k(X) = { Σ_α w_α x_1^{α_1} ∘ x_2^{α_2} ∘ ... ∘ x_m^{α_m} | 2 ≤ |α| ≤ k }    (13)

Each vector polynomial in this class has O(m^k) coefficients. Then, our CIN approaches the coefficient w_α with:

ŵ_α = Σ_{i=1}^{m} Σ_{j=1}^{m} Σ_{B ∈ P_α} Π_{t=2}^{|α|} W^{t,j}_{i,B_t}    (14)

where B = [B_1, B_2, ..., B_{|α|}] is a multi-index, and P_α is the set of all permutations of the indices (1, ..., 1, ..., m, ..., m) in which each index i appears α_i times.

3.3 Combination with Implicit Networks
As discussed in Section 2.2, plain DNNs learn implicit high-order feature interactions. Since CIN and plain DNNs can complement each other, an intuitive way to make the model stronger is to combine the two structures. The resulting model is very similar to the Wide&Deep or DeepFM model. The architecture is shown in Figure 5. We name the new model eXtreme Deep Factorization Machine (xDeepFM), considering that, on one hand, it includes both low-order and high-order feature interactions; on the other hand, it includes both implicit and explicit feature interactions. The resulting output unit is:

ŷ = σ(w_linear^T a + w_dnn^T x_dnn^k + w_cin^T p^+ + b)    (15)

where σ is the sigmoid function and a denotes the raw features. x_dnn^k and p^+ are the outputs of the plain DNN and the CIN, respectively. w_* and b are learnable parameters. For binary classification, the loss function is the log loss:

L = -(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]    (16)

where N is the total number of training instances. The optimization process minimizes the following objective function:

J = L + λ_* ||Θ||    (17)

where λ_* denotes the regularization term and Θ denotes the set of parameters, including those in the linear part, the CIN part, and the DNN part.

Figure 5: The architecture of xDeepFM.
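A minimal sketch of the output unit (Eq. 15) and the log loss (Eq. 16); the vectors and weights below are random placeholders rather than trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xdeepfm_output(a, x_dnn, p_plus, w_lin, w_dnn, w_cin, b):
    """Output unit of Eq. (15): combine the linear, DNN, and CIN parts."""
    return sigmoid(w_lin @ a + w_dnn @ x_dnn + w_cin @ p_plus + b)

def log_loss(y_true, y_pred, eps=1e-12):
    """Binary log loss of Eq. (16)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Toy check with invented dimensions and random values.
rng = np.random.default_rng(0)
a, x_dnn, p_plus = rng.normal(size=50), rng.normal(size=32), rng.normal(size=15)
y_hat = xdeepfm_output(a, x_dnn, p_plus,
                       rng.normal(size=50), rng.normal(size=32),
                       rng.normal(size=15), 0.1)
print(y_hat, log_loss(np.array([1.0]), np.array([y_hat])))
```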
3.4 Relationship with FM and DeepFM
Suppose all fields are univalent. It is not hard to observe from Figure 5 that, when the depth and the number of feature maps of the CIN part are both set to 1, xDeepFM is a generalization of DeepFM that additionally learns linear regression weights for the FM layer (note that in DeepFM, the units of the FM layer are directly linked to the output unit without any coefficients). When we further remove the DNN part and use a constant sum filter (which simply takes the sum of its inputs without any parameters to learn) for the feature map, xDeepFM is downgraded to the traditional FM model.
4 EXPERIMENTS
In this section, we conduct extensive experiments to answer the following questions:
• (Q1) How does our proposed CIN perform in high-order feature interaction learning?
• (Q2) Is it necessary to combine explicit and implicit high-order feature interactions for recommender systems?
• (Q3) How do the network settings influence the performance of xDeepFM?
We will answer these questions after presenting some fundamental experimental settings.
We evaluate our proposed models on the following three datasets:
1. Criteo Dataset. It is a famous industry benchmarking dataset for developing models that predict ad click-through rate, and is publicly accessible (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/). Given a user and the page he is visiting, the goal is to predict the probability that he will click on a given ad.
2. Dianping Dataset. Dianping.com is the largest consumer review site in China. It provides diverse functions such as reviews, check-ins, and shops' meta information (including geographical messages and shop attributes). We collect six months of users' check-in activities for restaurant recommendation experiments. Given a user's profile, a restaurant's attributes, and the user's last three visited POIs (points of interest), we want to predict the probability that he will visit the restaurant. For each restaurant in a user's check-in instance, we sample four restaurants which are within 3 kilometers, by POI popularity, as negative instances.
3. Bing News Dataset. Bing News is part of Microsoft's Bing search engine. In order to evaluate the performance of our model on a real commercial dataset, we collect five consecutive days of impression logs from the news reading service. We use the first three days' data for training and validation, and the next two days for testing. For the Criteo dataset and the Dianping dataset, we randomly split instances by 8:1:1 for training, validation, and test. The characteristics of the three datasets are summarized in Table 1.

Table 1: Statistics of the evaluation datasets. M indicates million and K indicates thousand.
We use two metrics for model evaluation: AUC (Area Under the ROC curve) and Logloss (cross entropy). These two metrics evaluate the performance from two different angles: AUC measures the probability that a positive instance will be ranked higher than a randomly chosen negative one. It only takes into account the order of predicted instances and is insensitive to the class imbalance problem. Logloss, in contrast, measures the distance between the predicted score and the true label of each instance. Sometimes we rely more on Logloss because we need to use the predicted probability to estimate the benefit of a ranking strategy (which is usually adjusted as CTR × bid).
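Both metrics can be computed with standard routines; for example (the labels and predicted probabilities below are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

# Made-up labels and predicted click probabilities for five impressions.
y_true = np.array([1, 0, 0, 1, 0])
y_pred = np.array([0.81, 0.35, 0.10, 0.62, 0.47])

print("AUC:", roc_auc_score(y_true, y_pred))
print("Logloss:", log_loss(y_true, y_pred))
```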
We compare our xDeepFM with LR (logistic regression), FM, DNN (plain deep neural network), PNN (choosing the better one of iPNN and oPNN) [31], Wide&Deep [5], DCN (Deep & Cross Network) [40], and DeepFM [9]. As introduced and discussed in Section 2, these models are highly related to our xDeepFM, and some of them are state-of-the-art models for recommender systems. Note that the focus of this paper is to learn feature interactions automatically, so we do not include any hand-crafted cross features.

We implement our method using TensorFlow. Hyper-parameters of each model are tuned by grid search on the validation set, and the best settings for each model are reported in the corresponding sections. For optimization we use Adam [16] with a mini-batch size of 4096, together with L2 regularization; experiments are run on Tesla K80 GPUs. The source code is available at https://github.com/Leavingseason/xDeepFM.

Table 2: Performance of individual models on the Criteo, Dianping, and Bing News datasets. Column Depth indicates the best network depth for each model.

Model name   AUC     Logloss  Depth
Criteo
FM           0.7900  0.4592   -
DNN          0.7993  0.4491   2
CrossNet     0.7961  0.4508   3
CIN
Table 3: Overall performance of different models on the Criteo, Dianping and Bing News datasets. The column Depth presents the best setting for network depth with a format of (cross layers, DNN layers).

                    Criteo                    Dianping                  Bing News
Model name   AUC     Logloss  Depth    AUC     Logloss  Depth    AUC     Logloss  Depth
LR           0.7577  0.4854   -,-      0.8018  0.3608   -,-      0.7988  0.2950   -,-
FM           0.7900  0.4592   -,-      0.8165  0.3558   -,-      0.8223  0.2779   -,-
DNN          0.7993  0.4491   -,2      0.8318  0.3382   -,3      0.8366  0.2730   -,2
DCN          0.8026  0.4467   2,2      0.8391  0.3379   4,3      0.8379  0.2677   2,2
Wide&Deep    0.8000  0.4490   -,3      0.8361  0.3364   -,2      0.8377  0.2668   -,2
PNN          0.8038  0.4927   -,2      0.8445  0.3424   -,3      0.8321  0.2775   -,3
DeepFM       0.8025  0.4468   -,2      0.8481  0.3333   -,2      0.8376  0.2671   -,3
xDeepFM
4.2 Performance Comparison with Individual Models (Q1)
We want to know how CIN performs individually. Note that FM measures 2-order feature interactions explicitly, DNNs model high-order feature interactions implicitly, CrossNet tries to model high-order feature interactions with a small number of parameters (which is proven not effective in Section 2.3), and CIN models high-order feature interactions explicitly. There is no theoretical guarantee of the superiority of one individual model over the others, since it really depends on the dataset. For example, if a practical dataset does not require high-order feature interactions, FM may be the best individual model. Thus we do not have any expectation for which model will perform the best in this experiment.

Table 2 shows the results of individual models on the three practical datasets. Surprisingly, our CIN outperforms the other models consistently. On one hand, the results indicate that for practical datasets, higher-order interactions over sparse features are necessary, and this can be verified through the fact that DNN, CrossNet and CIN outperform FM significantly on all three datasets. On the other hand, CIN is the best individual model, which demonstrates the effectiveness of CIN in modeling explicit high-order feature interactions. Note that a k-layer CIN can model k-degree feature interactions. It is also interesting to see that it takes 5 layers for CIN to yield the best result on the Bing News dataset.

4.3 Performance Comparison with Integrated Models (Q2)
xDeepFM integrates CIN and DNN into an end-to-end model. While CIN and DNN cover two distinct properties in learning feature interactions, we are interested to know whether it is indeed necessary and effective to combine them for jointly explicit and implicit learning. Here we compare several strong baselines which are not limited to individual models, and the results are shown in Table 3. We observe that LR is far worse than all the other models, which demonstrates that factorization-based models are essential for measuring sparse features. Wide&Deep, DCN, DeepFM and xDeepFM are significantly better than DNN, which directly reflects that, despite their simplicity, incorporating hybrid components is important for boosting the accuracy of predictive systems. Our proposed xDeepFM achieves the best performance on all datasets, which demonstrates that combining explicit and implicit high-order feature interactions is necessary, and that xDeepFM is effective in learning this class of combination. Another interesting observation is that none of the neural-based models require a very deep network structure for the best performance. Typical settings for the depth hyper-parameter are 2 and 3, and the best depth setting for xDeepFM is 3, which indicates that the interactions we learn are at most 4-order.

4.4 Hyper-Parameter Study (Q3)
We study the impact of hyper-parameters on xDeepFM in this section, including (1) the number of hidden layers; (2) the number of neurons per layer; and (3) activation functions. We conduct experiments by holding the best settings for the DNN part while varying the settings of the CIN part.
Depth of Network. Figures 6a and 7a demonstrate the impact of the number of hidden layers. We can observe that the performance of xDeepFM increases with the depth of the network at the beginning. However, model performance degrades when the depth of the network is set greater than 3. This is caused by overfitting, as evidenced by the fact that the training loss still keeps decreasing when we add more hidden layers.
Number of Neurons per Layer. Increasing the number of neurons per layer means increasing the number of feature maps in CIN. As shown in Figures 6b and 7b, model performance on the Bing News dataset increases steadily when we increase the number of neurons from 20 to 200, while on the Dianping dataset, 100 is a more suitable setting for the number of neurons per layer. In this experiment we fix the depth of the network at 3.
Activation Function. Note that we use the identity function as the activation on the neurons of CIN, as shown in Eq. 6. A common practice in the deep learning literature is to employ non-linear activation functions on hidden neurons. We thus compare the results of different activation functions on CIN (for neurons in the DNN, we keep relu as the activation function). As shown in Figures 6c and 7c, the identity function is indeed the most suitable one for the neurons of CIN.
Figure 6: Impact of network hyper-parameters on AUC performance on Dianping and Bing News: (a) number of layers; (b) number of neurons per layer; (c) activation functions.

Figure 7: Impact of network hyper-parameters on Logloss performance on Dianping and Bing News: (a) number of layers; (b) number of neurons per layer; (c) activation functions.
5 RELATED WORK
For web-scale recommender systems (RSs), the input features are usually sparse, categorical-continuous-mixed, and high-dimensional. Linear models, such as logistic regression with FTRL [27], are widely adopted as they are easy to manage, maintain, and deploy. Because linear models lack the ability to learn feature interactions, data scientists have to spend a lot of work on engineering cross features in order to achieve better performance [22, 35]. Considering that some hidden features are hard to design manually, some researchers exploit boosting decision trees to help build feature transformations [14, 25].
A major downside of the aforementioned models is that they cannot generalize to feature interactions unseen in the training set. Factorization Machines [32] overcome this problem by embedding each feature into a low-dimensional latent vector. Matrix factorization (MF) [18], which only considers IDs as features, can be regarded as a special kind of FM. Recommendations are made via the product of two latent vectors, so MF does not require the co-occurrence of a user and an item in the training set. MF is the most popular model-based collaborative filtering method in the RS literature [17, 20, 30, 38]. [4, 28] extend MF to leverage side information, in which both a linear model and an MF model are included. On the other hand, for many recommender systems, only implicit feedback datasets, such as users' watching history and browsing activities, are available. Thus researchers have extended the factorization models to a Bayesian Personalized Ranking (BPR) framework [11, 33, 34, 44] for implicit feedback.
Deep learning techniques have achieved great success in computer vision [10, 19], speech recognition [1, 15] and natural language understanding [6, 29]. As a result, an increasing number of researchers are interested in employing DNNs for recommender systems.
To avoid manually building up high-order cross features, researchers apply DNNs on field embeddings, so that patterns of categorical feature interactions can be learned automatically. Representative models include FNN [46], PNN [31], DeepCross [37], NFM [12], DCN [40], Wide&Deep [5], and DeepFM [9]. These models are highly related to our proposed xDeepFM. Since we have reviewed them in Section 1 and Section 2, we do not discuss them further in this section. We have demonstrated that our proposed xDeepFM has two special properties in comparison with these models: (1) xDeepFM learns high-order feature interactions in both explicit and implicit fashions; (2) xDeepFM learns feature interactions at the vector-wise level rather than at the bit-wise level.
We include some other deep learning-based RSs in this section because they are less focused on learning feature interactions. Some early work employs deep learning mainly to model auxiliary information, such as visual data [11] and audio data [41]. Recently, deep neural networks have been used to model collaborative filtering (CF) in RSs. [13] proposes Neural Collaborative Filtering (NCF), so that the inner product in MF can be replaced with an arbitrary function via a neural architecture. [36, 42] model CF based on the autoencoder paradigm, and they have empirically demonstrated that autoencoder-based CF outperforms several classical MF models. Autoencoders can be further employed for jointly modeling
CF and side information with the purpose of generating better latent factors [7, 39, 45]. [8, 23] employ neural networks to jointly train the latent factors of multiple domains. [3] proposes Attentive Collaborative Filtering (ACF) to learn more elaborate preferences at both the item level and the component level. [47] shows that traditional RSs cannot capture interest diversity and local activation effectively, so it introduces a Deep Interest Network (DIN) to represent users' diverse interests with an attentive activation mechanism.
6 CONCLUSIONS
In this paper, we propose a novel network named Compressed Interaction Network (CIN), which aims to learn high-order feature interactions explicitly. CIN has two special virtues: (1) it can learn certain bounded-degree feature interactions effectively; (2) it learns feature interactions at the vector-wise level. Following the spirit of several popular models, we incorporate a CIN and a DNN in an end-to-end framework, and name the resulting model eXtreme Deep Factorization Machine (xDeepFM). xDeepFM can thus automatically learn high-order feature interactions in both explicit and implicit fashions, which is of great significance for reducing manual feature engineering work. We conduct comprehensive experiments, and the results demonstrate that our xDeepFM outperforms state-of-the-art models consistently on three real-world datasets.

There are two directions for future work. First, we currently employ a simple sum pooling for embedding multivalent fields. We can explore the use of the DIN mechanism [47] to capture the related activation according to the candidate item. Second, as discussed in Section 3.2.2, the time complexity of the CIN module is high. We are interested in developing a distributed version of xDeepFM which can be trained efficiently on a GPU cluster.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous reviewers for their insightful reviews, which were very helpful for the revision of this paper. This work is supported in part by the Youth Innovation Promotion Association of CAS.
REFERENCES
[1] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning. 173–182.
[2] Mathieu Blondel, Akinori Fujino, Naonori Ueda, and Masakazu Ishihata. 2016. Higher-order factorization machines. In Advances in Neural Information Processing Systems. 3351–3359.
[3] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 335–344.
[4] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. 2012. SVDFeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research 13, Dec (2012), 3619–3622.
[5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
[6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[7] Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, and Fangxi Zhang. 2017. A Hybrid Collaborative Filtering Model with Deep Structure for Recommender Systems. In AAAI. 1309–1315.
[8] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 278–288.
[9] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[11] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI. 144–150.
[12] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 355–364.
[13] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
[14] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 1–9.
[15] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[16] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[17] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[18] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[20] Joonseok Lee, Seungyeon Kim, Guy Lebanon, and Yoram Singer. 2013. Local low-rank matrix approximation. In International Conference on Machine Learning. 82–90.
[21] Jianxun Lian and Xing Xie. 2016. Cross-Device User Matching Based on Massive Browse Logs: The Runner-Up Solution for the 2016 CIKM Cup. arXiv preprint arXiv:1610.03928 (2016).
[22] Jianxun Lian, Fuzheng Zhang, Min Hou, Hongwei Wang, Xing Xie, and Guangzhong Sun. 2017. Practical Lessons for Job Recommendations in the Cold-Start Scenario. In Proceedings of the Recommender Systems Challenge 2017 (RecSys Challenge '17). ACM, New York, NY, USA, Article 4, 6 pages. https://doi.org/10.1145/3124791.3124794
[23] Jianxun Lian, Fuzheng Zhang, Xing Xie, and Guangzhong Sun. 2017. CCCFNet: a content-boosted collaborative filtering neural network for cross domain recommender systems. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 817–818.
[24] Jianxun Lian, Fuzheng Zhang, Xing Xie, and Guangzhong Sun. 2017. Restaurant Survival Analysis with Heterogeneous Information. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 993–1002.
[25] Xiaoliang Ling, Weiwei Deng, Chen Gu, Hucheng Zhou, Cui Li, and Feng Sun. 2017. Model Ensemble for Click Prediction in Bing Search Ads. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 689–698.
[26] Guimei Liu, Tam T Nguyen, Gang Zhao, Wei Zha, Jianbo Yang, Jianneng Cao, Min Wu, Peilin Zhao, and Wei Chen. 2016. Repeat buyer prediction for e-commerce. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 155–164.
[27] H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1222–1230.
[28] Aditya Krishna Menon and Charles Elkan. 2010. A log-linear model with latent features for dyadic prediction. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 364–373.
[29] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
[30] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 502–511.
[31] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 1149–1154.
[32] Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 995–1000.
[33] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
[34] Steffen Rendle and Lars Schmidt-Thieme. 2010. Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 81–90.
[35] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web. ACM, 521–530.
[36] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. ACM, 111–112.
[37] Ying Shan, T Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 255–262.
[38] Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. 2005. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems. 1329–1336.
[39] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
[40] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. arXiv preprint arXiv:1708.05123 (2017).
[41] Xinxi Wang and Ye Wang. 2014. Improving content-based and hybrid music recommendation using deep learning. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 627–636.
[42] Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 153–162.
[43] Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017. 3119–3125. https://doi.org/10.24963/ijcai.2017/435
[44] Fajie Yuan, Guibing Guo, Joemon M Jose, Long Chen, Haitao Yu, and Weinan Zhang. 2016. Lambdafm: learning optimal ranking with factorization machines using lambda surrogates. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 227–236.
[45] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
[46] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field categorical data. In European Conference on Information Retrieval. Springer, 45–57.
[47] Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Xingya Dai, Han Zhu, Junqi Jin, Han Li, and Kun Gai. 2017. Deep interest network for click-through rate prediction. arXiv preprint arXiv:1706.06978 (2017).